Why your ML/Data Science teams should be using feature stores

Abstract image of lights and shadows

As companies mature in their use of artificial intelligence to gain a competitive edge, the need for a feature store keeps growing. If you don’t know what a feature store is, read on to learn why it matters and how it can help your ML/Data Science teams.

What is a feature store?

A feature store is a data management system that lets companies discover and access the data they need to train and run their models. It tracks the entire lifecycle of the data used by models, from ingestion to serving. It also makes it easy to reuse features across projects and lets teams run their data pipelines on a standardized architecture. That makes the feature store an important tool if you want to develop machine learning systems faster and keep them easier to maintain.

Can your company live without it?

The short answer is yes, but think it through carefully first. Some of the problems that come from not having a feature store at the center of machine learning development include:

Unnecessary complexity during deployment

To understand this point, you need to know that machine learning consumes data at two different moments: training time and inference time. At training time, pipelines consume data in batches, and most traditional databases and big data tools serve this purpose well. Inference time is different. In many cases it’s not a good idea to rely on a batch source at inference time, because predictions often need fresh feature values with low latency. The exception is batch prediction, where a batch source is fine. Without a feature store in place, your teams will need to set up a different solution for each new project.
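
To make the two access patterns concrete, here is a minimal in-memory sketch in Python. The class `TinyFeatureStore`, its method names, and the `orders_7d` feature are all hypothetical and exist only for illustration. A real feature store would back the offline side with a warehouse or data lake and the online side with a low-latency key-value store.

```python
class TinyFeatureStore:
    """Toy store (hypothetical): full history for training, latest value for inference."""

    def __init__(self):
        self._offline = []  # append-only history of (entity_id, timestamp, features)
        self._online = {}   # entity_id -> (timestamp, features), latest row only

    def ingest(self, entity_id, timestamp, features):
        """Write one feature row to both the offline and the online side."""
        self._offline.append((entity_id, timestamp, dict(features)))
        current = self._online.get(entity_id)
        # Keep only the most recent row per entity for fast single-key lookups.
        if current is None or timestamp >= current[0]:
            self._online[entity_id] = (timestamp, dict(features))

    def get_training_batch(self, start, end):
        """Batch read for training: every row with start <= timestamp < end."""
        return [row for row in self._offline if start <= row[1] < end]

    def get_online_features(self, entity_id):
        """Single-key read for inference: latest features for one entity."""
        entry = self._online.get(entity_id)
        return None if entry is None else dict(entry[1])


store = TinyFeatureStore()
store.ingest("user_1", 1, {"orders_7d": 2})
store.ingest("user_1", 5, {"orders_7d": 4})
store.ingest("user_2", 3, {"orders_7d": 1})

training_rows = store.get_training_batch(start=0, end=10)  # all three rows
latest = store.get_online_features("user_1")               # latest row for user_1
```

The point of the sketch is that both reads come from a single ingestion path, so a new project does not have to build its own serving solution.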

Difficulty in debugging models in production

A good feature store has point-in-time correctness built in. This lets you go back in time and retrieve the same data that was used to train a model already deployed in production. It is very useful when a new model is not performing as expected and it’s not clear where the problem is.
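
One way to picture a point-in-time lookup is as a join that, for every training example, picks the latest feature value recorded at or before that example's timestamp. The sketch below uses only the Python standard library. The function name `point_in_time_join` and the tuple layouts are assumptions made for this example, not the API of any particular feature store.

```python
from bisect import bisect_right


def point_in_time_join(label_events, feature_history):
    """For each (entity_id, event_time, label), attach the latest feature value
    recorded at or before event_time, so no future data leaks into training."""
    # Group the feature history per entity, sorted by timestamp.
    by_entity = {}
    for entity_id, ts, value in sorted(feature_history, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity_id, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity_id, event_time, label in label_events:
        times, values = by_entity.get(entity_id, ([], []))
        # Index of the last feature timestamp that is <= event_time.
        idx = bisect_right(times, event_time) - 1
        feature = values[idx] if idx >= 0 else None
        joined.append((entity_id, event_time, feature, label))
    return joined


# Hypothetical data: (entity_id, timestamp, feature_value)
feature_history = [("user_1", 1, 2), ("user_1", 5, 4), ("user_2", 3, 1)]
# Hypothetical labels: (entity_id, event_time, label)
label_events = [("user_1", 4, 0), ("user_1", 6, 1), ("user_2", 2, 0)]

training_set = point_in_time_join(label_events, feature_history)
```

Here the example for `user_1` at time 4 gets the value written at time 1, not the later one from time 5, and `user_2` at time 2 gets `None` because no feature existed yet. Re-running the same join later reproduces the same training set, which is what makes debugging a deployed model tractable.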

Increased cost due to reprocessing of same features

Without a feature store, most teams keep their features dynamic, meaning they’re recalculated at every training run. Each time a new model is trained, the same features are computed again, which wastes computing time and money. It also introduces a new problem: each new version of a feature may introduce new bugs.
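
The alternative is to compute a feature once, store the result, and let every model read the stored values. Here is a rough sketch of that idea. `MaterializedFeatures` and the `total_spend` feature are hypothetical names, and the sketch assumes the raw data does not change between calls. A real system would also track data freshness and persist the results.

```python
class MaterializedFeatures:
    """Toy registry (hypothetical): computes each (name, version) once, then reuses it."""

    def __init__(self):
        self._definitions = {}  # (name, version) -> function that computes the feature
        self._values = {}       # (name, version) -> materialized result
        self.compute_calls = 0  # how many times a feature was actually computed

    def register(self, name, version, compute_fn):
        self._definitions[(name, version)] = compute_fn

    def get(self, name, version, raw_rows):
        key = (name, version)
        if key not in self._values:
            # First request: compute and store. Later requests reuse the result.
            self.compute_calls += 1
            self._values[key] = self._definitions[key](raw_rows)
        return self._values[key]


def total_spend(rows):
    """Sum the order amounts per user."""
    totals = {}
    for user, amount in rows:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


raw_orders = [("user_1", 20.0), ("user_1", 30.0), ("user_2", 5.0)]

features = MaterializedFeatures()
features.register("total_spend", 1, total_spend)

for_model_a = features.get("total_spend", 1, raw_orders)  # computed here
for_model_b = features.get("total_spend", 1, raw_orders)  # reused, not recomputed
```

Keying on an explicit version also means two models that ask for `total_spend` version 1 get exactly the same logic, instead of two slightly different reimplementations.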

Feature drift

This is the last point, and one of the most important. Feature drift happens when the distribution of the data changes after training: the model never saw the new data, so its performance degrades until it is retrained. With a feature store, you can monitor features and automate retraining when they drift. This is a powerful way to keep your machine learning as effective as it can be.
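
As an illustration of what feature monitoring can look like, the sketch below compares a training-time sample of one numeric feature with a recent sample using the population stability index (PSI), one common drift metric among several. The function names, the equal-width binning, and the 0.2 threshold (a widely used rule of thumb) are assumptions for this example, not something a specific feature store prescribes.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the training range into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps log() defined when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(expected, actual, threshold=0.2):
    """Flag the feature as drifted when PSI exceeds the (assumed) threshold."""
    return population_stability_index(expected, actual) > threshold


training_sample = [i / 100 for i in range(100)]      # values seen at training time
recent_sample = [v + 0.5 for v in training_sample]   # same shape, shifted upward

drifted = should_retrain(training_sample, recent_sample)
unchanged = should_retrain(training_sample, training_sample)
```

A scheduled job could run a check like this for each stored feature and trigger a retraining pipeline when `should_retrain` returns `True`.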

Final Thoughts

Since many companies are still early in their adoption of AI, it’s better to start with solid foundations that let teams move faster and be more productive in a sustainable way. The feature store is one of the most important aspects, if not the most important, of enterprise machine learning development. It’s a relatively new concept that was not very popular in academia, because projects there follow different dynamics than projects in industry. As companies apply state-of-the-art research to their businesses, they need to make sure those businesses remain competitive and efficient.

Originally published at https://medium.com on April 17, 2022.