Feature Stores Demystified

What is a feature store, do you really need one, and if so, who should own it?

We all know data is usually the biggest challenge in any ML project. Feature stores aim to enable machine learning applications by centralizing access to high-quality data in both a batch and real-time fashion.

Let’s break down that definition:

  • enable machine learning applications - feature stores are built to provide data for machine learning model training and inference. Although they are now finding a role in Generative AI applications as well, their primary focus has been machine learning.

  • centralizing access to high-quality data - finding the right data for your use case is often one of the most time-consuming tasks in data science. Feature stores give data scientists and ML engineers one place to go to find approved, high-quality data to kickstart model development. They also promote reuse of those features, leading to more consistent calculations and less time spent building duplicate data pipelines.

  • in both a batch and real-time fashion - feature stores typically have two interfaces: an offline store used for training and batch prediction, and an online store that powers real-time applications through low-latency feature retrieval. Feature stores also keep data in sync between the offline and online stores, meaning the data is consistent between training and inference, mitigating the risk of training/serving skew.
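The training/serving-skew point is worth making concrete. Here's a minimal sketch in plain Python, with a hypothetical `orders_last_30d` feature (not from the original): defining the feature computation once and reusing it for both the offline (training) path and the online (serving) path is what keeps the two consistent.

```python
from datetime import datetime, timedelta

# Hypothetical feature: "orders in the last 30 days" per customer.
# Defining the computation once and reusing it offline and online
# is what mitigates training/serving skew.
def orders_last_30d(orders, customer_id, as_of):
    cutoff = as_of - timedelta(days=30)
    return sum(
        1 for o in orders
        if o["customer_id"] == customer_id and cutoff <= o["ts"] <= as_of
    )

orders = [
    {"customer_id": "c1", "ts": datetime(2024, 5, 1)},
    {"customer_id": "c1", "ts": datetime(2024, 5, 20)},
    {"customer_id": "c1", "ts": datetime(2024, 3, 1)},  # older than 30 days
]

# Offline path: compute the feature as of a historical training timestamp.
train_value = orders_last_30d(orders, "c1", datetime(2024, 5, 25))
# Online path: the exact same function backs real-time serving.
serve_value = orders_last_30d(orders, "c1", datetime(2024, 5, 25))
assert train_value == serve_value == 2
```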

Feature Store Concepts

So, feature stores provide data to ML applications… seems pretty simple. Create some parquet files in blob storage, then sync that data into Redis and we’re done?

Well, that’s not far off. But let’s walk through all the components and concepts of a feature store so the ancillary benefits are clear. Most of these concepts exist in every feature store, though some products abstract parts of them away from you. The terms I’m using below are consistent with Feast, a popular open-source feature store.

Diagram of a feature store’s architecture

Entity

An entity is essentially a business object that features will belong to. With feature stores, you will create an entity for things like ‘customer’ or ‘product’ and then tie features to that entity.

Feature View

Feature Views are the main way you will interact with the store to fetch features. They are a group of features that typically belong to an entity (although they don’t have to) and are linked to a physical data source.
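As a sketch of how entities and feature views fit together, here is what the definitions might look like with Feast's Python API. The parquet path, feature names, and TTL are assumptions for illustration, not from the original:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical offline source: a parquet file of per-customer stats.
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
)

# The entity: a business object ("customer") that features belong to.
customer = Entity(name="customer", join_keys=["customer_id"])

# The feature view: a named group of features tied to the entity
# and linked to the physical data source above.
customer_stats = FeatureView(
    name="customer_stats",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="orders_last_30d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
    ],
    source=customer_stats_source,
)
```

Running `feast apply` registers these definitions, after which features can be fetched by name.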

Feature Registry

I like to think of the registry as the brain of the feature store. This is typically a relational database that keeps track of all the entities, feature views, data sources, and so on. It knows when features were last refreshed and stores any associated metadata. Any interaction with the feature store goes through the registry in one way or another to fetch metadata.
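As an illustration, here is a toy, in-memory sketch of the kind of metadata a registry tracks. In reality the registry is a database, and all names and paths here are hypothetical:

```python
from datetime import datetime, timezone

# Toy stand-in for a registry: the objects a feature store knows about,
# plus their metadata (a real registry is a database, not a dict).
registry = {
    "entities": {"customer": {"join_key": "customer_id"}},
    "feature_views": {
        "customer_stats": {
            "entity": "customer",
            "features": ["orders_last_30d", "avg_order_value"],
            "source": "data/customer_stats.parquet",  # hypothetical path
            "last_materialized": datetime(2024, 5, 25, tzinfo=timezone.utc),
        },
    },
}

# Every fetch consults the registry first to resolve a name to metadata.
view = registry["feature_views"]["customer_stats"]
assert view["entity"] == "customer"
```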

Offline Store

This is essentially just the storage layer for historical features. This can be blob storage on your cloud provider of choice or a table in Snowflake or some other cloud database.

Feature Store SDK

Most feature stores will come with a Python SDK for handling interactions with the features such as registering a feature, fetching a feature, or performing materialization (more on that in a sec).

Online Store

The online store is a read-optimized database such as Redis for powering online applications. It is typically only populated with the most recent version (by timestamp) of each feature. Data can be streamed directly into the online store or systematically refreshed from the offline store in a process called materialization. Some feature stores handle that materialization for you; with others, you control when it happens.
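Materialization itself is conceptually simple. Here's a toy sketch in plain Python, using a list of rows as a stand-in for the offline store and a dict as a stand-in for the online store (all names are hypothetical):

```python
# Sketch of materialization: copy only the latest value (by timestamp)
# of each feature row from the offline store into the online store.
offline_rows = [
    {"customer_id": "c1", "ts": 1, "orders_last_30d": 3},
    {"customer_id": "c1", "ts": 5, "orders_last_30d": 7},  # latest for c1
    {"customer_id": "c2", "ts": 2, "orders_last_30d": 1},
]

def materialize(rows):
    online = {}
    # Sorting ascending by timestamp means the latest row wins on overwrite.
    for row in sorted(rows, key=lambda r: r["ts"]):
        online[row["customer_id"]] = {
            k: v for k, v in row.items() if k not in ("customer_id", "ts")
        }
    return online

online_store = materialize(offline_rows)
assert online_store["c1"] == {"orders_last_30d": 7}
```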

Feature Store REST API

This is your way to fetch data from the online store for use in inference. These APIs typically provide sub-50 ms latency, enabling them to power critical, customer-facing applications.
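As a sketch, here is how a client might build the request body for such an API. The payload shape follows Feast's feature server endpoint (`POST /get-online-features`); the feature reference and entity IDs are hypothetical:

```python
import json

# Build the JSON body for an online feature lookup, keyed by entity IDs.
def build_request(feature_refs, entity_rows):
    return json.dumps({
        "features": feature_refs,
        "entities": entity_rows,
    })

body = build_request(
    ["customer_stats:orders_last_30d"],
    {"customer_id": ["c1", "c2"]},
)

# At inference time you'd POST this to a running feature server, e.g.:
#   requests.post("http://localhost:6566/get-online-features", data=body)
```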

Do You Need a Feature Store?

As you can see, there are a lot of components that power a feature store. When spinning up an open-source feature store like Feast, you’ll be responsible for managing the infrastructure for the offline store, online store, feature registry, and online serving API, as well as all of the pipelines creating the features.

Unless you are a large tech company, I would advise against self-managing Feast due to its complexity and overhead. Luckily, pretty much every cloud provider offers a fully managed feature store, so you only need to worry about getting the data in and out. Some popular offerings include Tecton (essentially a souped-up, managed Feast), GCP Vertex AI Feature Store, Databricks Feature Store, Snowflake Feature Store, AWS, Azure… you get the point.

But even if it is managed, do you really need one? Here’s my take:

  1. Do you need feature data to power mission-critical, real-time ML applications?

    1. If you answered yes to this, I would strongly suggest implementing a feature store.

  2. Do you have problems with duplicate feature engineering pipelines or varying feature definitions across your business?

    1. You may benefit from a feature store to centralize features in one place to promote consistency and reuse.

If you answered no to both of those questions, you probably don’t need it. Just build your feature data like you have been and forgo all the additional complexity.

Feature Store Ownership

This is one of the more interesting aspects of feature stores. They’re used by data scientists, who typically also influence feature definitions; they require data engineering workflows to keep them populated; and they’re a key piece of ML operations, meaning ML engineers should be involved too.

While there’s no one answer, here’s my take on ownership:

  1. Feature definitions - data scientists. They should be closest to the data and understand how specific business metrics should be calculated.

  2. Feature pipelines - the utopian goal here is to enable data scientists to publish their feature definitions and have them automatically executed in production. This assumes your data scientists can write production-grade code (which they should be able to) and that there’s a clean mechanism for executing their code without going through a hand-off and rewrite process with engineers. If that’s not possible, a hand-off to data engineering is probably required, but it will be a friction point.

  3. Infrastructure - this is ideally owned by ML engineers.

  4. Support - hopefully you have a dedicated ops team who can be the first line of defense on pipeline failures or quality alerts. Depending on the issue, data engineers, data scientists, or ML engineers could be involved for support.

To wrap up, if your ML application depends on consistent, real-time features, a feature store might make sense. Start by exploring the managed options available in your current cloud platform and don’t reinvent the wheel unless you really have to.

If you found this article helpful, consider subscribing to see more content related to Data, AI, and ML Engineering.