How To Build an AI/ML Feature Store With ScyllaDB NoSQL
In this article, delve into the fundamentals of AI/ML feature stores and explore how to get started with your own using ScyllaDB NoSQL.
Join the DZone community and get the full member experience.
Join For FreeMachine learning (ML) feature stores have been attracting attention and usage for business-critical applications ever since Uber introduced the concept with Michelangelo in 2017. In this blog post, we will delve into the fundamentals of ML feature stores and explore why and how ScyllaDB can be a critical part of your feature store architecture.
In order to understand what feature stores are, it’s important first to understand what features are.
What Is a Feature?
In Machine Learning, a feature is a set of data points that can be used to teach a model and make predictions about the future based on historical data. For example, our feature store sample application lets you make predictions regarding flight delays based on historical flight records.
Features are the result of complex data processing and transformation pipelines. Massive amounts of feature data enable accurate predictions and successful machine-learning projects.
What Is a Feature Store?
A feature store is a central database in your machine-learning architecture that contains your real-time and historical features. Feature stores allow your data engineers and data scientists to use the same central repository to discover, monitor, and analyze features.
What does a feature store architecture look like?
Online and Offline Databases in the Feature Store
When we talk about feature stores, users usually differentiate between two kinds of databases in their architecture. On one side, they use an online database, and on the other, they might also have an offline database. These databases serve different purposes.
Offline database: This kind of database stores historical processed features, usually ingested in batches. Offline databases have feature data covering a large time frame from history; hence, they are useful for working with a set of features in a specific period in history.
Online database: This database might contain data from real-time data streams and the offline database as well. Online storage is used to serve the production model and other real-time applications with the most up-to-date feature data. Performance and low latency really matter here. If your database is not capable of delivering real-time features fast enough, then your model might use outdated or inaccurate data to make predictions.
Feature Store Data Modeling: Wide vs. Narrow Table Design
When you are designing the data model within your feature store, be it an offline or online store, you can decide between two types of table designs: wide and narrow. Each has its own benefits and drawbacks. Let’s see actual examples for both and why they might or might not be the best for your use case:
Wide Table Design
The wide table design includes separate columns for each feature. The more types of features you want to store in the table, the more columns you have to create.
Wide Table Layout Example
create table feature_store.wide_example( | |
date TIMESTAMP, | |
feature_id INT, | |
feature_col1 FLOAT, | |
feature_col2 FLOAT, | |
feature_col3 FLOAT, | |
feature_col4 FLOAT, | |
feature_col5 FLOAT, | |
feature_col6 FLOAT, | |
feature_col7 FLOAT | |
) |
This kind of layout can be easy to get started with, but it also becomes more complicated to maintain over time and hard to make changes to. Whenever you want to introduce a new feature (or drop an existing one), you need to modify the schema, which can be complicated.
Narrow Table Design
Narrow table designs are simple and easier to maintain. This is because the number of columns is not meant to increase or decrease in the future, even if you add or remove features.
Narrow Table Layout Example
create table feature_store.narrow_example( | |
feature_id INT, | |
feature_name TEXT, | |
feature_value FLOAT | |
) |
Using this layout, you can get away with using only two fixed columns long-term to store features. One for the name of the feature (e.g., LATE_AIRCRAFT_DELAY) and one for the value of that feature.
In general, narrow tables might require casting the data types when retrieving data because it’s not in the correct form (e.g., the column type is FLOAT, but in reality, the data value is an INTEGER. Fortunately, when we talk about feature stores, online and offline stores already have the data in proper clean number (FLOAT) format, and all values have the same data type, which means this is not a drawback in the case of feature stores.
What Is ScyllaDB, and How Can It Be Used in Your Feature Store Architecture?
In order for machine learning teams to build real-time inference applications, they need databases that can return features at scale with low latency. ScyllaDB is a high-performance, low-latency NoSQL database that can handle high volumes of read and write operations. Furthermore, ScyllaDB is a trusted database for mission-critical feature store workloads at companies like GE Healthcare or ShareChat. Due to its high availability and fault tolerance, It can do the heavy lifting in your infrastructure where performance and reliability matter.
Aside from leveraging ScyllaDB as the online store in your feature store architecture, ScyllaDB is also used as an online/offline hybrid storage solution. With this approach, you can lessen the maintenance burden on your team by having a single database to serve all your feature store workloads.
Users often place ScyllaDB in the center of their architecture to persist and retrieve features and feature store metadata. In this case, ScyllaDB acts as an online store. Other users also use ScyllaDB as their online/offline hybrid storage. Performance is a key requirement in order to speed up model development, and ScyllaDB’s read-and-write performance consistently meets or exceeds user expectations.
In fact, some users found that ScyllaDB could replace multiple databases and serve as a single central store for all their machine-learning data needs. For example, ScyllaDB can replace Redis (online store) and PostgreSQL (offline store) — making infrastructure maintenance less expensive and simpler.
ScyllaDB shines in use cases where you require low latency and high performance. Furthermore, ScyllaDB is compatible with Cassandra and DynamoDB which means if you already use one of these databases, you can seamlessly migrate without having to change your queries.
Tutorial: ScyllaDB Online Store
To help you get started with ScyllaDB as your online store, we’ve created a sample application (also available on GitHub).
- Clone the repository
- Sign up for ScyllaDB Cloud or install ScyllaDB locally
- Create the schema:
cqlsh “node-0.aws_us_east_1.xxxxxxxxx.clusters.scylla.cloud” 9042 -u scylla -p “password” -f schema.cql
- Connect to the instance with cqlsh and import a sample dataset
cqlsh “node-0.aws_us_east_1.xxxxxxxxx.clusters.scylla.cloud” 9042 -u scylla -p “password” scylla@cqlsh> COPY feature_store.flight_features FROM ‘flight_features.csv’;
This command ingests a sample flight dataset:
op_carrier_fl_num|actual_elapsed_time|air_time|arr_delay|arr_time|cancellation_code|cancelled|carrier_delay|crs_arr_time|crs_dep_time|crs_elapsed_time|dep_delay|dep_time|dest|distance|diverted|fl_date |late_aircraft_delay|nas_delay|op_carrier|origin|security_delay|taxi_in|taxi_out|weather_delay|wheels_off|wheels_on| | |
-----------------+-------------------+--------+---------+--------+-----------------+---------+-------------+------------+------------+----------------+---------+--------+----+--------+--------+-------------------+-------------------+---------+----------+------+--------------+-------+--------+-------------+----------+---------+ | |
4317| 96.0| 73.0| -19.0| 2113.0| | 0.0| | 2132| 2040| 112.0| -3.0| 2037.0|MLI | 373.0| 0.0|2018-12-31 02:00:00| | |OO |DTW | | 5.0| 18.0| | 2055.0| 2108.0| | |
3372| 94.0| 74.0| 81.0| 1500.0| | 0.0| 0.0| 1339| 1150| 109.0| 96.0| 1326.0|RNO | 564.0| 0.0|2018-12-31 02:00:00| 81.0| 0.0|OO |SEA | 0.0| 3.0| 17.0| 0.0| 1343.0| 1457.0| | |
1584| 385.0| 348.0| -21.0| 2023.0| | 0.0| | 2034| 1700| 394.0| -9.0| 1658.0|SFO | 2565.0| 0.0|2018-12-31 02:00:00| | |UA |EWR | | 13.0| 33.0| | 1731.0| 2019.0| | |
4830| 119.0| 85.0| -35.0| 1431.0| | 0.0| | 1437| 1245| 136.0| -13.0| 1232.0|MSP | 546.0| 0.0|2018-12-31 02:00:00| | |OO |MSP | | 16.0| 18.0| | 1250.0| 1415.0| | |
2731| 158.0| 146.0| -19.0| 1911.0| | 0.0| | 1930| 1800| 160.0| -8.0| 1800.0|MSP | 842.0| 0.0|2018-12-31 02:00:00| | |WN |PVD | | 7.0| 7.0| | 1807.0| 1904.0| |
ScyllaDB + Feast
ScyllaDB also integrates with feature store tools like Feast. Feast is a popular open-source feature store for production ML. You can use several databases as your online feature store when using Feast, including ScyllaDB.
To set up ScyllaDB as a Feast online store, you need to edit the configuration file of Feast and add your ScyllaDB credentials. ScyllaDB is Cassandra-compatible, so you can use Feast’s built-in Cassandra connector.
# feature_store.yaml | |
project: scylla_feature_repo | |
registry: data/registry.db | |
provider: local | |
online_store: | |
type: cassandra | |
hosts: | |
- node-0.aws_us_east_1.xxxxxxxx.clusters.scylla.cloud | |
- node-1.aws_us_east_1.xxxxxxxx.clusters.scylla.cloud | |
- node-2.aws_us_east_1.xxxxxxxx.clusters.scylla.cloud | |
keyspace: feast | |
username: scylla | |
password: password |
Wrapping Up
Feature stores are necessary to feature engineering and building machine learning models. If you’re building a real-time feature store infrastructure, you need to consider performance carefully. Low-latency, high-performance, and high-throughput requirements make NoSQL databases a perfect candidate as an online storage solution in your feature store.
Published at DZone with permission of Attila Toth. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments