
AI on ACID - scaling from the ground up

George Lawton, August 1, 2023
Rather than bringing data to one place at enormous transit and egress costs, why not just train on the data in place?


Enterprise efforts to scale AI infrastructure are all the rage these days, but the underlying data infrastructure required gets less attention. VAST Data is pursuing a novel approach that focuses on the physics of how data gets stored on decentralized solid-state drives (SSDs). Now it's rolling out an AI infrastructure that runs directly on this decentralized data store.

The company argues that rather than bringing all the data to one place at enormous transit and egress costs, why not just train on the data in place? This also addresses the data residency mandates that can preclude sending data across borders. 

VAST’s traditional competitors in the storage infrastructure business include companies like Dell EMC, IBM, and Pure Storage. The new AI enhancements could help the company compete in the larger market for decentralized cloud database services from companies like Snowflake and Databricks, and in the emerging market for data stores for training AI-based large language models (LLMs). VAST Data co-founder Jeff Denworth explains:

There are companies in the market that are really pushing this term called a data platform, like Snowflake and Databricks. Both of them came to the end state from different vantage points. Snowflake started with the data warehouse, and Databricks started with the computing engine. And then, they both added the respective gaps in their offering. We came at it from the third vantage point, which is we started building from the storage up, whereas they sourced everything to Amazon.

He argues that classic data warehouses don’t address the needs of emerging use cases like computer vision, life sciences, and large language models trained on large data corpora. Furthermore, future data stores will need to support petabytes or even exabytes in one centralized infrastructure.

Early customers include NVIDIA, which invested in VAST and is using the tech to build out data infrastructures for AI supercomputers. Pixar started using the new architecture for storing 3D assets for making movies. Now its researchers can use the same universal storage system for training better algorithms to de-noise movies and develop better lighting models that allow them to render movies faster.  

Using similarity to reduce SSD cost

VAST launched in 2016 to take advantage of new standards for aggregating SSDs more efficiently using the newly ratified NVMe-over-Fabric interface standard. The key innovation was rethinking the way data bits get laid down into an SSD drive. 

Traditional hard drives sort bits into 32-kilobyte blocks; VAST instead breaks data into 2-bit blocks. This increases the granularity for finding duplicate data and organizing it using pointers to redundant bits. The company calls this more efficient process similarity reduction, to contrast it with traditional data deduplication techniques that operate on larger blocks. Denworth says customers commonly see a four-fold increase in compression ratio when moving to the smaller blocks.
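To see why finer-grained blocks surface more duplicates, consider a toy sketch. The block sizes and sample data here are invented stand-ins, and real similarity reduction matches near-duplicate blocks rather than only exact matches — this only illustrates the granularity effect:

```python
# Toy block-level deduplication at two granularities. Illustrative only:
# the block sizes and sample data are invented, and VAST's similarity
# reduction matches near-duplicate blocks, not just exact ones.

def dedup_ratio(data: bytes, block_size: int) -> float:
    """Fraction of fixed-size blocks that duplicate an earlier block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return 1 - len(set(blocks)) / len(blocks)

# 100 records: a unique 8-byte header followed by a repeated 24-byte payload.
data = b"".join(b"HDR%04d_" % i + b"payload_" * 3 for i in range(100))

print(dedup_ratio(data, 32))  # coarse 32-byte blocks: every block looks unique
print(dedup_ratio(data, 8))   # fine 8-byte blocks: the repeated payload dedupes
```

With coarse blocks, the unique header in each record masks the repeated payload; with fine blocks, the payload collapses to a single stored copy plus pointers.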

SSD costs are shrinking more rapidly than traditional hard disk drive (HDD) costs. However, at a unit level, SSDs are still about four times more expensive than HDDs for data centers. Denworth argues that the efficiency of similarity reduction allows enterprises to take advantage of SSD performance at cost parity.

This also reduces the need for enterprises to turn to different storage tiers to organize large data sets. AWS, for example, has seven cost and performance tiers for different types of data. Denworth explains:

Customers have always had to choose between price and performance, either you get cheap capacity, or you get good performance, but nobody ever gave you both. And that is how we emerged into the market. We said we've unlocked access to data at archive or exabyte scale. But historically, an archive is where data goes to die.

The traditional value of a large data infrastructure has been to process transactions more efficiently. But now enterprises are increasingly turning to historical datasets to derive insights through applied statistics. And this requires access to more of that historical data.

Running AI on ACID 

With traditional databases, a gating factor was finding ways to share data across multiple systems efficiently. Transaction databases often require ACID transactions, which guarantee atomicity, consistency, isolation, and durability. This is hard to do across multiple data stores, so the largest transaction databases used a single symmetric multiprocessor connected to a single storage array to ensure the consistency of transactions.
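As a miniature illustration of the atomicity that ACID demands, here is a single-node sketch using Python's built-in sqlite3 module. On one machine this is trivial — which is exactly why the distributed case is the hard problem:

```python
import sqlite3

# Minimal single-node illustration of the "A" in ACID: both halves of a
# transfer commit together or not at all. Coordinating this same
# guarantee across many distributed stores is the hard part.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("crash mid-transfer")  # simulated failure
except RuntimeError:
    pass

# Alice's debit was rolled back: the half-finished transfer never happened.
print(conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0])  # 100
```

The connection's context manager rolls the whole transaction back on the simulated crash, so readers never observe money debited but not credited.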

A decentralized system can scale more cost-efficiently, but every change must be propagated across the network. Densworth explains:

The challenge becomes, as you write into these machines, there's a certain amount of interdependencies between them, called shared nothing. Every time you write into one of the machines, those writes need to be propagated across other machines in the network. And as you have more machines, there's more serialization that happens in the cluster. There's a law of diminishing returns that kicks in such that you can't, for example, build an exabyte scale database today because it would fall to its knees in terms of managing all the different transactions that happened into it.
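One deliberately simplified way to picture the diminishing returns the quote above describes: if committing a transaction requires a two-phase-commit exchange with every other node, coordination cost per transaction grows linearly with cluster size. The message counts below are illustrative assumptions, not VAST measurements:

```python
# Deliberately simplified model of distributed commit overhead.
# The message accounting is an illustrative assumption, not a benchmark.

def messages_per_txn(nodes: int) -> int:
    """Two-phase commit: prepare + vote, then commit + ack, exchanged
    with every other participant."""
    return 4 * (nodes - 1)

for n in (4, 16, 64):
    per_txn = messages_per_txn(n)
    # If every node also issues transactions, cluster-wide coordination
    # work grows roughly quadratically with node count.
    print(f"{n:>3} nodes: {per_txn:>4} msgs/txn, {n * per_txn:>6} msgs cluster-wide")
```

Per-node capacity stays flat while coordination grows, so beyond some cluster size, adding machines buys little — the "law of diminishing returns" in the quote.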

VAST used NVMe-over-Fabrics to organize commodity machines so that all of the CPUs can see all of the data on the SSDs concurrently through a shared global namespace, without having to talk to each other. This enables system architects to rethink the requirement for a tight marriage of compute and state in a distributed storage system.

The new infrastructure promises to let enterprises build systems in which functions run on any node, and every node can see all of the data, so nodes can manage transactions themselves. VAST’s approach to a decentralized global namespace could connect previously disconnected processes for managing transactions, analytics, and AI development into a coherent ecosystem.

The new offering promises to allow data scientists to train new algorithms on versioned, similarity-reduced replications of live transaction and IoT data without impacting the system of record. Denworth says:

Imagine having a hundred data centers, some in the cloud, some in your own facility, and some in edge computing sites, all having a common view of your data, regardless of where you come in on the network, and fast transactional access into that space from wherever you are. That's the problem that we've solved here.

My take

Traditionally, transaction, analytics, and now AI data have all resided in separate infrastructures. This requires a lot of integration glue and process to ensure consistency of transactions and flexibility for new analytics models, both in the data science phase and later in deployment.

If this new universal infrastructure works as intended, it could save enterprises a ton of money on hardware and mitigate many data engineering challenges. A few years ago, the common wisdom was that an enterprise needed five data engineers for each data scientist. And this was before the AI explosion driven by large language models.

At the very least, this new offering is likely to inspire further innovations in the dusty old domain of storage infrastructure. 
