Crater Labs is a Canadian AI and Machine Learning research lab made up of academic researchers, application developers and experienced business executives. It typically works closely with organizations that already have experience with AI/ML models, but need support scaling those models to solve larger problems whose benefits may only be realized five to 10 years down the line.
The team at Crater Labs works to the tagline ‘moonshot with impact!’, which translates into real-life AI/ML projects focused on solving those ‘larger problems’. But using artificial intelligence models to solve them can place grueling demands on technology infrastructure - particularly storage, given the volume of data generated when retraining or pruning models.
Crater Labs CTO Khalid Eidoo explains that AI/ML models are effectively living organisms: as they learn on more and more data, their biases change, their accuracy changes, and their resilience to certain variables changes over time. This is where Crater Labs typically steps in with its customers. He explains:
For example, with the tier one OEM automotive manufacturers, they've been producing transmissions for decades. They know, based on the current type of designs, the types of failures to expect. They know where the failures are going to occur and what they’re going to look like.
And their QA team, which would inspect these parts, knows that process inside out and is AI assisted already. But as they start designing new parts that have fundamentally new designs, they don't know the failure points. They don't know necessarily what is going to fail, where it's going to fail, how it's going to fail.
And so their existing AI models aren't really able to identify defects. That's where we'll come in and work alongside their AI/ML team to come up with advanced models that can synthesize defects within a given model.
Crater Labs started its life in the cloud, working with standard cloud services on AWS and Google Cloud. Eidoo says that for the most part there wasn’t an issue there, but many of its clients had concerns about where their data was being hosted, how it was being hosted, and the cost associated with their models getting larger and larger. Eidoo adds that while cloud storage is thought of as relatively cheap, this is mostly true when looking at cold or semi-warm storage modes. With machine learning and AI, data needs to be available at all times, it needs to be hot and ready to go. Eidoo says:
That's where we noticed that we were actually spending an inordinate amount of time waiting for our data to get prepared and go online so that we could train a model. That meant that we were passing on a lot of overhead, an unproductive cost, to our clients, which was ballooning the cost of some of our research projects.
Crater Labs’ clients are typically larger in size and are very sensitive about their data, and the models it builds for them typically scale to terabytes upon terabytes in size. In addition, as noted above, its clients often have geographic and security requirements. As such, Crater Labs decided to build its infrastructure on premise, allowing it to insulate its clients from the associated costs. These costs are particularly pertinent for storage. Eidoo says:
When you have a one terabyte data set of images that you're working with, you're going to balloon that out to several terabytes just in some of the processing work that you do, even before you start doing any training on those models.
And then with every single training run, because we are doing very research oriented work, as much as we're trying to create models that they'll be able to use down the road, we're creating all sorts of logs, all sorts of artifacts as a part of the training.
So, for example, for every terabyte training run that we do, we'll probably generate between 50% and 75% new data on that one terabyte data set. For the average large client, we're doing several hundred training runs in order to really validate the extent to which these models can be used.
So that's where storage becomes absolutely critical. And even if some of these security restrictions or privacy restrictions weren't in place, doing a lot of this in the cloud just really isn't feasible.
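The storage arithmetic behind those figures can be sketched out. The numbers below are illustrative values drawn from the ranges Eidoo quotes (a one terabyte data set, 50-75% new data per run, several hundred runs), not Crater Labs’ actual workloads:

```python
# Illustrative back-of-envelope estimate of the artifact data generated
# by repeated training runs, using the ranges quoted above. All figures
# are assumptions for illustration, not Crater Labs' real numbers.
def artifacts_generated_tb(dataset_tb, growth_low, growth_high, runs):
    """Return (low, high) bounds on new data generated across all runs, in TB."""
    return dataset_tb * growth_low * runs, dataset_tb * growth_high * runs

low, high = artifacts_generated_tb(dataset_tb=1.0, growth_low=0.50,
                                   growth_high=0.75, runs=300)
print(f"~{low:.0f}-{high:.0f} TB of new data over 300 runs")  # ~150-225 TB
```

At several hundred runs, even a single one terabyte project can generate hundreds of terabytes of intermediate data, which is why the quote frames storage as the critical constraint.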
The nature of research is that failure is inevitable. Crater Labs is training AI/ML models for its clients for success, but there will always be costs associated with figuring out what isn’t working. As such, failure is another cost factor to limit. Or, as Eidoo says, it’s important to get to those failures and to understand what the limits are as quickly as possible. He adds:
That's really where we started looking into different storage solutions. As an intermediate basis, we tried purchasing some servers and having large volumes of storage in there, but that isn’t conducive to having a large distributed environment where you have multiple GPUs across multiple servers.
Crater Labs ended up implementing Pure Storage’s FlashBlade technology, which Eidoo describes as being very geared towards dealing with highly unstructured data and providing very fast access to it. Equally, he adds, the technology has a fast backplane, so it’s able to push data to Crater Labs’ servers at very high throughput over long periods of time.
Crater Labs found the implementation of Pure Storage to be a “non-event”, thanks to the vendor’s high performance transfer tools. Crater Labs managed to move over its data and be online and running in just a couple of hours. In fact, the organization had a new model running within fifteen minutes of coming online.
And based on some recent comparisons carried out by Crater Labs, the results speak for themselves. Eidoo says:
We have one particular case where there's a bunch of data pre-processing that needs to happen in a database, and those databases are about 500 gigabytes in size. And then we do a whole bunch of queries and data pre-processing, where ultimately that 500 gigabyte database becomes about three terabytes of data, which we can then feed to a machine learning algorithm.
When we run that process in the cloud, just using standard technologies for storage, we find that on average those pre-processing runs would take anywhere between 72 to 96 hours. And that is, going sort of unfettered with the number of compute nodes and storage and whatnot.
That's maxing out the capability of the infrastructure as well as our code in that particular case. When we use our FlashBlade, we are generally getting that process done in under 10 hours. We actually know that it's purely because of Pure, because we are actually using fewer compute nodes to achieve the same time.
In simple terms, Crater Labs is ultimately carrying out its work in approximately a tenth of the time it used to take when using the cloud.
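That tenth-of-the-time figure follows directly from the numbers quoted: 72 to 96 hours in the cloud against under 10 hours on FlashBlade. A quick sanity check of the arithmetic:

```python
# Sanity check on the speedup implied by the quoted figures:
# cloud pre-processing took 72-96 hours; FlashBlade finishes in under 10.
cloud_hours_low, cloud_hours_high = 72, 96
flashblade_hours = 10

speedup_low = cloud_hours_low / flashblade_hours    # 7.2x
speedup_high = cloud_hours_high / flashblade_hours  # 9.6x
print(f"{speedup_low:.1f}x to {speedup_high:.1f}x faster")
```

Since the on-premise runs finish in “under” 10 hours, the true speedup is at least 7.2x to 9.6x, consistent with the claim of roughly a tenth of the time.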
But it’s not just about speed. Crater Labs was also aware that one of the benefits of using the cloud was its ease of use. Eidoo adds:
It was really about the expansion capabilities, the ease of use. What we liked about the cloud was that we were basically administrator-less, we could implement infrastructure within our code, and that these boxes would just automatically run.
And FlashBlade kind of just runs itself. When we need something, we spin up a container, we orchestrate it and we're off to the races. FlashBlade just seems to handle it regardless of what kind of a project it is that we're doing.
The rise of LLMs
Crater Labs, unsurprisingly, is currently adapting to the sudden onset of Large Language Models (LLMs) in the market. Ever since the launch of ChatGPT, organizations far and wide have been demanding answers on how LLMs will affect their business and how they can be utilized to improve their operations.
Eidoo says that LLMs are often trained on millions upon millions of inputs, with full size models reaching several terabytes in size. Often Crater Labs’ clients want to use LLMs to understand the information being sent to them and to identify sentiments that are specific to their industry.
To do this successfully, Crater Labs is focused on pruning models and retraining them so that they are better specialized for the industries that its clients are based in. Eidoo says:
With some of our scientific clients, they want to use these large language models to look at the literature that's coming out of academia to identify and understand the use of particular bio-entities within those documents - look at trends, analyze them - so that they can refocus where their research is being conducted and better understand where the real advances are being made in the industry today.
Taking a large language model and pruning it down so that it is much more highly specialized actually requires even more data than the large language model itself - even though we're pruning, which is somewhat counterintuitive.
This is because what we're doing with every run, as we take nodes out of that initial large language model, is introducing some new nodes. But then we're also evaluating what happens when you remove certain parts of that larger model. That requires a huge amount of validation.
Simply put, LLMs have changed the way a lot of model training is done: you start with a large model, but working towards something smaller and more specialized means creating an order of magnitude more data along the way. Eidoo gives the example of an AI that identifies every color - if you start removing all of the red from that model, it might no longer recognize brown or purple either, so you have to constantly evaluate and understand what’s going on, which creates more and more data.
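The prune-and-validate loop Eidoo describes can be sketched in miniature. The code below is a toy illustration, assuming simple magnitude-based pruning over a flat list of weights; the names, the 10%-per-step schedule, and the dummy scoring function are all invented for the example, and each step’s evaluation record stands in for the logs and artifacts that multiply the stored data:

```python
# Toy sketch of an iterative prune-and-validate loop. Everything here
# is illustrative, not Crater Labs' actual pipeline: prune a fraction
# of the smallest-magnitude weights, then re-evaluate, and keep a log
# entry per step - those logs are the extra data the article describes.
import random

def evaluate(weights):
    """Stand-in for a full validation pass; returns a dummy score."""
    return sum(abs(w) for w in weights) / max(len(weights), 1)

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(1000)]
logs = []  # every pruning step produces artifacts that must be stored

for step in range(10):
    # drop the 10% of remaining weights with the smallest magnitude
    weights.sort(key=abs)
    weights = weights[len(weights) // 10:]
    # re-evaluate after each removal; this is where the extra data comes from
    logs.append({"step": step, "params": len(weights), "score": evaluate(weights)})

print(f"kept {logs[-1]['params']} of 1000 params across {len(logs)} recorded evaluations")
```

In a real pipeline the evaluation step is a full validation suite rather than a one-line score, which is why each pruning run generates so much more data than the model itself.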
As such, Crater Labs’ storage needs are changing too. Eidoo says:
What we're going to need is a high degree of parallelism for this very unstructured data. Historically, with a lot of machine learning, the data is in databases, or it’s in well constructed images. When you start getting into language models, you're literally just dealing with documents. They can be all sorts of different sizes, right?
A research paper could be several hundred pages long and a press release could be two paragraphs. A storage system needs to be able to deal with that data however it comes and feed it to a model that's being trained as quickly as possible, without compromising that access time as well as the throughput because of the size of that data.
And because language models in particular have so much application across so many different entities and are capable of so many different kinds of things, we're seeing that there's a larger and larger demand for those kinds of models, so we need to be able to train multiple models across multiple industries in parallel. And get those out as quickly as possible.
That's where having the parallelism, that fast access, and also not having to worry about how the containers are being set up, really becomes important. We're going through projects with greater velocity than we ever have before.