The data journey for AI - ingestion to innovation

Fred Lherault, April 11, 2024
The first in this two-part series from Pure Storage examines how data is found, transformed and moved before it is used to train AI models.

Much is written about training AI models, yet data scientists spend much of their time on the processes that take place before and after model training. At each of these stages, data is transformed and amplified. To create an effective and useful AI model, the data that informs it must be easy to find, accessible, AI-ready and accurate. Organizations should consider how to empower data scientists; support data growth in a sustainable way; and, given the fast pace of change in AI projects, ensure they have the technology to support current and future needs with as-a-service solutions.

Here are the six stages data generally goes through, as well as some considerations regarding how it will be transformed and amplified.

1. Finding and loading data

Does the data live in the cloud or on-premises? Is it in a database? Is it structured or unstructured? In practice it will likely come from a combination of all of these: real-world data sources, transactional systems and business application data.

  • Data may need to be exported to a format that will be easier to use. This results in the duplication of this data, albeit in a different format.
  • It may need to be copied to a different location for analysis. 
  • Depending on the use case and the scarcity of the source data, scientists may want to “amplify” data through synthetic data generation. Synthetic data can be created by taking the source data and making slight variations, which can significantly increase the amount of data to store. Note: there are rising concerns that synthetic data can “poison” AI training if it was itself generated by AI models, so a degree of skepticism is warranted when considering it.
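To make the amplification step concrete, here is a minimal sketch of one simple form of synthetic data generation: creating jittered copies of each source record. The column names, noise scale and copy count are illustrative assumptions, not anything prescribed by the article.

```python
import random

def amplify(records, copies=5, noise=0.05):
    """Return the original records plus `copies` jittered variants of each.

    Numeric fields get a small random perturbation; other fields are kept
    as-is. This is a toy stand-in for real synthetic data techniques.
    """
    synthetic = []
    for rec in records:
        for _ in range(copies):
            synthetic.append({
                k: v * (1 + random.uniform(-noise, noise))
                if isinstance(v, (int, float)) else v
                for k, v in rec.items()
            })
    return records + synthetic

# Two source records become twelve stored records: 2 originals + 2 * 5 variants.
source = [{"sensor_id": "a1", "reading": 20.5},
          {"sensor_id": "a2", "reading": 19.8}]
amplified = amplify(source, copies=5)
print(len(source), "->", len(amplified))  # 2 -> 12
```

Note how even this trivial scheme multiplies the storage footprint sixfold, which is the article's point about amplification.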

2. Preparing data (pre-processing): 

This means making the data usable. Some of it may be in the wrong format, or have missing values, which makes it unsuitable for some types of AI. Other data may need to be excluded from the analysis for different reasons.

  • Depending on the type of AI the data will be used for (predictive AI vs generative AI for example), it may also need labeling, in essence enhancing the data with metadata.
  • Feature engineering – the process of selecting and enhancing specific parts of the data to improve the performance of the model – can result in additional metadata that will need to be stored.
  • For predictive AI, some of the data will need to be excluded from training and set aside for testing in order to validate the results of the training later.
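The hold-out step in the last bullet can be sketched in a few lines: shuffle the data once, then reserve a fraction for later evaluation. The 80/20 ratio and fixed seed are illustrative assumptions.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle once, then set aside `test_fraction` of the data for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

samples = list(range(100))
train, test = train_test_split(samples)
print(len(train), len(test))  # 80 20
```

The test partition is never shown to the training process; it only comes back into play at the evaluation stage described below.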

3. Training: 

In this phase, the pre-processed data is mostly read rather than modified. During training, different forms of data are created:

  • The resulting models as well as metadata information about these models and which data they were trained on.
  • Checkpoints, which are used to save progress before training is complete. They make it possible to roll training back partially without redoing all the work – which matters since GPU resources are limited. With these checkpoints, another type of metadata is created.
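A hedged sketch of what checkpointing produces on disk: the model state is periodically persisted together with metadata, so a failed or misdirected run can resume from the most recent save. The "model" here is just a dictionary of parameters, and all file and field names are illustrative assumptions.

```python
import json
import pathlib
import time

def save_checkpoint(step, params, directory="checkpoints"):
    """Persist model parameters plus metadata at a given training step."""
    path = pathlib.Path(directory)
    path.mkdir(exist_ok=True)
    ckpt = {
        "step": step,
        "params": params,
        "saved_at": time.time(),  # metadata created alongside the model state
    }
    (path / f"step_{step:06d}.json").write_text(json.dumps(ckpt))
    return ckpt

def latest_checkpoint(directory="checkpoints"):
    """Return the most recent checkpoint, or None if none exist."""
    files = sorted(pathlib.Path(directory).glob("step_*.json"))
    return json.loads(files[-1].read_text()) if files else None

save_checkpoint(100, {"w": 0.5})
save_checkpoint(200, {"w": 0.4})
print(latest_checkpoint()["step"])  # 200
```

Every checkpoint file is more stored data that did not exist before training began, which is exactly the amplification the article describes.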

4. Model evaluation following training: 

  • For predictive AI, this is where the data set aside earlier (at the end of stage two) for testing will be useful. More metadata will be generated during testing in order to measure and track the results.
  • When it comes to generative AI, testing means creating new data. Often this data will be kept for further analysis, as scientists may want to compare results over time for coherence or diversity. Additionally, human evaluation may be required, in which case it is necessary to store not only the generated content but also the feedback from the people involved in the evaluation.
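For the generative case, the storage pattern in the second bullet can be sketched as a small evaluation log that keeps each generated sample together with its human rating. The schema and field names are illustrative assumptions.

```python
import datetime

evaluations = []

def record_evaluation(prompt, output, human_rating):
    """Store a generated sample alongside reviewer feedback and a timestamp."""
    # Both the model's output and the human judgment are retained, so the
    # data footprint grows with every evaluation round.
    evaluations.append({
        "prompt": prompt,
        "output": output,
        "rating": human_rating,
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

record_evaluation("Summarize Q4", "Sales rose 8% year over year.", human_rating=4)
record_evaluation("Summarize Q4", "Revenue grew in the fourth quarter.", human_rating=3)
print(len(evaluations))  # 2
```

Keeping the same prompt with multiple outputs, as above, is what later enables the over-time comparisons for coherence or diversity that the article mentions.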

5. Deployment once trained:

  • For predictive AI, deployment might not generate much data itself; however, scientists will likely want to monitor the model and record how and when it was used. This monitoring and logging creates a different type of data, which in some cases – especially if AI explainability is required – will be as important as the source data or the model itself.
  • For generative AI, whether all of the content created will be saved depends on a number of factors such as who uses the model and for what purpose. If used in a customer-facing context, many organizations will decide to store all generated content as it may be required later if complaints arise for example. This may result in a lot more data than even the initial source data used to train the model.
  • Newer AI-enhancing techniques such as Retrieval-Augmented Generation (RAG) improve the results of generative AI by parsing additional information or documents not used during the training phase. This may require making that data “AI ready” by pre-computing and storing “vectors” or metadata for all documents that will need to be searched.
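To illustrate the pre-compute-and-store step that RAG requires, here is a deliberately toy sketch: each document is turned into a fixed-size bag-of-words vector and stored in an index that can be searched at query time. Real systems use learned embeddings and vector databases; the checksum-based hashing here is purely a stand-in to show the shape of the workflow, and all names are assumptions.

```python
import zlib
from collections import Counter

def embed(text, dims=64):
    """Toy embedding: hash each word into a bucket of a fixed-size count vector."""
    vec = [0.0] * dims
    for word, count in Counter(text.lower().split()).items():
        # crc32 gives a deterministic bucket for each word across runs.
        vec[zlib.crc32(word.encode()) % dims] += count
    return vec

def build_index(documents):
    # The stored vectors are the extra "AI ready" data the article mentions:
    # they exist only to make the documents searchable later.
    return [{"doc": doc, "vector": embed(doc)} for doc in documents]

def search(index, query):
    """Return the stored document whose vector best matches the query."""
    qv = embed(query)
    dot = lambda v: sum(a * b for a, b in zip(qv, v))
    return max(index, key=lambda entry: dot(entry["vector"]))["doc"]

index = build_index(["storage systems for AI", "quarterly sales report"])
print(search(index, "AI storage"))  # the document sharing the query's words
```

The point of the sketch is the storage implication: every searchable document now carries a pre-computed vector alongside it, on top of the original content.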

6. Circle back:

Building an AI model is not something you do once, but something that you keep developing and improving. The steps above will repeat based on:

  • New source data being created that the model will need to learn from, since it may contain different patterns – using techniques such as model fine-tuning.
  • Usage of the AI model – human feedback regarding the results of the model may represent invaluable information to be used to enhance the next iteration of the training.
  • The cyclical nature of AI is also something that will generate auxiliary data since scientists may want to track which version of a model produced which results and maybe even which data was used to train or fine-tune it. Code repositories and artifact stores – common in the world of software development – will be part of the landscape and generate their own data.
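The lineage tracking in the last bullet can be sketched as a tiny run registry: each entry records which model version was trained on which data and what results it produced, with the training files fingerprinted so results can be traced back to their inputs. All field names and the hashing choice are illustrative assumptions.

```python
import hashlib
import json

registry = []

def register_run(model_version, training_files, metrics):
    """Record a training run: model version, data fingerprint, and results."""
    # Hash the sorted file list so the same data always yields the same
    # fingerprint, regardless of the order the files were listed in.
    digest = hashlib.sha256(
        json.dumps(sorted(training_files)).encode()
    ).hexdigest()
    entry = {
        "model": model_version,
        "data_sha256": digest[:12],
        "metrics": metrics,
    }
    registry.append(entry)
    return entry

register_run("v1.0", ["2023_q4.csv"], {"accuracy": 0.91})
register_run("v1.1", ["2023_q4.csv", "2024_q1.csv"], {"accuracy": 0.93})
print(len(registry), registry[-1]["model"])  # 2 v1.1
```

Each iteration of the cycle adds another entry, so even the bookkeeping around the models becomes a growing dataset of its own.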

Throughout this journey, the initial data is duplicated, amplified, stored in different formats as well as enhanced with metadata. The AI models that are generated will also start creating their own data and usage information. In total, the amount of data, metadata and logging information significantly exceeds the size of the data at the start of the process, and now involves a variety of different formats.

To manage potential data sprawl, organizations must consider their data storage strategy in order to maximize the impact of AI projects. This includes thinking about sustainability and as-a-service models, among other topics that will be covered in part two of this series.
