
The data journey for AI - sustainability, as-a-service and supporting data scientists

Fred Lherault, April 25, 2024
In the first part of this two-part article, Fred Lherault of Pure Storage looked at the six stages involved in finding and processing data as it’s used in AI models and training. Part two looks at the challenges this process creates and how organizations can approach and mitigate them.

Data Stream Traffic (© kentoh)

The multi-stage process involved in getting data ready and useable for AI creates three challenges:

  • The sheer amount of data created and the environmental impact of storage.
  • The large array of tools required to handle this process end-to-end.
  • The complexities of dealing with constantly changing requirements.

Dealing with large volumes of data and the impact on sustainability

As data and storage requirements grow, so do the complexity of handling them and their environmental impact. However, by selecting infrastructure that reduces energy consumption and is designed to better support the needs of AI, organizations can overcome these challenges.

It is important to remember there is no such thing as cold data anymore. At best we’re talking about “warm” data that needs to be made available quickly and on-demand for data scientists. Flash storage is the only solution which can deliver this level of availability for the unstructured data that AI requires to be successful. This is because linking AI models with data requires a storage solution that provides reliable and easy access to data across silos and applications at all times – this is often not possible with an HDD storage solution.

As more organizations sign up to science-based sustainability targets, they need to think about the environmental cost of storage. Data center operators are implementing more power-efficient technology to deal with storage-hungry AI. Offloading this problem to someone else (such as a public cloud provider) will not make it go away, as many organizations will soon be required to report on scope 3 emissions, which include upstream and downstream environmental costs. Working with a vendor that can reduce the space, power and cooling requirements of storage is a vital way to mitigate the challenge of storing the growing data volumes that result from AI.

Tools to support data scientists

With data scientists spending so much time pre-processing and exploring data, they need the tools, resources and platforms to conduct this work efficiently, as and when they need it. Python and Jupyter Notebooks have become the day-to-day language and tools for data scientists, and the data ingestion, processing and visualization tools all have one thing in common: they can be deployed as a container. The ideal platform for data scientists is therefore one that supports all these tools and enables them to deploy and run containers quickly, easily and, most importantly, in a self-service manner.

With 451 Research stating that 95% of new apps are written in containers, it’s become even more vital for data scientists to have fast and easy access to them. Not enabling this will have a detrimental impact on an organization’s overall growth, digital transformation, customer services and innovation. Every area of a business is affected if data scientists aren’t properly supported.

Leading AI organizations are now building “Data science-as-a-Service” platforms, leveraging a lot of the tools mentioned above built on software infrastructure such as Kubernetes. To be successful though, these platforms need to provide not only the data frameworks and tools as-a-Service but also the data itself, otherwise it negates the benefit of self-service. Data platforms tightly integrated with Kubernetes and allowing easy sharing, copying, checkpointing and rollback of the data itself are key to success in this area.
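As a rough illustration of what checkpointing and rollback of data can look like on Kubernetes, the sketch below builds two manifests in Python: a VolumeSnapshot that checkpoints an existing volume, and a PersistentVolumeClaim that restores a new volume from that snapshot. It assumes a cluster with a CSI driver that supports snapshots. Every name in it (training-data, csi-snapclass, fast-flash, the 100Gi size) is a hypothetical placeholder rather than anything taken from a specific platform.

```python
import json


def volume_snapshot(name, pvc_name, snapshot_class):
    """Build a VolumeSnapshot manifest that checkpoints an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def pvc_from_snapshot(name, snapshot_name, storage_class, size):
    """Build a PVC manifest whose contents are restored from a snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names: checkpoint the "training-data" volume before an
# experiment, then roll back by creating a fresh volume from the checkpoint.
checkpoint = volume_snapshot(
    "training-data-checkpoint", "training-data", "csi-snapclass"
)
rollback = pvc_from_snapshot(
    "training-data-restored", "training-data-checkpoint", "fast-flash", "100Gi"
)

# kubectl accepts JSON as well as YAML, so each manifest could be saved to
# its own file and applied with "kubectl apply -f <file>".
print(json.dumps(checkpoint, indent=2))
print(json.dumps(rollback, indent=2))
```

In a self-service platform, manifests like these would typically be generated behind a notebook extension or an internal API, so that data scientists never have to write them by hand.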

Adding the flexibility of as-a-Service consumption

A key concern that IT organizations have about AI is the speed at which the market evolves, which far exceeds the average investment cycle of enterprise organizations. New AI models, frameworks, tools and methods emerge on a regular basis and their adoption can have a massive impact on the underlying software and hardware platforms used for AI, leading to unplanned costs if changes are required in the underlying technology.

As-a-Service consumption models should be considered as an effective tool to increase the flexibility of the AI platform. They also enable the people building it to easily incorporate new solutions or change their infrastructure as required by the constantly evolving needs of data scientists, essentially supporting all six of the steps mentioned in the first article.

Additionally, as-a-Service models enable organizations to meet their sustainability goals by better controlling energy costs through lower power consumption and by using only the resources that are needed at any given time. Some Storage as-a-Service offerings are also backed by SLAs to pay for electricity use, and they support sustainability goals by eliminating rip-and-replace tech refresh cycles and the e-waste they generate.

Solutions to handle AI data challenges

The data journey for AI is one of data amplification. At each stage of the AI journey, data and metadata are created and added to. This will require more and more infrastructure to support the development of future AI. Data Science as-a-Service is what data scientists want in order to handle the demands of AI. It means tools as well as data, provided on demand and through automation. Achieving it requires the right software and hardware infrastructure, combined with the right consumption model, to make it a success and take an organization from data ingestion to innovation.
