
KubeCon + CloudNativeCon - where next for AI and cloud native?

Chris Middleton, March 22, 2024
Summary:
How can the cloud-native community make AI work for their internal and external clients? An expert panel shared some insights – and the challenges ahead.

AI and connected business tech (© Funtap via Canva.com)

While not the only topic of conversation, Artificial Intelligence was the spectre at the feast in Paris this week – or perhaps the enormous panini in the free food bags. AI was certainly the lurking subtext for numerous panels and keynotes as 12,000 delegates gathered in the vast Pavilion 7 at the Expo Porte de Versailles.

Springtime for AI, then, as balmy weather put smiles on faces and – just occasionally – pulled visitors outside, blinking at the weird golden orb in the sky. (Was it NVIDIA’s market cap, or Sam Altman’s disembodied but totally trustworthy brain? No, thousands of coders in black vendor t-shirts, it’s called the Sun! So, run – run for the shadows!)

But I digress. So, what are the implications of that huge uptick in business for workers at the sharp end: the developer community?

That was the question for a panel featuring: Annie Talvasto, CMO at DevOps provider VSHN; Lachie Evenson, Microsoft Principal Program Manager, and governing board member of event organizers the Cloud Native Computing Foundation (CNCF); Sudha Raghavan, Oracle SVP of the OCI Developer Platform; Arun Gupta, Intel Vice President and General Manager for Open Ecosystem, and Chair of the CNCF governing board; and Bill Ren (Ren Xudong), Huawei General Manager of Open-Source ICT Infrastructure, and another CNCF board member.

In her opening remarks, Oracle’s Raghavan spoke to another subtext of this year’s event:

It's not all about GPUs! You can run AI workloads, the right ones, on CPUs too [cue a raised fist of triumph from Intel’s Gupta]. Indeed, the demand is so high for GPUs, and the innovation is behind that demand, so if we can convert some of that innovation to run on things that we actually have quite a bit of, things might just go a little bit faster!

Indeed. Gupta then spoke to the other subtext:

Don’t look at Gen-AI as the only hammer, and try to use it to hit every nail.

The way I think about Gen-AI is, it’s cool, but when we go into a CIO discussion, it’s ‘How can I use Gen-AI?’ And I’m like, ‘I don’t know. What do you want to do with it?’ And the answer is, ‘I don’t know, you figure it out!’

Think of it not as a solution looking for a problem, but what are the problems you are actually trying to solve? And how can Gen-AI help you – as a part of that?

Round of applause for Gupta. Yet the prevalence of exactly these conversations in every boardroom in the land is staggering. No wonder eight of the top 10 most valuable companies on Earth are technology providers! Their customers are desperate to be sold something to drive up their own share prices.

VSHN’s Talvasto offered a fresh perspective on the hype surrounding generative tools and Large Language Models (LLMs):

As technology professionals, we don't just have to think about how AI can help us, but about how we can help AI, and then add those building blocks. So, instead of thinking only about how AI and machine learning can improve cloud-native work, we can think about how we contribute to making AI and ML work better.

Wise words – and, hopefully, in support of real business aims. If not, the trough of despond that delegates discussed in my previous report may open up, rather than the instant productivity boost that many CIOs imagine will appear shortly after talking up their share prices. (“Just say ‘AI’ and everything will be fine!”)

Huawei’s Ren put it most succinctly:

Use this Generative AI as a new engine for further development.

But to get to that point, how do cloud-native platforms need to evolve to better accommodate Gen-AI adoption? Gupta said:

Google’s Bob Killen said something like, ‘If inference is the new Web app, then Kubernetes is the new Web server’.

Of course, Kubernetes is the de facto compute platform. But I think the disconnect that we need to learn, and embrace, is that the ML ‘cycle’ – how you do data collection, how you do data cleaning, how you do feature store, and how you do testing, and validation – how does the cloud-native concept map to that? 

For example, would each of those be a microservice? And does that add an extra overhead in terms of doing the handoff between them? That is something that cloud-native needs to embrace much better, so that we could be more of a first-class citizen.

We are, of course, providing the infrastructure to get it going. But can we do more than that?
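To make that question concrete, here is a minimal, purely illustrative Python sketch of the trade-off Gupta describes (the stage names and the toy handoff function are hypothetical, not anything discussed on stage). Chaining the ML cycle's stages in-process costs nothing extra; splitting each stage into its own microservice turns every arrow between them into a serialize-and-ship handoff.

    import json

    # Toy stages of the ML "cycle" Gupta lists: collection, cleaning,
    # feature engineering, validation.
    def collect():           return [{"x": 1.0, "y": 2.0}, {"x": None, "y": 3.0}]
    def clean(rows):         return [r for r in rows if r["x"] is not None]
    def featurise(rows):     return [{"f": r["x"] * r["y"]} for r in rows]
    def validate(features):  return all(f["f"] >= 0 for f in features)

    # Monolithic pipeline: plain function calls, no handoff cost.
    print(validate(featurise(clean(collect()))))

    # Stage-per-microservice pipeline: every handoff becomes a
    # serialize -> network call -> deserialize round trip (mocked here
    # with json.dumps/loads standing in for the wire).
    def handoff(payload):
        return json.loads(json.dumps(payload))

    rows = handoff(collect())
    rows = handoff(clean(rows))
    feats = handoff(featurise(rows))
    print(validate(feats))

Whether that handoff tax is worth paying for independent scaling and ownership of each stage is exactly the design question Gupta is putting to the community.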

Capturing specific demands

Once again, the question of fulfilling real business needs raises its head – and of developers and engineers thinking more like business assets.

Talvasto added:

Cloud-native is a really good starting-point platform for developing AI and ML applications and models. At the same time, Kubernetes and cloud-native have developed for a wide range of use cases. But now, with the rise of AI and ML, that is becoming a much bigger use case.

So, it’s about building and making sure that cloud-native actually responds to those specific needs and scenarios that are unique to AI and ML workloads. There’s a lot that needs to be done.

She also made a simple but important point about cloud-native tools and applications:

Cloud-native was built for CPUs rather than GPUs, so we do have to start considering how we actually meet those specific scenarios and requirements.

And it goes all the way across everything, given the data-centric nature of ML and AI, because ML and AI development are different from traditional engineering. So, the cloud-native ecosystem and community must learn how to adapt to those changes. We need to really stand up to that challenge as well.

Oracle’s Raghavan added:

It’s about how to make Kubernetes more the default for all AI and ML workloads, and not having the data scientists and data engineers think about how to configure it, and how to use whatever hardware – the GPUs – most efficiently.

And being very cost-effective about it, with all the scheduling solutions, and not making it so difficult to configure on a per-workload basis. Out-of-the-box templates – understanding that these AI workloads will have predefined templates – so any data scientist that wants to do an experiment doesn't have to go through the journey of learning it on their own.
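For a rough sense of what such an out-of-the-box template could look like – a sketch only, using the standard Kubernetes Python client and the conventional nvidia.com/gpu resource name; the template function, image, and namespace are hypothetical – the aim is that a data scientist asks for "a training pod with one GPU" and never touches the scheduling details.

    # Sketch of a reusable "GPU training pod" template, assuming the official
    # kubernetes Python client and a cluster running the NVIDIA device plugin.
    from kubernetes import client, config

    def gpu_training_pod(name: str, image: str, command: list, gpus: int = 1) -> client.V1Pod:
        """Hypothetical template: hides resource and scheduling details from the user."""
        container = client.V1Container(
            name=name,
            image=image,
            command=command,
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": str(gpus)}),
        )
        spec = client.V1PodSpec(containers=[container], restart_policy="Never")
        return client.V1Pod(
            api_version="v1",
            kind="Pod",
            metadata=client.V1ObjectMeta(name=name),
            spec=spec,
        )

    if __name__ == "__main__":
        config.load_kube_config()  # or load_incluster_config() when run inside the cluster
        pod = gpu_training_pod("experiment-1", "my-registry/train:latest", ["python", "train.py"])
        client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

The point is not this particular manifest, but that the GPU request, restart policy, and scheduling hints live in the template rather than in the data scientist's head.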

Huawei’s Ren observed:

For Generative AI or LLM applications, we need to capture their specific demands. Maybe they need extremely high bandwidth on the backbone network, and also extremely high resource usage and efficiency. And to treat the GPU as heterogeneous compute power equally and agnostically. We also need to really transform our community to capture those new demands.

At this point, Microsoft’s Evenson struck a pragmatic note, saying:

What was said on the conference stage yesterday is that Kubernetes is powering the largest AI workloads today. Mistral said it, the folks at Ollama said it. But I want to dig into that a little bit.

Kubernetes is the platform that's out there that's ubiquitous. It's incredibly flexible. And we've been able to make it understand how to run this new workload, but it wasn't designed to run it when it was built. So, that speaks to not only this community, but also to the extensibility that we built into that platform through this community.

But in saying that, it was not easy. You need to be one of those big companies. We need to make it easier to run for everybody. We need to provide open-source alternatives and open-source platforms, so that companies that are looking to start investing and understanding how Gen-AI can impact their business can take models and not worry about data, governance, or security concerns. And start to tinker with them on their local environments and get familiar with them. And from there start to build solutions.

He added:

Each innovation cycle is much more rapid than the previous one, because we leverage the tools and platforms we built in the last generation. But we still have a lot of work to do to make the hardware, software, scalability, schedulability, and fault tolerance much better. That will come with time.

But I do want to be clear. I feel like we are talking about this as some future state, but people are running it today, in production, in large quantities.
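Evenson's point about tinkering locally is already cheap to try. As a small sketch – assuming a locally installed Ollama server on its default port and an already-pulled model; the model name and prompt here are just examples – a few lines of Python against its local REST API keep the experiment entirely on your own machine:

    # Minimal local experiment against a locally running Ollama server
    # (default port 11434); no data leaves the machine.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Summarise Kubernetes in one sentence.", "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])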

My take

A perfect note on which to end diginomica’s in-depth coverage from KubeCon + CloudNativeCon in Paris. And now for me: the enormous baguette, then the Eurostar. N’est-ce pas?
