KubeCon + CloudNativeCon - seize the moment to develop open data sets as proprietary walls rise, urges Linux chief

Chris Middleton, March 21, 2024
A powerful presentation on open data from the Linux Foundation’s chief had a sting in its tail for its Paris audience.

With Kubernetes (and KubeCon) turning 10 this year, Linux is like the “old man in the room” at 33, said Jim Zemlin, Executive Director of the Linux Foundation. 

Speaking away from the main stage at KubeCon + CloudNativeCon in Paris, Zemlin hailed it as the biggest open-source event in the world, and its host, the Cloud Native Computing Foundation (CNCF), as second in size only to the Linux Foundation, of which it is a member.

Backslapping aside, Zemlin’s theme was the role of open source in generative AI, with the two technologies presented almost as conjoined twins or entangled particles at the event so far. In this regard, the community is doing well as an enabler, but should think much bigger and be more ambitious on data, he said. More on that in a moment.

Zemlin started by praising the open-source and cloud-native family:

I think the work that's going on in CNCF around enabling Generative AI, around abstracting a lot of the complexity of the hardware away from AI developers and the people building the models, is just incredible. 

But it might be easier to think about the role of open source more broadly in Generative AI by looking at it from a whole-stack point of view – starting with the CPU and going all the way to the top, to the data, and what is going on in open source at each layer.

This is where challenges begin to emerge, he suggested, picking up a subtext from the event as a whole:

At the CPU layer, we definitely see a lot of concentration around NVIDIA, which is clearly the market leader. And the [NVIDIA] GTC Conference is going on in San Jose. Unfortunately, we were the largest event this week until Jensen [Huang, NVIDIA CEO] decided to do his in the same week – and that is so much bigger, and desertedly [sic] so.

Holy Freudian slip! But Zemlin continued:

They're doing amazing work, enabling the creation of these frontier Large Language Models. But there is some work to do in the open at that layer. And one of the things we're doing at the Foundation is something called the Unified Acceleration Foundation. 

Right now, most of the AI workloads are enabled by CUDA [NVIDIA’s CUDA-X deep learning stack]. And a lot of folks in the industry, like Arm, Qualcomm, Google Cloud, VMware, and others, decided they would get together and create an API abstraction standard above that – CUDA could still take advantage of it – to provide a little more GPU-accelerated computing choice. 

That's a good role for open source. It's something that supports a much broader set of innovation. […] I think that market will balance out over time. Right, if you take it up one more layer, to the infrastructure layer, which is where y'all live... But I don't really need to talk about the role of open source there. It's pretty much all open source!

Certainly, the scarcity, expense, and complexity of the GPU layer have been a recurring theme of the event so far, with a number of presentations exploring how to maximize, share, and make more efficient use of scant computing resources.

With a market cap of $2.2 trillion, FY net income of nearly $30 billion (up from $4.3 billion in the year before), and the gaming sector to service too, NVIDIA is not quite at the fossil-fuel stage of squeezing profits out of scarce product, but murmurings of unrest are there if you listen.

Zemlin continued:

If you then take it up one more layer to the development and tooling layer that you use to build Large Language Models, it’s the same story: pretty much all open source. 

But if we take it one more layer up to the foundation models themselves, and particularly to the development of frontier models, you have a mix of open and closed, with OpenAI being the most advanced frontier foundation model at present.

But open-source foundation models like Mistral and Llama are really nipping at their heels. And with many more to come, I might add, meeting that same level of performance.

Intentionally open, high quality data sets

However, among the other key challenges of the AI age – or, at least, of the hype cycle surrounding the popularity of tools like ChatGPT, Midjourney, and others – are safety, security, privacy, deep fakes, and the growing storm over large-scale copyright infringement. 

A report due for publication next week is thought to reveal that more and more enterprises are alarmed by the regulatory risks of some vendors’ actions, and are pausing their projects. What solutions might the open-source and cloud-native community offer? Zemlin said:

If you look at some of the things around LLMs, around AI safety and security, this is an area where we're seeing some good starts. But there is a real opportunity for open source to do more.

Let's take the immediate risks around things like content provenance and non-consensual sexual imagery [including deep fakes]. Sometimes the answer to problems in tech is more tech, though a lot of people are sceptical about that. But in this case, I think it's true. 

There's an opportunity for the open-source community to build tools that help with tracking content’s provenance, with unlearning [teaching AIs to forget], to catch AI safety problems, and which can work with security around prompt-hacking, and so on. We're already seeing some of these in the Linux Foundation’s AI and big data project around intersectional bias, plus other tools for AI safety.

We're also supporting a standard called C2PA – this is the Coalition for Content Provenance and Authenticity. It's creating an immutable digital watermarking technology that can track data through the generative AI supply chain. And we're seeing these in Sony and Leica cameras at the source of content creation.

Good news. So, what about the very highest level of the stack, the data itself? (Kudos, as ever, to this community for seeing data as the apex.) Zemlin said:

Even with the largest and best-performing LLMs like Llama 2, Mistral, and so on, the data sets that were used to train those are not open. So, what we do need are open data sets so that the open-source community can build foundation models using open data. 

There are some good things out there: the Allen Institute is doing good work, Common Crawl is doing good work, but there are clouds looming in this area.

The first risk is that the data walls are starting to go up. Companies like Stack Overflow and others are starting to say, ‘Hey, you can't trawl our data unless you pay a licence fee.’ 

So, the question is, will the open community have a data set that is as good as some of these closed-model makers? One that we can all use to better understand this ecosystem, how these models work, and so forth. You have to know the data in order to truly understand how these models function.

So, what’s the answer? He said:

This is where the open community could think even bigger by pooling resources to create intentionally open, high-quality data sets to compete with a lot of the closed data sets that are coming out, in order to provide researchers and the people who create these LLMs with better data. 

The Linux Foundation is putting its toe in the water in this area. We have an effort called the Overture Maps Foundation, started by Amazon, Meta, Microsoft, and TomTom. We raised about $30 million to create a large geospatial mapping data set that is intentionally open, under a licence that the Linux Foundation also created: the Community Data Licence Agreement. 

Just like open-source licences, this is a licence specifically for data, where you have both a permissive Apache-style licence and a restrictive GPL-style licence that creates an easy intellectual property regime for many-to-many data sharing.

He added:

So, on the open data front, we're not thinking nearly big enough. And there are opportunities to bring a coalition of the willing together to go out there and develop huge, high-quality data sets that are intentionally open. 

That’s so we can continue the kind of open innovation that we're seeing in generative AI, at every single layer of the stack.

My take

A strong presentation from the Linux Foundation’s chief – and one with a sting in its tail. The subtext is clear: the AI community as a whole may be beginning to pay the price for some vendors’ decision to scrape the pre-2021 internet for training data. The effect of that has been to alarm many data holders, who are beginning to put up walls for the future. Working collaboratively on open data sets will be part of the solution, if the community seizes the day.
