
How NASA is using AI to search its own data universe

Chris Middleton, August 30, 2023
A case study on how to apply AI sensibly to trusted data, opening new pathways to the stars.


NASA is using AI to unlock the mysteries of the universe, said a communication that pinged into my inbox like a message from a distant star. In fact, it came from intelligent search platform Sinequa – named from the Latin expression ‘sine qua non’ (without which, not), which has come to mean ‘without x, something else is impossible’.

‘Without intelligence, search is impossible’ would appear to be encoded in the company’s name, therefore. But what about the message concerning NASA?

The US space agency is doubtless using AI to explore the universe from its bases on Earth – and, of course, via its telescopes, satellites, space stations, probes, and landers, not to mention the rovers trundling across Mars. But as far as Sinequa is concerned, NASA is really using AI to unlock the mysteries in its own existing data.

With seven main operating centers, nine research facilities, and – not including private and government contractors – 18,000 staff (200,000 fewer than helped Armstrong and Aldrin land on the moon in 1969), NASA is a big operation. And that means Big Data (remember that?), spread across reams of different repositories, websites, and archives.

So, NASA’s Science Discovery Engine (SDE), built on Sinequa’s AI-infused intelligent search platform, is helping researchers discover the mass of data in its own bounded universe – and in others too. In total, some 84,000 datasets and over 715,000 documents, located across 128 different sources, including websites and archives. Zettabytes of data, no doubt: after all, space is the biggest thing there is.

According to Sinequa, the SDE has also been integrated with over 44,500 scientific applications, models, and tools, and is capable of understanding over 8,900 different scientific terms – and counting.

Its five core areas are: astrophysics, biological and physical sciences, Earth science, heliophysics (the study of the sun), and planetary science. Impressive stuff, and clearly research data that is more than just rocket science.

Ulf Zetterberg is the company’s co-CEO (with Chairman Alexandre Bilger). He explains that NASA has designated 2023 ‘A Year of Open Science’ to inspire scientists, researchers, and other interested parties to see the benefits of open-source technology, and of collaborative, open-science practices. He says:

The SDE makes NASA’s open-science data, software, and information more discoverable and accessible to a wider audience. This openness promotes interdisciplinary science, encourages innovation, and fosters collaboration within and across disciplines. This is a major, ongoing initiative for NASA. And it's an initiative of the Biden administration to support this open-source science infrastructure to make our data more available and easier to use.

So, the SDE is using intelligence to pull together disparate data into one place, so NASA can start connecting its own dots, as it were? Zetterberg says: 

Yes, that’s the foundation. But it has also allowed research outside of NASA’s domain, because ‘they don't know what they don't know’, so to speak. So, it's a way to make all scientific data easily accessible and understandable. We have also tried to structure it in a way that makes it easier and more intuitive for non-domain experts to trawl through and make sense of it.

Of course, this was the impetus behind Sir Tim Berners-Lee’s conception of the World Wide Web at CERN: to make academic data easier to find and share. But the big bang that it caused created an ever-expanding universe of trillions of data points.

So, for years traditional search has been like pointing a tiny, handheld telescope at the cosmos and trying to see everything from it. The result has been that trustworthy information has got further and further away from us, lost in the background radiation of hype, relentless advertising, and companies’ attempts to game search algorithms.

Zetterberg continues:

The plan is even to promote the SDE to schools. So ultimately, it can be used by anyone from middle schools right the way up to the most advanced researchers. It’s a way to consolidate everything and make it accessible, but also make it easier to consume.

Which is where AI, machine learning, and natural language processing come in. But rather than unleash, say, a tool like ChatGPT on a mass of undifferentiated data, complete with hallucination-inducing blind spots, the SDE is solely about helping researchers uncover what may be hidden in trusted, peer-reviewed sources.

So, like a number of other companies in the emerging field of enterprise AI, Sinequa’s work is about unlocking hidden truths in bounded, verified data, rather than an AI system that pretends to know everything about everything. Zetterberg explains:

Many search engines that are for academic research force you to be extremely specific in your queries, which means you have to know what you're looking for in the first place. So, this is a way to make it easier and more intuitive. NLP allows the system to interpret both the question as well as the data. As in, ‘This is probably what you are looking for’.


In short, this is not generative AI. The SDE is not a system that generates answers based on masses of human data scraped off the Web. It is more about understanding the question and finding data pulled directly from scientific observation. That said, Zetterberg acknowledges that, at some point in the future, generative AI could be used to summarize that bounded data.

He adds:

The problem for NASA back in 2017-18 when they announced this open science initiative was, first, untying all the knots internally to make all the different resources from scientists inside NASA more accessible. But today, it’s very much an ongoing, expanding thing. They want to talk to other agencies about how to capture data. And other departments are custodians of different types of scientific data. It’s very much part of a larger initiative in the scientific community: to democratize data, to make it more easily accessible for scientists. And to be able to go outside your own domain and learn.

So, the initiative is about finding answers that NASA already knows, which may be buried in some obscure archive (‘we don’t know what we do know’, perhaps). But by pooling all this data, might the result be new answers? Findings that come from enabling new connections between different data types? Zetterberg says:

Absolutely. Think about our times. Everything about climate change, for example. There is lots of climate change research that can take giant leaps forward. Data is empowering, and knowledge is empowering. But before this, it was so time-consuming. So, if you can look at open science from a bigger perspective, you can tie into other agencies and other institutions. And it will empower and speed up research. And hopefully, on climate change, it will create trust: explain that this is actual data that we have captured over the last 20 years.

But is Zetterberg worried about the hype-driven rush to adopt – and trust – the likes of ChatGPT? Is he concerned that people playing with free, cloud-based tools within organizations (rather than the strategic use of enterprise AI on trusted data) is creating an equal but opposite force in research? The lazy use of Web-scraped data, whipped into a confection by generative toys?

Generative AI is here to stay. But there is a growing sobriety, I think, in large organizations. A realisation that the datasets [that ChatGPT and others have been trained on] could be weak. Private consumption is great, but in the corporate world, institutions are already much more careful. They see the power in it, but they're not jumping straight in without fully understanding the consequences. Because for most organizations today, data is their business. They don’t want to just hand it over to ChatGPT for free, right?

My take

A great initiative. After all, small steps and giant leaps have long been what NASA is about. And it’s a good model for AI adoption – certainly much better than giant steps forward followed by giant leaps backwards.
