Elsevier sees promise in small language models and graph data

George Lawton, March 24, 2023
Summary:
Taking a more academic look at the Large Language Models hype...

While much of the tech press is driving hype around large language models like ChatGPT, the academic press is making a more measured bet on small language models and graph-connected data. Academic researchers are less forgiving than consumers of issues like hallucinations and misinformation creeping into their narratives. 

At least, that’s the contention of Erik M. Schwartz, VP of Product Management for Knowledge Discovery at Elsevier. His team has been leading efforts to transform tens of millions of scientific papers and institutional data into a linked semantic web on top of a Neo4j graph database. For the moment, they are leaving it up to academics to write their own summaries, which is vital since publishers are starting to prohibit input from the likes of ChatGPT. 

This kind of measured approach lets them grow revenues while respecting the cornerstone of trust that underpins Elsevier’s lead in the nearly $30 billion academic publishing industry. It is also helping Elsevier transform from a publisher of academic material into a data and analytics company. Schwartz explains:  

Knowledge discovery and connected graphs help us accelerate the dissemination of information. They also help us solve new use cases. It’s about getting the information out there in a way that is easier for end users to engage with, helping them find connected information, and helping to build new experiences.

Every year the service handles about three-quarters of a trillion queries. Increasingly, these involve structured queries that surface over four billion relationships between entities and metadata associated with papers. Schwartz said they decided to build it on Neo4j because it was the best solution for handling their structured query problem at scale: 

It wasn't just a one-time solution that solves structured queries, it was a way for us to move into the future and solve new use cases that we hadn't anticipated.
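
To make the structured-query idea concrete, here is a minimal sketch of the kind of question such a graph can answer, written in Python against the official Neo4j driver. The node labels, relationship types, connection details and topic are hypothetical placeholders; Elsevier’s actual graph model is not public.

```python
# A minimal sketch of a structured query over a scholarly knowledge graph,
# using the official Neo4j Python driver (v5+). The schema, credentials and
# topic are all hypothetical placeholders.
from neo4j import GraphDatabase

URI = "neo4j://localhost:7687"   # assumed local instance
AUTH = ("neo4j", "password")     # assumed credentials

# Which institutions are driving research on a topic, and who is funding it?
CYPHER = """
MATCH (p:Paper)-[:ABOUT]->(:Topic {name: $topic})
MATCH (p)<-[:AUTHORED]-(r:Researcher)-[:AFFILIATED_WITH]->(i:Institution)
OPTIONAL MATCH (p)<-[:FUNDED]-(f:Funder)
RETURN i.name AS institution, f.name AS funder, count(DISTINCT p) AS papers
ORDER BY papers DESC
LIMIT 25
"""

def institutions_for_topic(topic: str):
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        records, _, _ = driver.execute_query(CYPHER, topic=topic, database_="neo4j")
        return [r.data() for r in records]

if __name__ == "__main__":
    for row in institutions_for_topic("stress fractures"):
        print(row)
```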

What’s old is new again

The idea of linked data goes back to the early days of the web when Tim Berners-Lee laid out a vision for read-write browsers built on powerful NeXT Workstations. However, the first Mosaic browser that ushered in the popularity of the web was only optimized for consuming content from static web pages. Berners-Lee revisited the idea in the mid-2000s with the notion of a semantic web, which led to work on linked data specifications. Berners-Lee called this Web 3.0, which he once told me is utterly different from the recent Bitcoin-fueled Web3 hype.

The tech got some traction among web search engines for linking limited content that could be pulled into information cards. Query a movie on Google, and you can automatically get a list of the cast, ratings on IMDB, nearby theaters playing it, and showtimes. Linked data has done wonders for engagement when applied to specific types of data within a web page. Google claims Rotten Tomatoes increased click-through by 25%, the Food Network increased visits by 35%, and Nestlé increased click-through by 82% after adding structured data to its site. A similar approach to automatically structuring data could drive additional monetization opportunities. 
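
For context, this is roughly the kind of schema.org structured data a movie page embeds as JSON-LD so that a search engine can assemble one of those information cards; the film and its values below are purely illustrative.

```python
# Roughly the kind of schema.org JSON-LD a movie page embeds so that a search
# engine can assemble an information card. The film and values are purely
# illustrative; real pages serve this inside a <script type="application/ld+json"> tag.
import json

movie_card = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Example Film",
    "datePublished": "2023-03-24",
    "director": {"@type": "Person", "name": "Jane Doe"},
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "7.8",
        "ratingCount": 1245,
    },
}

print(json.dumps(movie_card, indent=2))
```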

The problem has been that organizing linked data for these cards takes a lot of work. It either requires a new process for content producers or a lot of manual effort to organize the linked data after the fact. This is where a new generation of knowledge discovery tools and graph databases could automate the process, according to Schwartz:

Making this data available via search engines via a structured search or recommendation system helps our internal teams drive new monetizable innovations. So how do we come up with new products and services that we can offer to the market? How do we create new metrics? How do we create new indicators that are valuable to both researchers as well as research leaders that need to measure their performance and compare themselves to others?

The new search stack

Schwartz is enhancing the underlying platform for three fundamental types of knowledge discovery:

  • Search engine supports lexical search using traditional keywords as well as vector search. The latter is an AI-based search built around small language models optimized for understanding scientific and technical content. This can help it disambiguate a query about stress fractures from one about emotional stress or physical material stress (a brief sketch of this kind of vector search follows this list). 
  • Structured search is where the graph database comes into play. These kinds of searches start with an identifier like an academic institution, a researcher, or a particular line of research. It helps searchers tease apart what institutions are driving research and what organizations are funding it and identifies potential conflicts of interest. 
  • Recommenders personalize content recommendations for researchers and institutions. 
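
Here is the brief vector search sketch promised above. It assumes the open-source sentence-transformers library and a compact general-purpose encoder rather than Elsevier’s domain-tuned models, which are not public; the example documents are invented to show how embeddings separate the different senses of the word stress.

```python
# A brief sketch of vector search with a small language model, assuming the
# open-source sentence-transformers library and a compact general-purpose
# encoder. The documents are invented for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small (~80 MB) encoder

documents = [
    "Tibial stress fractures in long-distance runners: incidence and recovery.",
    "Workplace stress and burnout among early-career researchers.",
    "Residual stress and fatigue cracking in welded steel joints.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

query = "stress fracture rehabilitation"
query_embedding = model.encode(query, normalize_embeddings=True)

# Cosine similarity should rank the orthopedic paper above the psychological
# and materials-science senses of "stress".
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```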

Elsevier is exploring how improving these types of knowledge discovery can enhance the value it brings to researchers and institutions above and beyond simply selling subscriptions. For researchers, it has developed a service called Scopus that helps them perform knowledge queries across 90 million articles published and indexed on ScienceDirect. On the administrative side, SciVal allows managers to zero in on how an academic institution compares to peers regarding papers, citations, impact, funding, patents, Twitter mentions or other metrics. They can also measure this globally or in specific scientific or technical domains. 

Schwartz’s team has been automating a new final step in the publishing process that captures, matches, enriches, links and joins data into the graph database. This creates a runtime layer of current knowledge for users to search and explore. Users can build new experiences on top of this linked data. They can also solve new problems, like identifying potential conflicts of interest when finding peer reviewers. 
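
As a rough illustration of that capture-match-enrich-link step, the sketch below merges an already-parsed metadata record into a graph so that entities are matched rather than duplicated. The schema, property keys and identifiers are hypothetical, not Elsevier’s actual pipeline.

```python
# A rough illustration of the capture-match-enrich-link step: merge an
# already-parsed metadata record into the graph so entities are matched
# rather than duplicated. Labels, property keys and identifiers are
# hypothetical placeholders.
from neo4j import GraphDatabase

LINK_PAPER = """
MERGE (p:Paper {doi: $doi})
  SET p.title = $title
WITH p
UNWIND $authors AS author
MERGE (r:Researcher {orcid: author.orcid})
  SET r.name = author.name
MERGE (r)-[:AUTHORED]->(p)
MERGE (i:Institution {ror: author.ror})
MERGE (r)-[:AFFILIATED_WITH]->(i)
"""

record = {
    "doi": "10.1234/example.5678",  # illustrative DOI
    "title": "An example paper",
    "authors": [
        {"orcid": "0000-0002-1825-0097", "name": "A. Researcher", "ror": "https://ror.org/012345678"},
    ],
}

with GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password")) as driver:
    driver.execute_query(LINK_PAPER, **record, database_="neo4j")
```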

Elsevier is also launching a new author profile service, optimized for search engines, that builds linked data cards summarizing information about researchers. The company is constantly looking for ways these new knowledge discovery capabilities can enhance the user experience, drive brand loyalty, or open new revenue-generating opportunities. Schwartz explains: 

We as a business are really trying to figure out what those new offerings are going to be. If you think about where Elsevier sits, not only are we a publisher, but we help the researcher across their entire lifecycle with everything from when they want to discover what’s going on in the world, to staying up to date in their particular field, to publishing and submitting their work, to then assessing themselves and understanding how they place in the world.

Some of the places where we’re applying these techniques today are helping us get a better understanding of experts in the field. So an example is when I’m in the submission process and I’m looking for someone to peer review my paper: who are the best candidates out there in the world that are available to peer review my paper and give valid feedback about the science that I’ve claimed? 

We’re thinking about how we can apply these tools to that discovery process to modernize that experience and accelerate researchers’ engagement.
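
A hedged sketch of what that reviewer-discovery step might look like as a graph query: find researchers active on the submission’s topics who have no co-authorship link to its authors. Again, the labels and relationship types are invented for illustration.

```python
# A hedged sketch of reviewer discovery with a basic conflict-of-interest
# check: candidates who work on the submission's topics but have never
# co-authored with any of its authors. Schema is invented for illustration.
from neo4j import GraphDatabase

FIND_REVIEWERS = """
MATCH (submission:Paper {doi: $doi})-[:ABOUT]->(t:Topic)
MATCH (candidate:Researcher)-[:AUTHORED]->(:Paper)-[:ABOUT]->(t)
WHERE NOT (candidate)-[:AUTHORED]->(submission)
  AND NOT EXISTS {
    (candidate)-[:AUTHORED]->(:Paper)<-[:AUTHORED]-(:Researcher)-[:AUTHORED]->(submission)
  }
RETURN candidate.name AS reviewer, count(DISTINCT t) AS shared_topics
ORDER BY shared_topics DESC
LIMIT 10
"""

with GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password")) as driver:
    records, _, _ = driver.execute_query(FIND_REVIEWERS, doi="10.1234/example.5678", database_="neo4j")
    for r in records:
        print(r["reviewer"], r["shared_topics"])
```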

Maintaining the gold standard

It’s also essential to keep the information itself pristine. Peer-reviewed journals are still the gold standard in academic research, and Elsevier needs to approach new AI tools cautiously to maintain the trust it has built up over the years. 

Comparing their approach to the latest crop of content generation tools powered by large language models, Schwartz says:  

We imagine those tools can help to do summarization. They can help to do the writing. They can provide reviews. But we haven’t incorporated any of those into our existing workflows yet, because it’s critical that we provide the utmost clarity and transparency around the work that we do and around the publishing content so that we’re not creating assertions, for example, that are in contradiction to what the author was trying to publish.

We take a very conservative approach to adding those new tools. There are a lot of tools that sit outside in the ecosystem that people are using to do content generation. However, we have not incorporated those tools yet. I think we see a future where we dramatically help improve the processing time, but we need to be very careful that those tools are providing value, and they’re accurate, and they’re repeatable so that we can produce reliable results.

Neo4j itself is banking on a growing role for graph databases to power an explosion of new use cases for the next generation of AI. Neo4j CEO Emil Eifrem explains: 

The next stage is graph data and machine learning, where you use the graph as a part of your machine learning pipeline to build machine learning systems. 

If the world is becoming more and more connected, and data is describing the world, then data is becoming increasingly connected, which places this huge amount of tension against data systems that are not built to be able to operate on relationships. And this is what is leading to this explosion of use cases.
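
As a minimal sketch of the pattern Eifrem is pointing at, the toy example below derives degree and PageRank features from a tiny, made-up citation graph and feeds them to a conventional classifier; a production system would compute such features inside the graph database itself.

```python
# A minimal sketch of using graph structure as features in a machine learning
# pipeline. The citation graph and labels are made up for illustration.
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy citation graph: an edge means "paper cites paper".
G = nx.DiGraph([("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p3")])

pagerank = nx.pagerank(G)
features = np.array([[G.in_degree(n), G.out_degree(n), pagerank[n]] for n in G.nodes])

# Hypothetical labels, e.g. "was highly cited within five years".
labels = np.array([0, 1, 1, 0])

model = LogisticRegression().fit(features, labels)
print(model.predict(features))
```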

My take

A rush to put generative AI behind everything has the potential to create new risks and liabilities for existing businesses. At the same time, early work on the semantic web was hampered by a complicated and inefficient process that essentially limited its use to the search cards you pull up on Google. 

A more measured approach could use these new AI technologies to structure enterprise data into curated semantic webs. This will improve search and analysis while providing a trail of breadcrumbs back to the source to ensure transparency and explainability. 

Graph data promises a more structured approach to organizing raw data streams. Traditionally, this has required an additional step after operational data has been collected into high-performance databases. The tools for ingesting and querying graph data are getting faster, which should ease the onramp for new applications. 
