How Frontiers Media is empowering scientists with de-centralized data products
Summary:
Academic publishing specialist Frontiers Media is using Confluent technology to ensure its customers have access to the data they need in a flexible and effective manner.
Frontiers Media has adopted a de-centralized approach to data processing, built on Confluent technology, that helps scientific researchers collaborate and innovate.
Luca Dalla Valle, Technical Lead at Frontiers Media, the sixth-largest research publisher in the world, and his team build data-processing solutions for the business. After a period of investigation, they decided that a de-centralized approach based on data mesh principles was the best way to develop data products:
By promoting this organizational shift to de-centralization, we believe that we are going to improve our company's data ecosystem and support future business growth. Our journey is far from over. The road is steep, but we believe we have the right foundations in place, so we feel prepared.
Frontiers’ mission is to help scientists collaborate more effectively and to innovate faster. As part of this effort, the organization wants to provide scientific researchers with the best open-science platform, using the latest technologies. On the adoption of AI, Dalla Valle says:
As an example, we make extensive use of our AI assistant in our products, which gives them seemingly magical properties. Training AI requires lots and lots of datasets.
Outlining the problem
Frontiers’ data collection process was originally based on a data lake, which provided a centralized aggregation point for all the data from the company’s applications. The company’s customers require fresh, high-quality data. However, Frontiers’ applications mostly used relational databases with normalized schemas, according to Dalla Valle:
We thought we could leverage this centralized architecture and put a set of Kafka clusters on top of it. We started streaming changes and events to Kafka straightaway. But those event streams were normalized because they were replicating the database structure of the source data lake, which in turn was replicating the relational database structure of the source applications.
The end result was that the data streams produced “little value” for customers. Anyone looking to use the data would have to undertake complex aggregations to extract meaning from the information. His team knew they required a different approach:
We thought, ‘Why don’t we do it ourselves? Why don’t we provide this service to our consumers? Why don’t we make life easier for our present and future users? Why don’t we implement this logic in a single place, so everyone doesn’t have to do their own data-processing work?’
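To make the problem concrete, here is a minimal sketch of the kind of aggregation work that normalized change events pushed onto every consumer, and that the team initially set out to do once, in one place. The entity and field names are hypothetical, not Frontiers’ actual schema.

```python
# Hypothetical normalized change events, each mirroring a row in a source table.
article_event = {"article_id": 42, "title": "Example study", "journal_id": 7}
journal_event = {"journal_id": 7, "name": "Example journal"}
author_events = [
    {"author_id": 1, "article_id": 42, "name": "A. Researcher"},
    {"author_id": 2, "article_id": 42, "name": "B. Collaborator"},
]

# Every consumer had to repeat this kind of join to get a usable record.
denormalized_event = {
    "article_id": article_event["article_id"],
    "title": article_event["title"],
    "journal": journal_event["name"],
    "authors": [a["name"] for a in author_events
                if a["article_id"] == article_event["article_id"]],
}
print(denormalized_event)
```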
The company recognized it would need stream-processing infrastructure, electing to use the ksqlDB streaming database. However, while this technology provided an answer to the team’s immediate challenges, it didn’t prove an effective long-term solution, Dalla Valle recalls:
The more we started implementing this architecture, and the more we started expanding our use cases, the more we also became aware of drawbacks.
Overcoming data challenges
There were a number of key problems with this stream-processing infrastructure, according to Dalla Valle. Ownership of the data solution was split between teams, which made it difficult to resolve technical issues. While his team introduced workarounds, these spot fixes impacted data quality. To make matters worse, the team lacked deep domain knowledge of the data it was handling:
The bottom line is that it started to become very prohibitive for us to implement all these solutions. And we started to become a bottleneck. So, we sat down and decided something needed to change. And all of a sudden, we had a realization.
Dalla Valle and his team remembered the data mesh concept and saw the potential to switch from a centralized to a de-centralized architecture. This switch meant transferring ownership of the event-streaming solutions to the source application teams:
Pushing the responsibility to these teams meant they could leverage their deep domain knowledge on data to implement higher quality solutions and minimize communication overheads.
To make this transition, Frontiers Media needed to translate these principles into a new way of working at a practical level. It decided to adopt the outbox table pattern to push events reliably, using Confluent’s Change Data Capture (CDC) Connector to move those outbox messages into Apache Kafka:
The big reason we chose this approach was that it had a lower barrier of entry for the source application teams because they’re using tools that they’re already familiar with. By adopting this approach, we increased flexibility because we weren't forced to go fully normalized. We can now create fully de-normalized, or even partially de-normalized, events that are based on the use case.
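As a rough illustration of how the outbox table pattern fits together, the sketch below writes a business row and a de-normalized event to an outbox table in a single transaction; a CDC connector would then stream the outbox rows into Kafka. It uses sqlite3 purely to keep the example self-contained, and the table and field names are hypothetical rather than Frontiers’ actual schema.

```python
import json
import sqlite3

# In-memory database stands in for the source application's relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "aggregate_id INTEGER, event_type TEXT, payload TEXT)"
)

def submit_article(article_id: int, title: str, authors: list[str]) -> None:
    """Write the business row and a de-normalized outbox event in one transaction."""
    payload = json.dumps({"article_id": article_id, "title": title, "authors": authors})
    with conn:  # both inserts commit together, or neither does
        conn.execute(
            "INSERT INTO articles (id, title, status) VALUES (?, ?, 'submitted')",
            (article_id, title),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (article_id, "ArticleSubmitted", payload),
        )

submit_article(42, "Example study", ["A. Researcher", "B. Collaborator"])
print(conn.execute("SELECT event_type, payload FROM outbox").fetchall())
```

Because the business row and the event commit or roll back together, the connector never publishes an event for a write that didn’t happen, which is what makes the delivery reliable.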
Delivering business benefits
Dalla Valle says Frontiers’ journey towards Confluent-enabled de-centralization has had a profound impact on his team:
We were originally specialized in building stream-processing solutions. Now, we are a team of enablers for other teams. We sit down with the various source application teams and understand their pain points in developing stream-processing solutions and develop shared tooling to solve those pain points.
This shared tooling has been supported by the creation of company-wide practices for key areas, such as governance policies and common configuration. Dalla Valle’s team is also working to transfer knowledge across the organization:
We are more involved in sharing the knowledge that we learn when we’re building stream-processing applications, and we organize training workshops and developer documentation to facilitate this knowledge transfer.
Dalla Valle recognizes the move towards de-centralization comes with a unique set of challenges, and the journey is far from over. Data visibility is one key challenge, he says, and his team is working to help the business make better use of its information assets:
This de-centralized architecture means data sets that used to be centralized in a single place are now spread across different teams in the organization. So, my team and I are introducing what we call a search engine for datasets, or as it's commonly referred to, a data catalogue.
Yet significant progress has already been made, and Dalla Valle says the long-term aim is to give internal users the tools they need to exploit data effectively:
We are confident that by pursuing these initiatives we will drive the adoption of the data mesh even further in Frontiers.