Top three use cases for streaming data pipelines - what, how, why
Summary: Using workarounds to pipe data between systems carries a high price and produces untrustworthy data. Bharath Chari shares three possible solutions, backed by real-world use cases, to get streaming data pipelines flowing.
Today’s enterprises need to get more out of their existing data infrastructure and enable teams to quickly discover, understand and apply data assets to power real-time use cases. The modern data stack and its resulting mess of point-to-point connections aren’t cutting it.
The tools commonly used to pipe data between operational and analytical systems in the modern data stack — ETL, ELT, reverse ETL — are stop-gap solutions that, combined with a broken, untrustworthy architecture, create a chaotic and unscalable data foundation.
As a quick refresher, there are five key challenges with today’s data integration strategy:
- Batch-based pipelines deliver stale, low-fidelity data
- Centralized data and a centralized data team create bottlenecks
- A patchwork of point-to-point connections results in immature governance and observability
- Infrastructure-heavy data processing adds to the total cost of ownership (TCO) of data pipelines
- Rigid, monolithic design makes changing existing pipelines challenging, which leads to increased pipeline sprawl
Streaming data pipelines solve these challenges by breaking down data silos and providing governed real-time data flows — getting data to the right place, in the right format, and making it easy for different teams to produce, share and consume self-service data products.
In this article, we’ll share three main use cases commonly implemented when organizations take this approach to building pipelines.
Use case 1 – Cloud database pipelines
Amid the rapid adoption of cloud computing and massive cloud investments, an ad-hoc approach has resulted in decentralized teams creating distributed islands of data estates, unintentionally locking away the business value of data in silos and self-managed, on-prem databases. This has not translated well in the modern era, when many teams need access to real-time data for building new cloud applications and services.
The solution is to use fully managed connectors to build streaming pipelines from on-prem, hybrid and multicloud data sources to cloud-native databases. Streaming pipelines connect, process, govern and share data across heterogeneous sources so you can curate data in real time for cloud databases such as MongoDB Atlas, Amazon Aurora, Azure Cosmos DB and Google AlloyDB.
To sync data across databases and other systems, Change Data Capture (CDC) is a real-time, cost-efficient alternative to batch data replication. In streaming pipelines, log-based CDC continuously captures inserts, updates and deletes from a source database’s transaction log and propagates them downstream. CDC maintains data integrity, ensures zero data loss or duplicates, and delivers high-fidelity data in real time with minimal impact on source performance.
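As a rough illustration of how such a pipeline might be wired up, the sketch below registers a log-based CDC source connector with a Kafka Connect worker over its REST API. It assumes a self-managed, Debezium-style Postgres connector and a Connect endpoint at localhost:8083; the connector name, hostnames, credentials and property names are placeholders that vary by connector and version, and a fully managed connector would be configured through the provider’s own interface instead.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterCdcConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector definition: a Debezium-style Postgres CDC source that
        // reads the database's write-ahead log and publishes change events to Kafka topics.
        String connectorJson = """
            {
              "name": "inventory-cdc-source",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "database.hostname": "onprem-db.internal",
                "database.port": "5432",
                "database.user": "replicator",
                "database.password": "change-me",
                "database.dbname": "inventory",
                "topic.prefix": "inventory"
              }
            }
            """;

        // Kafka Connect exposes a REST API for creating connectors.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once a connector like this is running, each insert, update and delete in the source database appears as an event on a Kafka topic (for example, inventory.public.orders), which sink connectors can replicate into cloud databases such as the ones listed above.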
Tech benefits
Cloud database pipelines eliminate data silos by providing every system and application with a consistent, up-to-date and enhanced view of the data at all times. Any data-dependent system can consume enriched data streams in real time as soon as new information becomes available. Developers gain instant access to the right data — in the right format — and can fully leverage cloud databases to build new features and capabilities faster.
Business benefits
Streaming pipelines decrease the time, risk and cost of modernization by allowing enterprises to take an incremental approach to migrating legacy on-prem databases. And using a fully managed, cloud-native data streaming platform and pre-built connectors reduces TCO. Ultimately, there’s faster innovation from the ability to power real-time apps with data streaming at scale.
A real-world example
SecurityScorecard is a global leader in cybersecurity ratings and digital forensics. Its services rely on using accurate real-time data from many sources to discover customers’ security risks.
Using streaming pipelines, SecurityScorecard streams billions of records from all of its databases that track breaches. The company also uses stream processing to enable continuous data analysis and detect security risks in milliseconds. Previously, scanning the internet across 80 ports took SecurityScorecard a month and a half; now it scans 1,400 ports in just a week and a half. Its team was able to rapidly scale, govern data better and lower operational costs.
Use case 2 – Cloud data warehouse pipelines
Organizations often use ETL pipelines to extract, transform and load large volumes of data from operational databases into on-prem data warehouses. But ETL pipeline technology is decades old, and it assumes that data has one final destination: the data warehouse. As cloud data warehouses emerged (e.g., Snowflake, Amazon Redshift, Azure Synapse, Google BigQuery), new ELT tools were built to load data into them and reverse ETL tools to push that data back into operational systems and applications for operational and BI use cases, yet all of these tools remain hyper-focused on the idea of a single, centralized data warehouse.
The solution is to use fully managed connectors to build streaming pipelines from hybrid or multicloud data sources to cloud data warehouses. Sink connectors can take the same data powering your data warehouse and instantly send it to analytics applications and other downstream systems in any environment.
Cloud data warehouse pipelines also enable you to continuously process data in flight to bring real-time, analytics-ready data to your cloud data warehouse. Then, any data-dependent system can consume the enriched data streams as soon as new information becomes available. With this model, you can govern streaming data to ensure security, compliance and data quality in the cloud while allowing for ease of sharing within and across organizations.
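As a sketch of what processing data in flight can look like in practice, the Kafka Streams application below filters and lightly cleans raw change events before writing them to a topic that a warehouse sink connector could load. The topic names, string-only payloads and transformation logic are assumptions for illustration; a real pipeline would typically deserialize Avro or JSON and join or enrich against other streams.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class OrdersInFlightProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-in-flight-processing");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Raw events written by a source connector (topic name is illustrative).
        KStream<String, String> orders = builder.stream(
                "orders.raw", Consumed.with(Serdes.String(), Serdes.String()));

        orders
                // Drop empty or malformed records before they reach the warehouse.
                .filter((key, value) -> value != null && !value.isBlank())
                // Minimal in-flight transformation; a real pipeline would parse,
                // enrich and aggregate here.
                .mapValues(String::trim)
                // A sink connector subscribed to this topic loads it into the warehouse.
                .to("orders.analytics_ready", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the enriched topic lives in the streaming platform rather than in the warehouse, the same data can simultaneously feed the cloud data warehouse through a sink connector and any other downstream system that needs it.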
Tech benefits
Streaming pipelines eliminate the operational complexity and high latency of point-to-point data pipelines and batch ETL/ELT processes. With built-in stream governance, engineers can discover, understand and trust streams going to and from their data warehouse. As a result, they can minimize time spent on data integration and transformation and instead focus on building analytics on top of ready-made data products.
Business benefits
With self-service data products, team productivity increases — this means faster time to market and a stronger competitive advantage. Businesses can also unlock real-time analytics and advanced AI/ML capabilities to make real-time decisions. And finally, a fully managed platform with pre-built connectors substantially lowers TCO.
A real-world example
Picnic is Europe’s fastest-growing online-only supermarket. To provide a lowest-price guarantee to customers, it must process over 300 million unique data events a week.
Picnic redesigned its data pipeline architecture to address scalability issues with legacy message queues. The company uses connectors to build streaming pipelines from RabbitMQ to Snowflake and Amazon S3, enabling the data science team to use self-service, real-time data for predictive analytics.
Use case 3 – Mainframe integration
Mainframes (IBM Z systems) are at the center of government agencies, financial institutions, healthcare organizations, insurance agencies and many other enterprise organizations. However, they weren’t designed to interact with today’s modern cloud-based systems and applications. Mainframe code was written decades ago in COBOL, and developing in and adjacent to that environment is an antiquated, expensive process, which is why so much data has stayed locked away in mainframes for so long.
Streaming pipelines provide a way to modernize the mainframe model, accessing mainframe data to stream it across any environment. With this, you can integrate your mainframe with various hybrid, multicloud data destinations to unlock the value of that data.
Tech benefits
By building streaming pipelines that capture and continuously stream mainframe data from IBM z/OS and Linux, you can power new cloud applications with minimal latency. Fully managed connectors link mainframe data sources such as IBM MQ, VSAM, IMS and DB2 with cloud destinations in real time.
This allows for an incremental transition from legacy, monolithic systems to an event-driven architecture. Mainframe pipelines also create a forward cache of mainframe data, which helps organizations innovate without disrupting existing workloads and allows engineers to work in a language they’re familiar with, like Java.
Stream processing allows you to continuously join, enrich and aggregate IBM zSystems data in flight, as well as share it in real time with any downstream system or application. Finally, stream governance ensures data quality, security and compliance while enabling engineers to readily discover, understand and trust available data streams.
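To make the forward-cache idea above a little more concrete, here is a minimal Java sketch using Kafka Streams: it materializes a topic fed by a mainframe source connector into a local state store, so downstream services can look up current values without sending requests back to the mainframe. The topic name mainframe.accounts, the store name and the string serdes are placeholders rather than part of any specific product.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.Properties;

public class MainframeForwardCache {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mainframe-forward-cache");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Change events from a mainframe source connector (e.g., VSAM or DB2 records keyed by
        // account number) are materialized into a local, continuously updated key-value store.
        KTable<String, String> accounts = builder.table(
                "mainframe.accounts",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("accounts-forward-cache")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));

        // Once the application is running, other services can read the cache through
        // interactive queries (streams.store(...)) instead of calling the mainframe.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Every lookup served from this cache is a request that never reaches the mainframe, which is where the MIPS and CHINIT savings described below come from.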
Business benefits
Bringing real-time access to mainframes allows teams to build new cloud-native applications and accelerate development time with ready-to-use mainframe data. It increases the return on investment (ROI) of your IBM zSystems and, by redirecting requests away from mainframes, significantly reduces MIPS and CHINIT consumption costs. Finally, future-proofing your architecture with streaming pipelines paves an incremental, low-risk path toward mainframe migration and avoids disrupting mission-critical workloads.
A real-world example
Alight Solutions is a leader in technology-enabled health, wealth and human capital management. To quickly bring new competitive digital products to market, Alight needed to integrate data from numerous back-end mainframe systems. This effort hinged on building a mainframe pipeline to reduce the number of sources that applications had to ping to get needed data.
Streaming pipelines successfully reduced Alight’s mainframe costs, accelerated delivery of new applications and increased scalability. The engineering teams now have secure, consistent access to consumer data and have freed up talent to experiment and innovate.
Why streaming data pipelines matter (and why now)
There’s a steep cost to doing nothing and maintaining the status quo with batch-based pipelines. The current development environment is all about accelerating cloud adoption. Enterprises are doubling down on maximizing their existing tech stack, marrying business and IT requirements, weighing TCO considerations, and gaining a competitive edge.
Streaming pipelines liberate data from silos and maximize its usability, making it easy to produce, share, consume and trust data assets while ensuring quality controls and security policies are applied across the data estate. They redefine how systems, applications and teams work together by providing self-service access to well-formatted data products so organizations can unlock new use cases.