Google Cloud launches BigLake to ‘unify data lakes and warehouses’

Derek du Preez, April 6, 2022
Google Cloud has today made a series of data announcements that aim to simplify organizations’ approach to analytics.

Google Cloud is seeking to address what it calls the ‘data-to-value gap’ in the enterprise, with a series of new products that aim to make data more accessible and easier to work with. 

Central to the new releases this week is the preview of BigLake, a data lake storage engine that aims to unify data lakes and warehouses. Neil Raden has written on diginomica about how organizations have progressed their use of data from data lakes to warehouses and everything in between, which is worth reading for some context. 

However, Google Cloud’s BigLake aims to reduce the burden on buyers that have data stored in multiple locations, removing the need to worry about the underlying storage format or system. The idea is that managing data across disparate silos not only increases risk and cost, because data has to be duplicated or moved between sources, but also makes the data harder to work with. 

Gerrit Kazmaier, Google Cloud’s VP and GM of Database, Data Analytics, and Looker, said: 

Today, data exists in many formats, is provided in real-time streams, and stretches across many different data centers and clouds, all over the world. From analytics, to data engineering, to AI/ML, to data-driven applications, the ways in which we leverage and share data continues to expand. Data has moved beyond the analyst and now impacts every employee, every customer, and every partner. With the dramatic growth in the amount and types of data, workloads, and users, we are at a tipping point where traditional data architectures – even when deployed in the cloud – are unable to unlock its full potential. As a result, the data-to-value gap is growing. 

With BigLake, customers gain fine-grained access controls, with an API interface spanning Google Cloud and open file formats like Parquet, along with open-source processing engines like Apache Spark. These capabilities extend a decade’s worth of innovations with BigQuery to data lakes on Google Cloud Storage to enable a flexible and cost-effective open lake house architecture.

Google Cloud’s data journey to date has very much focused on acting as a united platform across multiple data sources, as we saw with its BigQuery Omni announcement in 2020, which allows customers to carry out analytics across their Google Cloud, AWS and Azure environments. We’ve also highlighted how customers, such as Delivery Hero and Home Depot, are using Google Cloud data tools to meet changing business needs. 

Alongside the BigLake announcement, Google Cloud also revealed Spanner change streams. Spanner is Google Cloud’s globally distributed SQL database, and the addition of change streams can be seen as an acknowledgement that a growing number of organizations are moving to event-driven architectures. Kazmaier said: 

Another major innovation we’re announcing today is Spanner change streams. Coming soon, this new product will further remove data limits for our customers, allowing them to track changes within their Spanner database in real time in order to unlock new value. Spanner change streams tracks Spanner inserts, updates, and deletes to stream the changes in real time across a customer’s entire Spanner database. 

This ensures customers always have access to the freshest data as they can easily replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub, or store changes in Google Cloud Storage (GCS) for compliance. With the addition of change streams, Spanner, which currently processes over 2 billion requests per second at peak with up to 99.999% availability, now gives customers endless possibilities to process their data.

Vertex AI advancements

Google Cloud’s AI products are powered by Vertex AI, its managed platform that includes ML tools needed to build, deploy and scale models. The cloud provider today also made a number of announcements relating to Vertex that include: 

  • Vertex AI Workbench (GA) - Vertex AI Workbench aims to bring data and ML systems into a single interface so that teams have a common toolset across data analytics, data science, and machine learning. Google Cloud customers can access their BigQuery data directly from within Vertex AI Workbench. 

  • Vertex AI Model Registry - Available in preview, Vertex AI Model Registry provides a central repository for discovering, using, and governing machine learning models, including those in BigQuery ML. It aims to make it easier for data scientists to share models and for application developers to consume them, with the hope that teams can turn data into real-time predictions and decisions, and generally be more agile in the face of shifting market dynamics.

Commenting on the announcements, Kazmaier said: 

With native integrations across BigQuery, Serverless Spark, and Dataproc, Vertex AI Workbench enables teams to build, train and deploy ML models 5X faster than traditional notebooks. In fact, a global retailer was able to drive millions of dollars in incremental sales and deliver 15% faster speed to market with Vertex AI Workbench.

With Vertex AI, customers have the ability to regularly update their models. But managing the sheer number of artifacts involved can quickly get out of hand. To make it easier to manage the overhead of model maintenance, we are announcing new MLOps capabilities with Vertex AI Model Registry. Now in preview, Vertex AI Model Registry provides a central repository for discovering, using, and governing machine learning models, including those in BigQuery ML. 

This makes it easy for data scientists to share models and application developers to use them, ultimately enabling teams to turn data into real-time decisions, and be more agile in the face of shifting market dynamics.

My take

The major cloud providers are seeking to bring data infrastructure and tooling into their platforms, as vendors such as Confluent and DataStax continue to gain popularity with buyers that are looking for better ways to manage their data. Ease of use, access and manageability are all front of mind for buyers that are faced with greater complexity and a need to understand data in real time. Google’s Data Cloud Summit is running this week and we will be tuning into the customer sessions to find out more. 
