Enterprises that operate across multiple countries and regions face a growing quandary. They want to standardize their processes and data so that they can analyze performance and trends across all their operations and markets — and apply AI across the full dataset. But privacy regulations vary by region and country, which means that Personally Identifiable Information (PII) has to be protected and stored according to different rules for each jurisdiction. The quandary is that effective analysis demands keeping all the data in one place, but privacy and data residency regulations require that sensitive PII data be kept apart.
Privacy engineering is the umbrella term for the use of technologies to help organizations solve this quandary, but it's been a struggle to find an effective means of meeting these contradictory requirements. In most cases, PII means customer data or employee data, and it's held in or passed between many different applications, from CRM, e-commerce, marketing and analytics, to recruitment, payroll, HCM, and so on — often with separate instances serving individual countries or regions.
Wrapping privacy and security tools around so many different data stores and processes is a complex and costly task. It has to balance the requirement to safeguard PII with the need to make data available for action and analysis by various interested parties across the organization and beyond. The penalties for getting this wrong can run to multi-million-dollar fines imposed by regulators for privacy breaches or compliance failures.
Solving for data residency
A big part of the challenge is the sheer complexity of safeguarding PII across so many different data stores and applications, multiplied across separate instances to comply with local data residency and privacy regulations. The root cause of this complexity is that the PII is dispersed across the architecture and has to be protected at every step. What if the PII data were abstracted from the architecture from the outset, and only shared with applications and analytics, in an anonymized, masked or encrypted form, as and when needed? This is the approach taken by Skyflow, a pioneer of an emerging concept in privacy engineering called a data privacy vault. Its CEO, Anshu Sharma, explains:
The elegant solution is, what if your own people couldn't see the data? There's no reason when I'm registering for TikTok, or for a pharmacy, or for an airline, why any employees of that company needs to see my phone number. There is nothing in my phone number that's useful to them other than to call me or text me, which is done by Twilio, or in Europe's case, Ericsson's Vonage.
In payment, same thing, my credit card number is useless to you as a company, unless you're calling Checkout.com, or Stripe, in which case, send them the credit card number, but you don't need to see it. Same with analytics, [in our] Snowflake partnership and other similar ones. If you could de-identify the data, you can still do all the machine learning and analytics. In fact, you can do more, because now you don't have to worry about the consequences.
Its most recent offering extends this principle to training generative AI models, ensuring that sensitive data is not exposed when building a custom model using a public service such as OpenAI's GPT.
Just two years after launching its privacy vault, Skyflow has already built up an impressive customer list, ranging from established brands including IBM and Lenovo, to fast movers such as loyalty card program operator BambuMeta and healthcare startup Nomi Health. The customer base circles the globe, from North America and Europe to India, Japan and South Africa, reflecting the appeal of its product for global companies grappling with the challenges of data privacy and local data residency laws. Instead of having to replicate their Customer Data Platform (CDP) or other system of record in each separate regulatory area, they can store just the PII in a local data privacy vault, which serves the data each application needs via an API call. Sharma comments:
I just assumed people had solved data residency, in somehow magical ways. It turns out, the best that the market has to offer today is, get one more instance in one more country. But at the end of the day, if you're launching a product, are you going to do it on six CDPs in six different countries? As more countries add these laws, we're going to end up with 20 of them. And then what does a marketing manager do who wants to find out, 'Do millennials buy my product more or less?' Do they run a report in each of these 17 Snowflake instances, and then generate a super report? The whole thing is a little bit of a mess.
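To make the contrast with "one more instance in one more country" concrete, here is a minimal Python sketch of the pattern described above: PII stays in a per-region vault, while a single central dataset holds only tokens and non-identifying attributes. Everything in it is an assumption for illustration (the RegionalVault class, the region names, the record fields and the in-memory storage); it does not reflect Skyflow's actual API.

```python
import secrets

class RegionalVault:
    """Hypothetical in-memory stand-in for a vault deployed in one region."""

    def __init__(self, region: str):
        self.region = region
        self._store = {}  # token -> PII; never leaves this region

    def tokenize(self, pii: dict) -> str:
        # Random token: it carries no information about the PII itself.
        token = f"{self.region}:{secrets.token_hex(8)}"
        self._store[token] = pii
        return token

    def detokenize(self, token: str) -> dict:
        return self._store[token]

# One vault per regulatory area (assumed regions).
VAULTS = {region: RegionalVault(region) for region in ("eu", "in", "us")}

# The single system of record holds tokens and de-identified attributes only.
central_records = []

def register_customer(region: str, pii: dict, attributes: dict) -> str:
    """Store PII locally, then append a de-identified row centrally."""
    token = VAULTS[region].tokenize(pii)
    central_records.append({"customer": token, "region": region, **attributes})
    return token

def purchases_by_cohort(age_band: str) -> int:
    """One report over one dataset, with no PII involved."""
    return sum(r["purchases"] for r in central_records if r["age_band"] == age_band)
```

In this sketch the marketing manager's question is answered by one call to purchases_by_cohort over the central records, and the regional vaults are only consulted when an application genuinely needs the underlying PII.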
A key technology in Skyflow's offering is its use of polymorphic encryption. Instead of simply encrypting the data in a single pass when it goes into the privacy vault, Skyflow encrypts each field separately. For example, a phone number typically consists of a country code, an area code and a local number. By encrypting each of these components separately, it becomes possible, for example, to find all records with the same area code without having to decrypt the data. Using another encryption algorithm on the income field, an analytics application could then calculate the average income for that area, again without decrypting any identifiable data. Another application might use a redaction algorithm to fetch the last four digits of the local number for identity verification purposes. All of these versions of the data are created from the outset to optimize query performance, and different forms of the data can be called via the API depending on the access policies that govern the application.
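As a rough illustration of the per-component idea, the Python sketch below stores a keyed digest of each part of a phone number, so records can be matched by area code without recovering the number, alongside a redacted last-four view. It is not Skyflow's implementation: the HMAC-based index is a stand-in for whatever schemes the vault actually uses, and the names INDEX_KEY, blind_index, redact_last4, protect_phone and same_area are hypothetical. Aggregates such as average income would need a different scheme and are left out.

```python
import hashlib
import hmac

# Hypothetical demo key; a real vault would manage keys per field and tenant.
INDEX_KEY = b"demo-key-not-for-production"

def blind_index(value: str) -> str:
    """Keyed, deterministic digest: equal inputs match, the value stays hidden."""
    return hmac.new(INDEX_KEY, value.encode(), hashlib.sha256).hexdigest()

def redact_last4(local_number: str) -> str:
    """Keep only the last four digits, masking the rest."""
    digits = [c for c in local_number if c.isdigit()]
    return "*" * max(0, len(digits) - 4) + "".join(digits[-4:])

def protect_phone(country: str, area: str, local: str) -> dict:
    """Create several protected forms of one phone number at write time."""
    return {
        "country_idx": blind_index(country),
        "area_idx": blind_index(area),
        "local_idx": blind_index(local),
        "local_last4": redact_last4(local),
    }

def same_area(records: list, area: str) -> list:
    """Find records sharing an area code by comparing digests, not plaintext."""
    target = blind_index(area)
    return [r for r in records if r["area_idx"] == target]
```

The design point is that each form is computed once, when the data enters the vault, so a query only has to compare or return a precomputed value rather than decrypt anything.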
For customers — who are spread across consumer brands, financial services, digital healthcare and ISVs who build the privacy vault into their own products — the goal is to create a clean environment where PII can be protected and prepared for zero-trust access by other applications, exposing only what's needed to complete the task. Sharma elaborates:
Imagine you've got a company with 30,000 or 100,000 employees. You can't quite have your IoT data, or your application data from your products or whatever, just end up in your main data lakes, willy-nilly. You ideally want to have a smaller environment — you can call it a clean cloud or a clean data-sharing environment. You make sure that before the data gets into your other destinations, you de-identify the data, you detect that you've appropriately polymorphically encrypted it, it generates the right tokens for you with the right shapes.
And you control and manage sharing. If you are a conglomerate in pharmaceuticals, you may decide that your team that does marketing can have access to demographic information. Whereas the team that is meant to report to the FDA adverse reactions to your vaccine has detailed information access, but it's controlled to fewer people ... This is not departmental sharing, which is what you would end up doing here with the Snowflakes and Databricks of the world. This is controlled sharing of what fields who can see where.
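The "what fields who can see" idea can be sketched as a small field-level policy table. The Python example below is an assumption-laden illustration, not Skyflow's policy model: the role names, field names, the POLICIES structure and the mask and view_for helpers are all hypothetical.

```python
# Hypothetical policy table: role -> field -> how that field may be returned.
POLICIES = {
    "marketing": {"age_band": "plain", "region": "plain"},
    "safety_reporting": {
        "age_band": "plain",
        "region": "plain",
        "name": "plain",
        "phone": "masked",
    },
}

def mask(value: str) -> str:
    """Show only the last four characters of a value."""
    return "*" * max(0, len(value) - 4) + value[-4:]

def view_for(role: str, record: dict) -> dict:
    """Return only the fields the role's policy allows, in the allowed form."""
    policy = POLICIES.get(role, {})  # unknown roles see nothing
    view = {}
    for field, mode in policy.items():
        if field not in record:
            continue
        view[field] = record[field] if mode == "plain" else mask(record[field])
    return view
```

Under these assumptions a marketing caller gets demographic fields only, the safety reporting team gets more detail with the phone number masked, and any role without a policy gets an empty result, which is the default-deny behavior a zero-trust setup would want.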
Packaging this up as a resource that developers can access via an API massively reduces time to deployment. Customers who have contemplated building a similar capability for themselves speak about reducing timescales from years to months, or from months to days and hours. Boe Hartman, co-founder and CTO of Nomi Health and a former CTO at Goldman Sachs, says this of Skyflow:
We were up and running in hours, rather than the months it would take to build and implement even a fraction of this.
All this has led to growing confidence at the company since we last spoke to Skyflow two years ago. Citing an article on privacy engineering in the IEEE's Computer magazine published last October, Sharma believes the concept of a data privacy vault is winning mainstream recognition. He says:
We are no longer saying, we are a way of doing data security, or we are this thing for that guy. We are head-on saying, 'We are a data privacy vault company.' We created this category, and now the market is coming to us. People have recognized this internally for many years — I think we talked about the Netflix's of the world. Now, as a whole, the market is basically realizing that data privacy vault is a core new architectural element ...
We are a core architecture block, but we've simplified it so that it's just an API call. It should feel to you as easy as calling Twilio or Okta ... It's a new way of handling PII. But the interactions from your engineers' perspective is, 'When I get the data, you make this API call. And when I need to use the data, you make these calls.' And then you're done.
One of the ongoing challenges as digital technology continues to evolve is that the way things were done in the past often gets in the way of implementing the best way to do things today. The vertically integrated application stacks of old are an impediment to organizations' growing need to join together and analyze data from across the business. But as organizations bring together the old data stores and open up access to the underlying data, privacy compliance headaches multiply.
A fresh approach is needed. Instead of centering the data architecture on individual applications, there's a need to move to a more composable Tierless Architecture, in which each application and service can work with the data it needs simply by calling an API. In this new landscape, it makes a lot of sense to keep PII in a specialized data privacy vault that can respond to those API requests with just the right information tailored to the specific context. By very carefully thinking through how to structure that API with its use of polymorphic encryption and other techniques, Skyflow has positioned itself well for that role.
By isolating PII in this way, it becomes much simpler to protect it with zero-trust access protocols, while also allowing for data residency needs by distributing instances of the vault across a network of data centers. As Sharma points out, this happens to map more closely to people's behavior in the real world than keeping all of an individual's PII in a central store. When he travels to India, he uses a local phone number and credit card, neither of which he'd be likely to use when he's in the US, so it makes no sense to have that data replicated elsewhere. He comments:
What's the point of having all this data being moved around? You're not only creating regulatory challenges, it's also useless!