Using machine intelligence to protect sensitive data

By Kurt Marko, August 23, 2017
Summary:
Amazon Macie, a DLP service that uses machine learning and NLP, and the Google Cloud DLP API lead the charge in the automatic identification of sensitive information. Others will follow.

Can machine intelligence in the form of Amazon Macie and the Google Cloud DLP API solve once-impossible problems?

Deep learning algorithms have revolutionized natural language processing (NLP) and automated image analysis, making features that were once the stuff of science fiction seem routine. Online text translation, consumer chatbots, and automatic face detection and tagging in photos all rely on predictive analytics and deep learning.

As I've discussed many times over the past few months, whether for cyber security like malware detection, conversational UIs, or specialized industry applications, AI is reshaping the world of enterprise software, with significant implications for every business.

One area of emerging promise for machine intelligence is a vexing problem facing every organization: data protection and privacy. No one wants to see their company or personal secrets posted to Wikileaks, or exploited by a competitor. While data loss prevention (DLP) software has been around for over a decade, it's been of limited utility in stemming data breaches. In large measure, this is due to the complexity of configuring DLP, its reliance on document metadata and tags that may not exist, and its limited or nonexistent ability to process and understand the underlying unstructured data, whether text or rich media.

The shortcomings of DLP software are behind the widely held axiom that DLP is a process, not a technology. However, new services from AWS and Google, with others sure to follow, will significantly shift the balance towards technology-centric, automated DLP approaches.

AWS leapfrogs the competition

AWS got everyone's attention at its recent AWS Summit when it announced Amazon Macie, a DLP service that uses machine learning and NLP to automatically identify sensitive information or unusual activity around such data. Macie, which initially works with data in S3 storage, stems from technology developed by Harvest.ai, a startup using AI-based algorithms to detect data breaches and advanced persistent threats (APTs) and to proactively stop data exfiltration. Amazon quietly acquired Harvest early this year, and it didn't take long to transform this product built on AWS into an actual AWS service.

Macie uses machine learning (ML) to automatically discover, classify, monitor and protect sensitive information, which can span personally identifiable information (PII), protected health information, intellectual property, legal or financial data, source code, security certificates or keys and database backups. It does this by building and continuously analyzing predictive models that look for the typical sources of sensitive information, along with suspicious access patterns such as changes to policies and access control lists that indicate accidental or deliberate overexposure of information or unusual attempts to access content. Although primarily an alerting engine, through integrations with Amazon CloudWatch (the AWS event monitoring service) and using a suite of prebuilt Lambda functions, Macie can automate incident response and remediation processes.
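To make the idea of spotting "unusual attempts to access content" concrete, here is a minimal sketch in Python. It is not Macie's algorithm, which AWS has not published in detail; it swaps in a simple statistical baseline (a z-score over past daily access counts) for the predictive models described above. The function name, the threshold, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev

def unusual_access(history, today, threshold=3.0):
    """Flag today's access count if it deviates strongly from the baseline.

    history:   past daily access counts for one user or bucket (hypothetical data)
    today:     the count observed today
    threshold: how many standard deviations count as "unusual" (an assumption)
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat baseline: any change is unusual
    return abs(today - mu) / sigma > threshold

# A user who normally reads about ten objects a day.
baseline = [12, 9, 11, 10, 13, 8, 11]
print(unusual_access(baseline, 12))   # an ordinary day
print(unusual_access(baseline, 250))  # a sudden bulk download
```

A production service would presumably model many more signals (who, from where, which content types), but the shape is the same: learn what normal looks like, then alert on deviations.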

Once pointed at an S3 target, Macie automatically scours the data and classifies it by file and content type, using both file metadata and the raw content. Like other ML software, AWS pre-trained Macie's machine classifier using a corpus of common content types. Once the content is classified, Macie automatically assigns a decile risk level. Both the classifications and risk scores can be customized using the AWS management console, which also displays a dashboard summarizing content types and risk profile. Unlike most DLP products, Macie's degree of automation means it's incredibly easy to set up; after configuring some account security settings, just point it at some S3 buckets and let it chew away at the data.
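The classify-then-score flow can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not Macie's classifier: it uses a handful of regular expressions where Macie uses trained models, and the detector names and risk weights are invented for illustration. The only real-world detail is that AWS access key IDs commonly begin with `AKIA`.

```python
import re

# Hypothetical detectors: (label, pattern, risk weight on a 1-10 scale).
# The weights are illustrative and are not Macie's actual risk levels.
DETECTORS = [
    ("aws_access_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), 10),
    ("private_key", re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"), 10),
    ("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 8),
    ("email_address", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), 3),
]

def classify(text):
    """Return (labels found, risk level 1-10) for a blob of text."""
    labels = []
    risk = 1  # baseline risk for content with no matches
    for label, pattern, weight in DETECTORS:
        if pattern.search(text):
            labels.append(label)
            risk = max(risk, weight)  # the riskiest match sets the level
    return labels, risk

print(classify("contact: jane@example.com"))
print(classify("key AKIAIOSFODNN7EXAMPLE ssn 123-45-6789"))
```

Taking the maximum weight rather than a sum is a deliberate simplification: one leaked key makes an object high risk no matter what else it contains.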

Google DLP: meant for developers, but powerful capabilities

Google doesn't hype the AI connection, but it likely preceded AWS in applying learning algorithms to DLP when, over a year ago, it introduced a DLP scanning engine for enterprise Gmail customers. Although DLP policies are organized by a set of rules, the underlying content scanning almost certainly uses deep learning models to automatically detect things like credit card or personal ID numbers even when they're embedded in rich attachments like slide decks and images. Last January, the company extended the DLP feature to protect any file in Google Drive.

The G Suite DLP feature uses technology Google later exposed as a standalone service via a set of Google Cloud DLP APIs. As its product manager summarizes the features:

The DLP API gives GCP users access to a classification engine that includes over 40 predefined content templates for credit card numbers, social security numbers, phone numbers and other sensitive data. Users send the API textual data or images and get back metadata such as likelihood and offsets (for text) and bounding boxes (for images).

Like all ML algorithms, the DLP API generates a probability score used to classify the likelihood that a given set of data matches one of the protected data types. Of course, the algorithms work on text, as is best seen by trying this demo application, but what makes Google's system so impressive is that it also works with images, for example, numbers from a scanned credit card, as Google showed in this presentation. The DLP API currently works with data stored in Google Cloud Storage (object), BigQuery (data warehouse), and Cloud Datastore (NoSQL database), but since it's a REST API, it can also be integrated with external data sources.
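To show what "likelihood and offsets" look like in practice, here is a self-contained Python sketch that returns findings in a similar shape for credit card numbers in text. It does not call the Google API and is not Google's detection logic: it substitutes a regular expression plus the standard Luhn checksum for the service's classifier, and the function names and result fields are hypothetical. The likelihood labels are modeled on the style of labels the service reports.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def luhn_valid(digits):
    """Standard Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def inspect(text):
    """Return hypothetical findings: info type, likelihood, character offsets."""
    findings = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        # A passing checksum raises confidence; a failing one lowers it.
        likelihood = "VERY_LIKELY" if luhn_valid(digits) else "POSSIBLE"
        findings.append({
            "info_type": "CREDIT_CARD_NUMBER",
            "likelihood": likelihood,
            "start": m.start(),
            "end": m.end(),
        })
    return findings

sample = "card 4111 1111 1111 1111 ok"
for finding in inspect(sample):
    print(finding)
```

The offsets let a caller redact or mask exactly the matched span, for example `sample[:start] + "****" + sample[end:]`, without re-scanning the text.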

My take, others to follow

Using machine intelligence to classify and identify sensitive information is the most significant advance in the DLP product category's history. Although Amazon and Google are leading the charge, others will surely follow. For example, the Azure Information Protection service currently uses metadata and policy labels to classify data sensitivity. Adding machine intelligence to automatically tag and scan data seems like an obvious extension to this DLP and rights management engine.

Similarly, SAP is applying ML to a variety of business processes such as invoice processing, brand recognition and penetration, and customer sentiment analysis. Given the vast amount of enterprise data stored in SAP systems, adding AI-powered DLP seems obvious. When asked about the possibility, Markus Noga, SAP's VP of ML, admitted he'd heard the request from customers, but said that DLP currently isn't a high priority. I expect that to change.

DLP is also a natural fit for IBM's cognitive service, Watson, which has already applied ML technology to security threat analysis via the QRadar Advisor. Indeed, an IBM Research paper has documented a deep learning system for DLP that can automatically detect sensitive content in Office documents and PDFs. Don't be surprised when an evolution of this technology shows up in an IBM service.

The problem with traditional DLP software is the inherent limitation of understanding only metadata and text content. It's a fatal flaw in today's rich media world, where people are more likely to share information using photos, voice or video clips. AI, particularly deep learning, is the key to unlocking the significance of such data and will become a prerequisite in future DLP software. Scanning emails is no longer good enough when your millennial employees are collaborating on FaceTime, Duo, Snapchat and WhatsApp.