Revisiting ethical AI, part two - on data management, privacy, and the misunderstood topic of bias

By Neil Raden, November 26, 2020
No, you can't program your AI for empathy or ethics. But you can certainly confront the problem of bias. In part two of revisiting AI ethics, we examine how bias, data management, and privacy should be addressed.


In Part 1, I proposed that the term "AI Ethics" created a new role for those with a background in ethics as a discipline. These "ethicists" asserted themselves unnecessarily by expanding their influence into AI development as consultants.

I should qualify this gripe, because I am acquainted with some professionals employed as ethicists who provide a valuable service in organizations developing AI as products. 

Moving on, I spent some time reviewing what is evident in AI ethics issues, covered the central topic of social context, and finished with what you need to think about. 

This article starts with the all-important, widely-discussed but often poorly understood topic of bias.

There are many types of bias:

  • Sample Bias - when some members of a population are more likely to be sampled
  • Prejudice Bias - preconceived opinion not based on reason or experience
  • Measurement Bias - incorrectly measuring a variable
  • Algorithmic Bias - systematic, repeatable errors that create unfair outcomes
  • Implicit/Insidious Bias - Attitudes and stereotypes that affect our decisions in an unconscious way
  • Technical Bias - issues in the technical design
  • Anchoring Bias - relying too heavily on the first piece of information given about a topic
  • Stupid Bias - per Bertrand Russell: "The fundamental cause of trouble in the modern world is that the stupid are cocksure while the intelligent are full of doubt"
  • Confirmation Bias - tendency to interpret new evidence as confirmation of existing beliefs
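To make the first of these concrete, here is a minimal sketch, using entirely invented numbers, of how sample bias distorts an estimate: a survey that is more likely to reach one subgroup produces a mean that drifts away from the true population mean.

```python
import random

random.seed(0)  # deterministic for the example

# Hypothetical population: equal thirds of low, mid, and high earners.
population = [(30_000, "low"), (60_000, "mid"), (120_000, "high")] * 1000

def biased_sample(pop, n):
    # Sample bias: higher earners are more likely to be reached by the survey.
    reach = {"low": 1, "mid": 2, "high": 3}
    return random.choices(pop, weights=[reach[group] for _, group in pop], k=n)

true_mean = sum(income for income, _ in population) / len(population)
sample = biased_sample(population, 500)
sample_mean = sum(income for income, _ in sample) / len(sample)

print(f"population mean: {true_mean:.0f}")       # 70000
print(f"biased-sample mean: {sample_mean:.0f}")  # systematically higher
```

No amount of averaging over a bigger sample fixes this; the skew is in who gets sampled, not in the sample size.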

You can't program ethics, empathy, or the concept of harm into an AI, but here's what you can do:

Ethical implementation of AI means: 

  • Scrupulous methods to remove bias and catch it creeping back in 
  • Asking the question, "I can build this, but should I?" 
  • Not allowing your objectives to overwhelm people's privacy 
  • Resisting digital phenotyping - it's almost impossible not to discriminate or annoy 
  • Using your influence to encourage your partners, stakeholders, regulators, legislators - everyone - to apply the same ethical behavior
  • Not using data brokers unless they are thoroughly vetted

The largest part of AI work is data.

At the start of a PA/ML project, when you take on a data source, you inherit all of the bias that went into it, all the way down its lineage. You also don't know the context of the collection and storage decisions. You cannot remove bias from data computationally; you must use your judgment, run experiments, and consult with others. As the French proverb puts it, "Mieux vaut prévenir que guérir" - it is better to prevent than to heal. ML models do not work on raw data; algorithms impose requirements. Raw data contains errors, and columns may be redundant or irrelevant. That's why you must spend a great deal of time on: 

  • Data Cleaning to delete duplicate rows or redundant columns
  • Outlier Detection and removal
  • Missing Value identification and imputation
  • Feature Selection with statistics and models
  • Feature Importance with models
  • Data Transforms to change data scales, types, and distributions
  • Dimensionality Reduction to create low-dimensional projections
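Several of these steps can be sketched in a few lines of pandas. This is a minimal illustration over an invented toy dataset, not a production pipeline: deduplication, dropping a redundant column, median imputation, IQR-based outlier removal, and a simple rescaling transform.

```python
import pandas as pd

# Toy dataset standing in for raw project data (hypothetical values).
raw = pd.DataFrame({
    "age":         [34, 34, 29, None, 58, 210],  # duplicate row, missing value, outlier
    "income":      [52_000, 52_000, 48_000, 61_000, 75_000, 80_000],
    "income_copy": [52_000, 52_000, 48_000, 61_000, 75_000, 80_000],  # redundant column
})

# 1. Data cleaning: drop the duplicate row and the redundant column.
df = raw.drop_duplicates().drop(columns=["income_copy"])

# 2. Missing-value imputation: fill the numeric gap with the median.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Outlier removal: keep values within 1.5x the interquartile range.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]

# 4. Data transform: rescale income to the [0, 1] range.
df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

print(df)
```

Even on six rows, every one of the first four list items above shows up; on real data, each of these choices (which median? which IQR multiplier?) is a judgment call that can itself introduce bias.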

More prescriptions:

  • Understanding relationships with embedded connectors is key
  • Data discovery is dynamic, not a one-time mapping to a stable schema
  • AI-driven discovery is becoming a must-have (but beware of the "Deus ex machina" effect) 
  • No algorithm is entirely sufficient out of the box 
  • Expert input to relearn and adapt - a cooperative approach

Data doesn't speak for itself


One guiding principle about data should never be overlooked: data should never be accepted on faith. How data are construed, recorded, and collected results from human decisions about what to measure, when, and where, and by what methods. The context of data - why the data were collected, how they were collected, and how they were transformed - is always relevant. There is no such thing as context-free data; data cannot manifest the kind of perfect objectivity that is sometimes imagined.

At a certain level, the collection and management of data may be said to presuppose interpretation. "Raw data" is not merely a practical impossibility, owing to the reality of pre-processing; instead, it is a conceptual impossibility. Data collection itself is already a form of processing.


On the topic of privacy, the typical approaches are anonymization, tokenization, and encryption. Encryption is too cumbersome, tokenization is even worse, and anonymization doesn't work at all. Typically, data management simply masks or removes PII or other private, sensitive information to anonymize the data (unless the company uses differential privacy, a relatively new technique I described previously). It is a simple process for a skilled data miner to combine anonymized data sources with other data sources to re-identify (de-anonymize) people from what are supposed to be protected records. Consider a record with five or six PII properties. If you remove or mask them, you have probably diminished the analytical content of the dataset. I like to say that solving human problems requires human data, not statistical aggregations. Also, all of the other attributes still identify the person, just not uniquely. But if some of those attributes match up with attributes in other datasets, a join or two breaks the anonymization scheme. 
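Here is a minimal sketch of that "join or two," using entirely fabricated records: an "anonymized" table has its names stripped, but the quasi-identifiers left in for analysis (zip, birth date, sex) also appear in a second, public dataset that still has names.

```python
import pandas as pd

# "Anonymized" medical records (fabricated): names removed, but
# quasi-identifiers - zip, birth date, sex - retained for analysis.
medical = pd.DataFrame({
    "zip":        ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1959-03-12", "1962-11-02"],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["hypertension", "diabetes", "asthma"],
})

# A public dataset (also fabricated) with the same attributes, plus names.
public_roll = pd.DataFrame({
    "name":       ["A. Smith", "B. Jones", "C. Lee"],
    "zip":        ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1959-03-12", "1962-11-02"],
    "sex":        ["F", "M", "F"],
})

# One join on the shared attributes re-identifies every "anonymized" record.
reidentified = medical.merge(public_roll, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```

No PII field was ever exposed; the combination of ordinary attributes was enough to link every diagnosis to a name.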

It gets worse. In previous studies of credit card metadata, Imperial College's Yves-Alexandre de Montjoye demonstrated that just four random pieces of information were enough to re-identify 90% of the shoppers as unique individuals. And on the privacy erosion of smartphone location data, researchers were able to uniquely identify 95% of the individuals in a data set with just four spatio-temporal points. 

I can say that 50-60% of the ethical gaps that harm people and company reputations are in the data, not just careless developers and malicious actors - yet 60-70% of the articles and papers are focused on algorithmic bias. 


When AI is developed by an ensemble team and creates havoc, who is to blame?

A growing school of thought holds that anthropomorphizing AI shifts responsibility to the model and away from the firm. I ran across an AI in production named Roscoe. If that application went rogue, I'm pretty sure we'd hear "Roscoe went off the farm," rather than the names of the principal developers, the brand, or the company.

Anthropomorphism is endemic in AI, whether it is a dangerous formulation that distorts the true nature of AI, or, as McDermott famously complained, a set of "wishful mnemonics": "understand," "learn," "decision-making," "problem-solving." Reid Blackman, Ph.D., is very opposed to anthropomorphizing in AI. In a recent LinkedIn conversation, he explained: "ML calculates weighted variables. Describing these calculations as 'deliberations' is as anthropomorphic as calling its outputs 'decisions.' People decide what data to train the ML on. People determine what the objective functions and hyperparameters are. People choose which model to use. People choose to use the ML outputs in *their* deliberations and decisions about what to do. People make decisions and therefore bear responsibility for the consequences. Talking about the software's 'decisions' is incorrect." 

Responsibility needs to flow to all of those who had a role, even executives or the Board of Directors. If the work was aligned with the directives or strategy of the executives or Board, they need to take some responsibility. Otherwise, we might as well give up on pretending to care about privacy, ethics, and safety.

Final advice

  • It's effortless to mess up on bias
  • Find diversity in your team to match your target
  • Don't underestimate how difficult this is
  • You will need infrastructure and data architects
  • Start with what you know that has upstream potential
  • The big wins will come
  • DIY tools are great learning tools, but not for production
  • The data work is hard, but AI-driven tools and catalogs are helping
  • Create inclusive teams
  • Your work can be used maliciously
  • Policymakers need to learn about these threats
  • The AI world must learn from cybersecurity
  • Ethical frameworks for AI need to be developed and followed
  • Use inclusive data; diverse teams are critical for building solutions that meet the needs of diverse audiences
  • Remember the facial-recognition systems fiasco
  • Data provenance affects the accuracy, reliability, and representativeness of the solution
  • Prevent unintentional bias with accurate representation in the data of all the groups you are trying to serve
  • View bias from a broad scope: social, gender, and financial; any bias that might affect disparate groups
  • Data audits and algorithmic assessments flag and mitigate discrimination and bias in machine-learning models
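One concrete form such an audit can take is the common "four-fifths rule" check on selection rates across groups. Here is a minimal sketch with invented counts; a ratio below 0.8 is a widely used flag for possible disparate impact, not proof of it.

```python
# Hypothetical model outcomes by group for a basic disparate-impact audit.
outcomes = {
    "group_a": {"approved": 180, "total": 300},  # 60% selection rate
    "group_b": {"approved": 120, "total": 300},  # 40% selection rate
}

# Selection rate per group, then the ratio of the lowest to the highest.
rates = {g: d["approved"] / d["total"] for g, d in outcomes.items()}
impact_ratio = min(rates.values()) / max(rates.values())

print(f"selection rates: {rates}")
print(f"impact ratio: {impact_ratio:.2f}")  # 0.67
if impact_ratio < 0.8:
    print("flag: possible disparate impact - audit the data and the model")
```

A check like this belongs in the pipeline, run on every retrain, not performed once at launch.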

My take

We have a long way to go. OpenAI's GPT-3 ("Generative Pre-trained Transformer 3") is an autoregressive language model that uses deep learning to produce human-like text. It was trained on a supercomputer Microsoft assembled in the cloud, and it runs with an astounding 175 billion parameters. There are some keyword combinations here worth noting: Microsoft + supercomputer, supercomputer + cloud, 175 + billion + parameters. As Joe Biden would say, "Here's the thing" - everything we've figured out, or at least proposed, about ethical practices... let's just say it's going to be an all-day job.