Data science myths and realities - do data scientists really spend 80% of their time wrangling data?

Neil Raden, April 20, 2022
Summary:
Do data scientists really squander the bulk of their time cleaning data sets? Not necessarily - but for robust machine learning models, we do need better data management platforms.

Do data scientists really spend 80% of their time wrangling data?

Yes and no. The implication is clear: if this stat is accurate, then the burden of provisioning data for their models impedes data scientists' ability to use their data science skills.

Lost in that argument is that the “data wrangling” itself involves considerable data science skills. In addition, the wrangling provides downstream benefits to others, who can use the findings the scientist uncovers. Lastly, it’s ridiculous to make a blanket statement when the work of data scientists is not uniform across industries and data platforms.

This 80/20 claim first appeared at least ten years ago, and it still endures. There is no clear, rigorous evidence that this 80% measure is accurate. The real figure varies widely by organization, by application and, certainly, by the skill and tools applied. However, it is impossible to deny that sourcing data for analytical and data science uses is a significant effort, regardless of the percentage cited.

Acquiring valid data for investigations is part of the crushing problem of managing data in an increasingly complex, hybrid, distributed world. It is too great for even highly-skilled analysts and scientists to handle alone. The solution is a platform that provides coherent, connected services like data relationship discovery, data flow, sensitive data discovery, data drift, impact analysis and redundant data analysis. The entire suite has to be driven by AI working in concert with experts to foster relearning and adaptation. In place of inadequate approaches, a semantically rich data catalog buttressed by a knowledge graph is the key to making all this effort effective and deriving value from it.

Things to consider are: 

  • Why embedded machine learning technology that populates and maintains a knowledge graph is essential to managing data discovery: it maps relationships in the distributed data that are not evident in manual processes.
  • Why the data discovery process is dynamic, not a one-time ETL mapping to a stable schema.
  • Why actual metadata is not neatly arranged in a drawer but active throughout the whole process, from discovery to a dynamic, semantically rich data catalog.
  • Why even machine learning-driven software is inadequate if it stops at metadata such as column names and does not investigate the actual instances of the data itself.
  • The role of continuous learning. As experts inspect the results of the models, their input as additions, deletions, or corrections is fed back to algorithms to relearn and adapt.
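To make the point about looking at data instances rather than column names concrete, here is a minimal sketch. It is not any vendor's method: it uses a plain Jaccard overlap of column values where a real platform would use learned models, and the table names, column names, values and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of values: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_relationships(tables, threshold=0.5):
    """Find cross-table column pairs whose *values* overlap.

    Column names are never compared, so differently named columns
    holding the same identifiers can still be linked.
    """
    columns = [
        (table, col, values)
        for table, cols in tables.items()
        for col, values in cols.items()
    ]
    found = []
    for (t_a, c_a, v_a), (t_b, c_b, v_b) in combinations(columns, 2):
        if t_a == t_b:
            continue  # only relationships across tables are of interest
        score = jaccard(v_a, v_b)
        if score >= threshold:
            found.append((round(score, 2), f"{t_a}.{c_a}", f"{t_b}.{c_b}"))
    return sorted(found, reverse=True)


# Hypothetical sample data: the patient key is named differently in each table.
tables = {
    "lab_results": {"pt_ref": ["P001", "P002", "P003"], "test": ["A1C", "LDL", "A1C"]},
    "admissions": {"mrn": ["P001", "P002", "P004"], "ward": ["ICU", "ER", "ICU"]},
}
matches = candidate_relationships(tables)
print(matches)
```

A name-based matcher would never pair pt_ref with mrn; the values do. In the continuous-learning loop described above, an expert would confirm or reject each candidate, and that feedback would go back to the algorithms.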

Better tools are needed to improve the productivity (and job satisfaction) of highly skilled and highly compensated professionals. Even those performing analytics in more traditional ways can benefit from an intelligent and integrated product that takes them from data ingestion to an active, semantically rich data catalog.

It can’t be done with traditional methods. There is too much data, from too diverse a set of sources, for programmatic solutions. The data scientists (we use the term “data scientist” broadly to mean anyone using data for analytical and quantitative work) need some help. Interestingly, that help comes from the same disciplines they use in their work. The solutions that work today have machine learning technology at their core.

The promise of machine learning and AI-imbued applications catapulting us to impressive capabilities is energized by leaps in processing, storage and networking technologies, the ability to process data at a fantastic scale and the expanding skill sets of data scientists. This technology bedrock allows for an innovative approach to data management, not possible even a decade ago. 

Today's volume of data adds complexity to the problem, yet we still need to make sense of it at scale.

Only a few years ago, things seemed to be more orderly. Before the onset of big data, followed by the “data lake” and cloud object stores, the data warehouse was the primary data repository for analytics. The technology for extracting and integrating data for the data warehouse was Extract, Transform and Load (ETL). ETL was front-loaded in the data warehouse development process, pulling information from somewhat known data sources to a known schema. Once settled, ETL mostly ran as a steady process. Mostly. Data warehouses are stable but not static, so there is usually some continuing development with ETL, but for the most part, the routines run as production.

Sourcing and integrating data for data warehouses was not easy. First of all, the source systems were not designed to be data providers for a data warehouse. The semantics didn’t line up, and there were data quality problems.

The current fascination with “digital transformation” has organizations struggling to ramp up skills in machine learning, AI, deep learning or even just simple predictive models. Data sources for consideration exploded. For example:

  • Social media platforms offer a wide variety of views of their data.
  • Data.gov contains over a quarter million datasets ranging from Coast Guard accidents to bird populations, demographics to Department of Commerce information. 
  • Healthdata.gov contains 125 years of US healthcare data, including claim-level Medicare data, epidemiology and population statistics.

These are just a few of thousands of external data sources.

Even within an organization, disjointed data sources devised to capture data within a single domain are now seen as critically important data for new applications not possible before. For example, Population Health Management, as an application area, requires, at a minimum, the following data sources:

  • Patient Demographics 
  • Vital Signs 
  • Lab Results 
  • Progress Notes 
  • Problem Lists and Diagnoses 
  • Procedure Codes 
  • Allergy Lists 
  • Medication Data 
  • Admission, Discharge and Transfer 
  • Skilled Nursing and Home Health 
  • Social Determinants of Health 

No data warehouse can conveniently integrate all of this data. There are too many domains, too many data types and the sheer effort of cleaning and curating would overwhelm any data warehouse schema. The logical location for this data is some variant of cloud and on-premises, distributed via Hadoop or Spark or a data lake (or lakes). These data repositories make for a convenient way to manage the ingestion of the data, but they lack the functionality to energize it, to give it meaning for the investigator. 

The problem arises because none of these data sources is semantically compatible with the others. Combining and integrating data from multiple sources adds to the richness of the models, but it takes work to make the sources line up. Therein lies the 80% problem.

Data science work is very often one-off. It is a multi-step process involving data profiling, some data cleansing, continually transforming data from different sources into a single format, saving the data, naming it something memorable, and keeping track of versions. Each investigation starts with a model and selects the data for it. The creation of training data involves more data handling, and multiple runs or versions of the model are also named and saved. Another contrast between ETL and data discovery today is that ETL is always mapped to a stable schema.
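The steps just listed can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended toolchain: the two sources, their field names (pt, sbp, patient_id, systolic) and the hash-based version label are all hypothetical.

```python
import hashlib
import json


def profile(rows):
    """Data profiling: count missing (None or empty) values per field."""
    fields = sorted({key for row in rows for key in row})
    return {f: sum(1 for r in rows if r.get(f) in (None, "")) for f in fields}


def harmonize(rows, field_map):
    """Cleansing and transformation: rename source fields to the shared
    names, then drop rows that have no patient identifier."""
    out = []
    for row in rows:
        renamed = {field_map.get(k, k): v for k, v in row.items()}
        if renamed.get("patient_id") not in (None, ""):
            out.append(renamed)
    return out


def version_id(rows):
    """Version tracking: a short content hash, so that any change in the
    prepared data produces a different label."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Two hypothetical sources describing the same measurement differently.
clinic = [{"pt": "P001", "sbp": 120}, {"pt": "", "sbp": 135}]
hospital = [{"patient_id": "P002", "systolic": 128}]

print(profile(clinic))  # shows one missing "pt" value
combined = harmonize(clinic, {"pt": "patient_id", "sbp": "systolic"})
combined += harmonize(hospital, {})
print(version_id(combined), combined)
```

Even in this toy form, every new source needs its own field map, written by hand by someone who understands both sources. Multiply that by the data sources in the Population Health Management list and the time sink is easy to see.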

My take

There is considerably more data handling for each experiment, quite different from extracting curated data from a data warehouse. This is why it is so time-consuming. It is a natural productivity killer for data scientists. Even when using tools designed for big data/data science, there are many steps, and often, multiple technologies are employed, with incompatible metadata and weak to non-existent hand-off. There is a better way.
