Why Are We Still Cleaning Data in 2026?
By Nasly Duarte
Data scientists spend 80% of their time cleaning data.
That stat is from 2016. It's 2026 now. A full decade later. And the number hasn't moved.
Think about that.
In ten years we've built large language models that write code, autonomous vehicles that navigate cities, AI systems that diagnose disease from imaging. But we still can't figure out how to stop dirty data from entering a system.
Or maybe we just stopped asking.
The industry's response to the 80% problem has been to build better cleaning tools. Better ETL pipelines. Better data wrangling platforms. Better preprocessing libraries. Faster joins. Smarter imputation. More efficient deduplication.
We got really, really good at mopping the floor.
But nobody turned off the faucet.
Let's look at what "data cleaning" actually means:
Inner joins — combining records from two separate sources that should have been connected from the start. Why are they separate? Because two systems captured the same information independently and nobody enforced a shared structure.
Null handling — filling in missing values. Why are there missing values? Because the system allowed someone to submit incomplete data without flagging it.
Deduplication — removing records that appear more than once. Why do duplicates exist? Because multiple people entered the same information in different places and nothing prevented it.
Type casting — converting text to numbers, strings to dates. Why is a price stored as text? Because a human typed "$6,000" into a free-text field instead of entering a number into a validated field.
Normalization — restructuring data into a consistent format. Why is the format inconsistent? Because the system accepted any format the user felt like giving it.
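Every one of those techniques is a line or two of everyday preprocessing code. A minimal sketch in pandas, with invented toy data for illustration:

```python
import pandas as pd

# Two systems captured the same customers independently;
# nothing enforced a shared structure at entry.
crm = pd.DataFrame({"email": ["a@x.com", "b@x.com"],
                    "name": ["Ana", "Ben"]})
billing = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@x.com"],
                        "price": ["$6,000", "$6,000", None]})

# Deduplication: the same record was entered twice and nothing prevented it.
billing = billing.drop_duplicates()

# Inner join: reconnect records that should have been linked from the start.
df = crm.merge(billing, on="email", how="inner")

# Type casting: a price typed as free text ("$6,000") becomes a number.
df["price"] = pd.to_numeric(df["price"].str.replace("[$,]", "", regex=True))

# Null handling: fill in a value the system allowed to go missing.
df["price"] = df["price"].fillna(0)

print(df)
```

Four of the five techniques, each cleaning up after a point-of-entry failure the comments spell out.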
Every single technique exists for the same reason.
The system that created the data didn't enforce structure at the point of entry.
That's not a data problem. That's a design problem.
And here's what concerns me: we're training an entire generation of data professionals to accept this as normal. Preprocessing isn't taught as a workaround — it's taught as a core competency. As if the mess is a given and our job is just to clean it up faster.
What if it's not a given?
What if instead of building smarter cleaning tools, we designed systems that produce clean, structured, validated, relational data the moment it's created? Not cleaned after the fact. Not transformed downstream. Not wrangled into shape by a team of engineers. Structured from the start.
What would that eliminate?
No more inner joins, because the data is already linked at creation.
No more null handling, because incomplete records are blocked before they enter the system.
No more deduplication, because data is generated once, from one source, through one process.
No more type casting, because outputs are typed by design, not by human input.
No more ETL, because there's nothing to extract, transform, or load. The data is already where it needs to be, in the format it needs to be in.
I'm not saying preprocessing knowledge is useless. If you work with data today, in the real world, you need every one of those skills. The data is messy. The systems are broken. The cleaning has to happen.
But I am saying we should stop treating the mess as permanent.
The 80% number hasn't changed in a decade because we've been optimizing the wrong side of the equation. We keep investing in what happens after data enters a system. Almost nobody is investing in what happens at the point of creation.
That's the question I can't stop thinking about.
What if instead of training people to clean data, we designed systems that never produce dirty data in the first place?
What would that change about how we build technology? What would that change about how we teach data science? What would that change about the 80%?
I'd love to hear from anyone who's working on this — or thinking about it.
#DataScience #AI #DataGovernance #DataEngineering #SystemsThinking #CEFModel