Stop overdoing it when cleaning your big data

Enough is enough--your big data might actually be getting too clean. Find out why it can be useful to keep bad, garbage data.

When you got a job as a data scientist, I bet you didn't imagine you'd spend so much time cleaning up bad data. Don't feel bad; none of us did.

When data science rolled onto the scene, many of us who were already in the data warehousing and business intelligence fields thought we'd entered a new era where sophisticated algorithms and artificial intelligence were the order of the day. To a certain extent that's where we went, but we were never fortunate enough to get rid of that grimy aspect of the job: data cleansing.

Bad data has been around for as long as good data has existed--it's the yin and yang of the data world. And since data builds the foundation of everything a data scientist does, there's no good way to avoid the cleansing aspect of the job, unless you get a sous scientist. Since most of us don't have that luxury, we often spend our days keeping garbage data away from the data store.

Being the good data scientist that you are, I'm sure you've done your due diligence to clean up your data, but are you overdoing it? Shiny, clean data makes me suspicious.

Not so fast with that sponge

When I was training to be a Six Sigma Black Belt at Motorola, my Master Black Belt gave me good advice about cleaning data: don't do it. It's a very different approach from what you would find in the data warehousing world, but I get it.

To a statistician, especially one who works on Six Sigma projects, inputs are a given. You don't remove data because it looks bad; you treat it as an outlier or as variation that's causing a quantifiable error in the process. The cause could be measurement error (that is, error related to the actual instrument used to take the measurement), or it could be legitimate variation in the process.

I recently did some work at Chevron where we periodically measured pipe wall thickness to assess the corrosion rate. In some cases, the current readings showed more thickness than the previous readings, suggesting the pipes actually grew. Our knowledge of the physical world told us that pipes don't spontaneously grow, but that's what the readings showed; we accounted for that through our understanding of how pipes are measured. There are a number of reasons why we got those readings, including imprecise instruments, inexperienced operators, or inaccuracies in locating the measurement point.

The most popular idea was to throw this data away on the rationale that pipes can't grow. As my learned Master Black Belt warned, this would be unwise. The same measurement error that makes some readings too thick makes other readings too thin, so deleting only the impossible-looking ones skews what remains. Any predictive statistics run on the trimmed data would be inaccurate, and therefore any inference drawn from it would be flawed. This is the kind of overcleaning you should avoid in your data cleansing efforts, as counterintuitive as it may seem. Note that artificial intelligence techniques are not immune to this kind of errant data cleansing.
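
To see why throwing out the "pipes grew" readings distorts the result, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration, not data from the Chevron work: a nominal 10 mm wall, a true loss of 0.10 mm between inspections, Gaussian instrument noise of 0.15 mm per reading, and the hypothetical helper `simulate_losses`.

```python
import random
import statistics


def simulate_losses(true_loss_mm=0.10, noise_sd_mm=0.15, n=10_000, seed=42):
    """Return apparent thickness losses (previous reading minus current reading).

    All parameters are illustrative assumptions. Each reading gets its own
    independent Gaussian measurement error, so some apparent losses come out
    negative, which is the "pipe grew" case.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n):
        previous = 10.0 + rng.gauss(0.0, noise_sd_mm)
        current = 10.0 - true_loss_mm + rng.gauss(0.0, noise_sd_mm)
        losses.append(previous - current)
    return losses


losses = simulate_losses()

# Estimate 1: keep every reading, including the "impossible" negative losses.
mean_all = statistics.mean(losses)

# Estimate 2: overclean by deleting every reading where the pipe "grew".
kept = [x for x in losses if x >= 0]
mean_kept = statistics.mean(kept)

print(f"true loss:            0.100 mm")
print(f"mean, all readings:   {mean_all:.3f} mm")
print(f"mean, 'cleaned' data: {mean_kept:.3f} mm")
print(f"readings deleted:     {len(losses) - len(kept)} of {len(losses)}")
```

Under these assumed numbers, the estimate that keeps every reading lands close to the true loss, while the "cleaned" estimate overstates it, because the deletion removes noise on one side only. A real corrosion analysis would model the measurement error explicitly rather than rely on a plain mean.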

Where to draw the line when cleaning data

There is data that just needs to go. This is where--great wisdom notwithstanding--my Master Black Belt's advice can't be applied categorically. To be frank, it's more of an art than a science to know where the line is, so let's start with an obvious extreme.

Using our example of pipe measurements, let's say a reading came back as zero--no pipe thickness at all. That means there's no pipe there, and the nasty liquid that's supposed to be in the pipe is all over the place. That's just not possible. And the fact that the reading is exactly zero would lead me to believe the measuring device wasn't working, and nobody caught this before it made its way to the data store.

This is the true definition of garbage data: If there's an obvious error in your data, don't let it through. Obvious errors can show up in the absence of humans (not catching an error of automation) and in the presence of humans (the infamous user error).
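
As a rough sketch of where that line might sit for the pipe example, the Python function below separates obvious garbage from merely suspicious readings. The function name `classify_reading`, the three labels, and the exact rules are hypothetical choices for illustration; a real rule set would come from the people who know the instruments.

```python
import math
from typing import Optional


def classify_reading(current_mm: Optional[float],
                     previous_mm: Optional[float] = None) -> str:
    """Label one thickness reading as 'garbage', 'suspect', or 'ok'.

    Assumed rules, for illustration only:
    - missing, NaN, zero, or negative thickness is an obvious error;
    - a reading thicker than the previous one is physically odd but is
      kept, because measurement error can legitimately produce it.
    """
    if current_mm is None or math.isnan(current_mm) or current_mm <= 0:
        return "garbage"   # do not let this into the analytic store
    if previous_mm is not None and current_mm > previous_mm:
        return "suspect"   # keep it and let the statistics handle it
    return "ok"


print(classify_reading(0.0, previous_mm=9.8))   # zero thickness
print(classify_reading(9.9, previous_mm=9.8))   # the pipe "grew"
print(classify_reading(9.7, previous_mm=9.8))   # ordinary reading
```

The design choice that matters is the middle label: a reading that looks wrong but has a plausible measurement explanation is flagged, not deleted.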

Bad code (written with good intentions, of course) is another common source of bad data. I have no problem cleansing a data store after a significant bug is discovered and then fixed.

Why you should keep bad data

Garbage data should be kept away from your production analytic store, but I wouldn't be quick to eradicate the data. Like most data professionals, I don't like the idea of deleting data. The fact that you're getting bad data from a particular source is useful information, and you'd probably like to understand why the data came in so bad. The best practice is to clean it with a staged transformation but keep the original bad data, because it's useful for root-cause analysis and possibly for other reasons.
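
One way to picture a staged transformation that keeps the original bad data is the Python sketch below. It is an assumption-laden illustration, not a prescribed design: the record layout, the field names `source` and `thickness_mm`, the gauge names, and the helper `stage_readings` are all hypothetical.

```python
from collections import Counter


def stage_readings(raw):
    """Split raw records into a clean list and a quarantine list.

    The raw records are never modified or deleted. Quarantined copies
    carry a 'reason' field so someone can investigate them later.
    """
    clean, quarantine = [], []
    for record in raw:
        thickness = record.get("thickness_mm")
        if thickness is None or thickness <= 0:
            quarantine.append(
                {**record, "reason": "missing or non-positive thickness"}
            )
        else:
            clean.append(dict(record))
    return clean, quarantine


# Hypothetical raw landing data; it stays untouched as the record of origin.
raw = [
    {"source": "gauge-A", "thickness_mm": 9.7},
    {"source": "gauge-B", "thickness_mm": 0.0},
    {"source": "gauge-B", "thickness_mm": None},
    {"source": "gauge-A", "thickness_mm": 9.9},
]

clean, quarantine = stage_readings(raw)

# Counting quarantined records per source points at where to look first.
bad_by_source = Counter(r["source"] for r in quarantine)

print(f"raw: {len(raw)}, clean: {len(clean)}, quarantined: {len(quarantine)}")
print(f"bad readings by source: {dict(bad_by_source)}")
```

Only the clean list would feed the production analytic store; the raw and quarantine sets stay available for tracing a bad source back to its cause.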

Summary

Data cleansing is not a glamorous job, but it's part of the data scientist's job nonetheless. If you're coming from the database world, it's best to follow the data cleansing advice that comes from the world of statistics: Be very careful about the data you delete or modify for the purposes of analysis. Many analytic techniques can account for variation and error, and many artificial intelligence techniques can recognize and neutralize errant data, so it's best to let the downstream techniques do their job.

However, you're not completely off the hook. Obviously bad data must be identified, quarantined, and investigated. This data comes from somewhere or someone, so it behooves you and your organization to get to the root cause. In the meantime, don't let this data throw a wrench into your analysis. Sure, your algorithms may be able to absorb it, but there's no sense in taking the risk.

Review your data quality strategy and answer these important questions: What are your rules? In trying to protect the analytic engine, are you going too far and overcleaning your data?

If you scrub too hard on a data store, you just might damage it.
