Don't sabotage your data science efforts with garbage

The greatest data science team in the world can't save you from bad source data. Learn five ways to make sure your data is not garbage.


The Kryptonite for any data scientist is low-quality data. You could invent the cleverest algorithm the world has ever seen, but it would be rendered useless when fed bad data. As they say, "Garbage in, garbage out."

I'm currently working with a large oil and gas company to improve the safety of their refineries, by helping them adopt a more risk-based inspection strategy. The optimal application of risk would be purely quantitative -- use historical inspection data to identify high-risk areas that require more attention. This approach is being challenged because some stakeholders lack confidence in the existing, historical inspection data. It's a valid challenge that's commonly faced by data professionals. To defend your data science, you must have good data quality techniques.

1: Clean sources

It all starts with a clean source. Housecleaning is much easier when you're starting with a relatively clean house -- the same goes for data cleansing.

There are tough questions being asked at my oil and gas client about how the data is collected. For instance, you may see places where the thickness readings of a pipe are larger in 2015 than they were in 2012. I'm no physicist, but I'm pretty sure pipes can't just grow in thickness over time. We haven't done a thorough root cause analysis of why we're seeing such dubious data, though it's worth investigating; I favor fixing problems at the source ten times over any sort of downstream data-cleansing mitigation.
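A check like this is easy to automate at the source. The sketch below flags pipes whose measured thickness increases between inspections; the reading format and IDs are illustrative, not the client's actual schema.

```python
# Sanity check at the source: flag thickness readings that grow over time.
# Hypothetical reading format: (pipe_id, year, thickness_mm).
def find_growing_pipes(readings):
    """Return pipe IDs whose measured thickness increases between inspections."""
    by_pipe = {}
    for pipe_id, year, thickness in readings:
        by_pipe.setdefault(pipe_id, []).append((year, thickness))

    suspect = []
    for pipe_id, series in by_pipe.items():
        series.sort()  # chronological order
        if any(t2 > t1 for (_, t1), (_, t2) in zip(series, series[1:])):
            suspect.append(pipe_id)
    return suspect

readings = [
    ("P-101", 2012, 12.4), ("P-101", 2015, 12.9),  # grew -- dubious
    ("P-102", 2012, 10.0), ("P-102", 2015, 9.6),   # thinned -- plausible
]
print(find_growing_pipes(readings))  # ['P-101']
```

Running a report like this before any analysis tells you whether the collection process, not the analysis, needs fixing first.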

2: Develop an answer key

Before you can claim high data quality, you must know what high data quality looks like. In some cases, this may not be possible. In my pipe measurement example, it's impossible to know exactly how much thinner a pipe should be after three years -- that's why you inspect. However, in some cases you do know what high data quality looks like.

It's best to have an answer key, especially if you're applying statistical techniques to determine data quality; a simple one-sample t-test can tell you the quality of your data.
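As a sketch of what that t-test looks like in practice, the snippet below compares hypothetical sensor readings against a known answer-key value using only the standard library; the readings and the calibration value of 10.0 are made up for illustration.

```python
import statistics
from math import sqrt

def one_sample_t(sample, expected_mean):
    """t statistic for H0: the sample mean equals the answer-key value."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    return (mean - expected_mean) / (sd / sqrt(n))

# Hypothetical sensor readings vs. a known calibration value of 10.0.
readings = [10.1, 9.8, 10.3, 9.9, 10.2, 10.0, 9.7, 10.4]
t = one_sample_t(readings, 10.0)
# Compare |t| with the critical value for n - 1 = 7 degrees of freedom
# (about 2.365 at the 5% level); a larger |t| suggests a quality problem.
print(abs(t) < 2.365)  # True -- the data agrees with the answer key
```

If the statistic clears the critical value, you have statistical evidence that the data drifts from what the answer key says it should be.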

If you're mining a company's email server for employee sentiment, your algorithm should exclude any spam that made its way into the server. Spam in this context is pretty obvious, so the inverse (non-spam) should be as well, and this would be your answer key.

3: Remember integrity rules

Integrity rules are conditions that must hold in the data if your data is clean.

I worked with a large tech firm on the construction of a customer registry for their government sales. The customer registry served as customer master data for four or five data sources. To integrate each data source, we interviewed the product owners about the ACD (add, change, delete) nature of their data; then, we installed ACD audit logs on their tables to see what actually happens. In almost all cases, there were rows deleted from tables that should never be deleted, and rows added to tables that were supposed to be static.

Consider the logic rules in your data that should apply if there's no data corruption, and build audit scripts to tell you when there's a violation. For instance, if there's a foreign key that points to a non-existent primary key, you have a problem.
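An audit script for that foreign-key case can be a single outer-join query. The sketch below uses an in-memory SQLite database with illustrative table and column names (not the registry from the project above) to surface orders that reference a non-existent customer.

```python
import sqlite3

# Minimal sketch: audit for foreign keys that point to missing primary keys.
# Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 99);  -- 99 is orphaned
""")

# A left join keeps every order; a NULL on the customer side means
# the foreign key points at nothing.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print(orphans)  # [(102, 99)]
```

Schedule queries like this to run after every load, and a violation becomes an alert instead of a surprise during analysis.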

4: Employ expert systems

If hands-off quantitative risk assessment doesn't fly at my oil and gas client, we will interview experts to see if we can replicate the process they go through to clean the data before they analyze it. This is an expert system, which is a rule-based replication of how a human expert would determine good data quality. An expert system works well as long as: 1) you have actual experts (hint: check their results and ignore their title); 2) they can clearly explain what they do; and 3) what they do can be translated into clear-cut rules.

As with most things, the theory oversimplifies the pragmatics, so be careful. Your experts may have had unconscious competency for quite some time, and therefore find it difficult to explain what they do. Try explaining to a grade-schooler how you drive a car. It's not that easy.
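When the experts can articulate their judgments, each one becomes a named rule. Here's a minimal sketch of such a rule set; every rule and threshold is illustrative, standing in for whatever the interviews actually surface.

```python
# Sketch of an expert system for data quality: each rule encodes one
# judgment an inspector described in an interview. All rules are illustrative.
RULES = [
    ("thickness must be positive",
     lambda row: row["thickness_mm"] > 0),
    ("thickness cannot exceed the nominal wall",
     lambda row: row["thickness_mm"] <= row["nominal_mm"]),
    ("inspection year must be plausible",
     lambda row: 1950 <= row["year"] <= 2025),
]

def violations(row):
    """Return the names of every rule the row breaks."""
    return [name for name, check in RULES if not check(row)]

row = {"thickness_mm": 13.1, "nominal_mm": 12.7, "year": 2015}
print(violations(row))  # ['thickness cannot exceed the nominal wall']
```

Naming each rule matters: when a row fails, the report tells the expert exactly which of their own judgments it violated, which makes the rule set easy to review and refine.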

5: Include machine learning in your arsenal

As recursive as it sounds to use machine learning to cleanse the data you'll use for machine learning, it actually works. There are two systems, one for cleansing and one for analyzing, and you need to keep their solution spaces separate -- they solve two different problems. But there's no reason why you can't teach a computer to learn what clean data looks like, especially if you have the answer key.

It still makes me nervous to rely solely on a computer to cleanse input data using machine learning; you never really know how well the cleansing algorithm will work, even with today's advances in machine learning. Amazon's pretty great, but it still recommends movies I would never watch. Still, it doesn't hurt to include machine learning in your arsenal to combat poor data quality.
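The idea can be sketched in a few lines. Below, a tiny nearest-centroid classifier stands in for whatever learner you'd actually use: it's trained on rows the answer key has already labeled clean or dirty, then labels new rows. The features and data are illustrative.

```python
# Sketch: teach a classifier what clean data looks like, using the
# answer key as labels. Nearest-centroid is a stand-in for a real learner.
def centroid(rows):
    """Average each feature across the rows."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def train(clean_rows, dirty_rows):
    return {"clean": centroid(clean_rows), "dirty": centroid(dirty_rows)}

def classify(model, row):
    """Label the row by whichever centroid it sits closest to."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], row))

# Illustrative features: (thickness_mm, change_since_last_mm)
clean = [(12.0, -0.3), (11.5, -0.4), (10.9, -0.2)]
dirty = [(12.9, 0.5), (13.4, 0.8), (12.2, 0.6)]  # "growing" pipes
model = train(clean, dirty)
print(classify(model, (11.2, -0.3)))  # clean
print(classify(model, (12.5, 0.7)))   # dirty
```

The important design point from the section above is visible even in this toy: the cleansing model is trained, evaluated, and applied entirely before the analysis model ever sees the data.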


I've described five ways to make sure you don't sabotage your data science efforts with garbage. Some of the tactics can be used right away, and some may take time to develop.

You should get serious about feeding only the highest quality data into your data science algorithms. Otherwise, you'll quickly see the quality of your data science team erode.
