I’m never surprised by what I find in client data, but it’s still amusing to find data anomalies that are clearly wrong.
For instance, I’m working with a large oil and gas company that’s interested in measuring the thickness of pipes. They hire contractors who specialize in this area to take periodic measurements, and then their inspectors analyze the data to see if anything needs to be done about pipes that are getting too thin.
Unfortunately, preliminary temporal analysis looks a bit dubious. Although the contractors take measurements in the same place on the same pipe, a literal interpretation of the data would suggest that some pipes actually grow in thickness over time! Not probable. What’s more likely is a measurement error.
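A check like this is easy to automate. The following is only a minimal sketch, assuming each reading arrives as a (location, ISO date, thickness) tuple. The function name, the field layout, and the sample values are made up for illustration and are not the client's data.

```python
from collections import defaultdict


def find_thickness_increases(readings, tolerance=0.0):
    """Flag consecutive readings at one location where thickness went up.

    readings: iterable of (location, iso_date, thickness) tuples.
    tolerance: growth at or below this amount is ignored.
    """
    by_location = defaultdict(list)
    for location, date, thickness in readings:
        by_location[location].append((date, thickness))

    suspects = []
    for location, series in by_location.items():
        series.sort()  # ISO dates sort chronologically as plain strings
        for (d1, t1), (d2, t2) in zip(series, series[1:]):
            growth = t2 - t1
            if growth > tolerance:
                suspects.append((location, d1, d2, round(growth, 3)))
    return suspects


# Hypothetical readings in millimeters, invented for this example.
readings = [
    ("pipe-7/pt-2", "2021-01-15", 9.8),
    ("pipe-7/pt-2", "2022-01-20", 10.1),  # the pipe "grew": suspicious
    ("pipe-7/pt-2", "2023-01-18", 9.6),
    ("pipe-9/pt-1", "2021-02-01", 12.0),
    ("pipe-9/pt-1", "2022-02-03", 11.9),
]
print(find_thickness_increases(readings))
```

A nonzero tolerance is worth considering in practice, since tiny increases may sit inside the instrument's stated precision.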
To err is human
Measurement error is the amount of process variability that can be attributed to collecting and measuring the data.
Imagine two timekeepers capturing the cycle time — in thousandths of a second — of a pro football player running a 40-yard dash. Will they have exactly the same time? Probably not. But there is only one true time, right? So what value will you use as the cycle time of record? Whatever it is will contain some degree of error due to the way the data was collected.
This is an important concept to take seriously as a data scientist and as a consumer of data science, because the value of any algorithm is dependent on the quality of its inputs. And although everyone is comfortable with this rationale, what’s often overlooked is the reliability of the system used to collect the data.
Measurement System Analysis (MSA) is a structured, mathematical method for determining how much of your data quality problem is caused by the measurement system. The automotive industry established a widely accepted rule of thumb about measurement error: under 10% of total variation is best, and it shouldn’t be more than 30%.
Let’s say you’re analyzing the overall performance of our football player above. Over the past 100 sprints, you notice an average time of 4.523 seconds, with a standard deviation of 0.132 seconds; not bad. The total variation will be caused by more than just measurement error; for instance, some days he may not be feeling great. However, some of that variability comes from the fact that the timekeepers can’t possibly record the true cycle time, so they do the best they can. But we don’t want them accounting for more than 30% of that 0.132 seconds of standard deviation, or roughly 0.04 seconds.
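To make the arithmetic concrete, here is a small sketch that applies the 10% and 30% thresholds to the sprint numbers. It assumes the percentage is a ratio of standard deviations, which is how the example above is phrased; some MSA conventions compare variances instead. The function names, the rating labels, and the 0.05-second timekeeper figure are illustrative assumptions.

```python
def measurement_share(sigma_measurement, sigma_total):
    """Measurement standard deviation as a fraction of total standard deviation."""
    return sigma_measurement / sigma_total


def rate(share):
    """Apply the rule of thumb: under 10% is best, over 30% is too much."""
    if share < 0.10:
        return "best"
    if share <= 0.30:
        return "tolerable"
    return "needs improvement"


sigma_total = 0.132            # seconds, from the 100 sprints
budget = 0.30 * sigma_total    # most the timekeepers should contribute
print(f"30% budget: {budget:.4f} s")

# Hypothetical timekeeper error of 0.05 s, invented for this example.
share = measurement_share(0.05, sigma_total)
print(f"share: {share:.1%} -> {rate(share)}")
```

The measurement standard deviation itself has to come from somewhere, typically from repeated or paired readings of the same event.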
Techniques for reducing or eliminating measurement error
If your measurement system contributes more than 30% of your overall error, something must be done to improve it. And even if you’re under 30%, or even under 10%, the goal should be to eliminate measurement error altogether. Statisticians will have a hard time with that comment, because you cannot statistically eliminate measurement error — but don’t use that as an excuse; you can and should set zero as your goal. The most efficacious strategy is to get humans out of the way.
I hate to point the finger at our own species, but if you want a precise measurement, a human is the wrong tool. In our fictitious football player example, and our very real pipe measurement example, humans are at the root of our measurement problems — you don’t need a root cause analysis to figure that out. In fact, the National Football League (NFL) switched over to electronic timing over a decade ago for this very reason.
Using computers and other automated/electronic means to record measurements is obvious, but what’s not so obvious is that even a computer can’t guarantee true measurements. When I worked with a large financial institution on cybersecurity, we faced a big problem during a time-series analysis using various (non-human) data collection points. In some cases, we found a transaction approved before it was ever initiated (the humor of it never wanes with me). This is, of course, not what happened; there was a time-syncing issue between different servers.
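Impossible orderings like this are simple to screen for. The sketch below assumes each transaction carries an initiated and an approved timestamp as ISO 8601 strings recorded by different servers. The field layout and the sample rows are hypothetical and do not come from the institution's systems.

```python
from datetime import datetime


def find_approved_before_initiated(transactions):
    """Return (txn_id, gap_seconds) for rows whose approval precedes initiation.

    transactions: iterable of (txn_id, initiated_iso, approved_iso) tuples.
    """
    flagged = []
    for txn_id, initiated, approved in transactions:
        gap = (
            datetime.fromisoformat(approved) - datetime.fromisoformat(initiated)
        ).total_seconds()
        if gap < 0:
            flagged.append((txn_id, gap))
    return flagged


# Hypothetical rows, invented for this example.
transactions = [
    ("t1", "2020-03-01T10:00:05", "2020-03-01T10:00:07"),
    ("t2", "2020-03-01T10:00:09", "2020-03-01T10:00:08"),  # approved "early"
]
print(find_approved_before_initiated(transactions))
```

The size of the negative gaps is a useful clue: if they cluster around one value, a fixed clock offset between two servers is a likely suspect.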
Measurement error like this surfaces explicitly, but most error in electronic measurement systems goes unnoticed because the measurement comes from a single collection source, with nothing to contradict it. You should devise a way to collect the same measurement from at least two sources, so that any disagreement between them exposes the error.
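Once two sources record the same events, their disagreement gives a rough estimate of the measurement error. The following is a sketch under stated assumptions: both sources measure the same true value, their errors are independent, and their error variances are equal, so the standard deviation of the paired differences is about the square root of two times the per-source error. The function name and the sample times are invented for illustration.

```python
import math
import statistics


def paired_measurement_error(source_a, source_b):
    """Estimate (bias, per-source error sd) from paired readings of the same events."""
    if len(source_a) != len(source_b):
        raise ValueError("sources must contain the same number of readings")
    diffs = [a - b for a, b in zip(source_a, source_b)]
    bias = statistics.mean(diffs)  # systematic offset of A relative to B
    # Var(a - b) = 2 * sigma^2 under the independence and equal-variance assumptions.
    error_sd = statistics.stdev(diffs) / math.sqrt(2)
    return bias, error_sd


# Hypothetical 40-yard dash times in seconds from two timing sources.
source_a = [4.50, 4.62, 4.41, 4.55]
source_b = [4.52, 4.60, 4.45, 4.53]
bias, error_sd = paired_measurement_error(source_a, source_b)
print(f"bias: {bias:+.4f} s, per-source error sd: {error_sd:.4f} s")
```

A large bias with a small spread points to a calibration or clock offset, while a large spread points to noisy collection.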
Finally, formalize a process to eradicate measurement error. Fortify your skills in Sources of Variation (SOV) analyses; these are specialized analyses created for the specific purpose of isolating the nature and magnitude of variation. Once you isolate the largest contributor to measurement error, take specific action to eliminate it. For instance, if excessive file I/O is compromising the integrity of the times reported in your web logs, move the web server to a machine that’s less active. There’s more value in the process than the antidote; you would not have found the excessive file I/O if you weren’t looking for it, and you wouldn’t realize your server was throwing off erroneous timings if you weren’t taking shadow measurements.
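As a taste of what isolating variation looks like, here is a simplified sketch of one common approach: a balanced one-way variance components estimate that splits variation into a within-item part (repeat readings of the same item, a proxy for measurement error) and a between-item part (real differences among items). It is a stand-in for illustration, not a full SOV or gage study, and the function name and data are assumptions.

```python
import statistics


def variance_components(groups):
    """Split variance into within-group and between-group components.

    groups: dict mapping an item name to its repeated readings.
    Requires a balanced design: the same number of readings (2+) per item.
    """
    sizes = {len(values) for values in groups.values()}
    if len(groups) < 2 or len(sizes) != 1:
        raise ValueError("need 2+ items with equal numbers of readings")
    n = sizes.pop()
    if n < 2:
        raise ValueError("need at least 2 readings per item")

    # Mean square within: average of the per-item sample variances.
    ms_within = statistics.mean([statistics.variance(v) for v in groups.values()])
    # Mean square between: n times the sample variance of the item means.
    item_means = [statistics.mean(v) for v in groups.values()]
    ms_between = n * statistics.variance(item_means)

    between = max(0.0, (ms_between - ms_within) / n)  # clamp negative estimates
    return {"within": ms_within, "between": between}


# Hypothetical repeat readings of three items, invented for this example.
groups = {"A": [10.0, 10.2], "B": [12.0, 12.2], "C": [14.0, 14.2]}
components = variance_components(groups)
print(components)
```

In this toy data nearly all variation is between items, which is the healthy case; the reverse would say the measurement system is drowning out the thing being measured.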
You can’t manage what you can’t measure, and you can’t measure without the right tools. All data analysis is dependent on its underlying data, which must be collected somehow. If that collection process is throwing off bad data, your analysis is doomed from the start.
Take time to analyze your data collection process and measurement system before building your fancy data algorithms. Stay away from human data collection systems; make sure you have multiple readings for the same data point; and formalize a process to exterminate measurement error.
You already deal with enough data quality issues — don’t let measurement error exacerbate the problem.