Top 5 biases to avoid in data science

In data science, there are some important unconscious biases to steer clear of. Tom Merritt lists five biases for data scientists to keep in mind.


Bias isn't necessarily bad--it's our mind's way of making decisions quickly. It's an evolutionary advantage. For example, I have an absolute bias against walking out in front of moving cars. However, you need to be aware of what your biases are to know when they're serving you well and when they're getting in the way or making things worse. Data science is fertile ground for unconscious bias to cause problems. If you're not aware of your biases, you can easily draw wrong or even dangerous conclusions. Here are five biases for data scientists to keep in mind.

  1. Selection (or sample) bias. This one is easy to fall into. It happens when the selected data is not representative of the cases the model will see. An all-too-frequent example is facial recognition trained predominantly on images of people with fair skin, leading to algorithms that can't accurately identify people with darker skin.
  2. Confirmation bias. This is where you toss out information that doesn't fit your preconceived notion--and it can be subconscious as all get out. You have to work hard to take in new data with an open mind.
  3. Survivorship bias. This is where you select your data points because they're successful. Looking for data on what makes a product succeed? Don't just choose the successful products; you need data from the failures and the middle performers too.
  4. Availability bias. This is where you use the data that's easy to get. You need to look at all the data points that reasonably might inform your analysis, not just the stuff that's around. The menu from the Mexican restaurant last night is not a great data set for a nutritional study. A related bias is anchoring, where we give more importance to the first bit of data we get only because it was first.
  5. False causality. A classic example: the number of firemen at a fire correlates with higher property damage. Obviously the firemen are causing the damage, right? Or do more damaging fires simply need more firemen? A similar one to avoid is the clustering illusion--sometimes random things just cluster.
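Survivorship bias (number 3) is easy to demonstrate in a few lines. Here's a minimal sketch with made-up numbers: each hypothetical product's outcome is effort plus luck, but if you estimate "typical effort" only from the big hits, your estimate comes out far higher than the real average--because you silently conditioned on the survivors.

```python
import random

random.seed(0)

# Hypothetical data: each product's outcome = effort + luck.
products = [
    {"effort": random.uniform(0, 10), "luck": random.gauss(0, 3)}
    for _ in range(10_000)
]
for p in products:
    p["outcome"] = p["effort"] + p["luck"]

# Survivorship bias: study only the big successes.
survivors = [p for p in products if p["outcome"] > 12]

avg_all = sum(p["effort"] for p in products) / len(products)
avg_survivors = sum(p["effort"] for p in survivors) / len(survivors)

print(f"average effort, all products:    {avg_all:.1f}")
print(f"average effort, survivors only:  {avg_survivors:.1f}")
```

Run it and the survivors-only average is well above the true population average, even though luck did a lot of the work. The fix is the same as in the text: include the failures and the middle performers in your sample.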

There are a lot more where these came from, and you've probably heard of all of them. Just make sure you don't let your guard down against them. Remember, the weak link in data science is often our brains. If you know how your brain works and correct for its quirks, you've improved how well the whole system works.

Subscribe to TechRepublic Top 5 on YouTube for all the latest tech advice for business pros from Tom Merritt.
