With Big Data all the rage, many IT leaders are forgetting the most basic price of admission to the Big Data world: clean data. Predictive analytics and other Big Data novelties are downright sexy compared to the slog of gathering, normalizing, and cleansing data, but without clean data, your Big Data initiatives are likely to take longer, cost more, and deliver fewer benefits. Here’s how to get a jump on cleaning your data:

First, identify the problem

Most data cleansing and master data management (MDM) initiatives are scrapped because they’re accurately perceived as long, costly efforts with little immediate value. A Big Data initiative, by contrast, has a clear end state, and often a clearer way to quantify where data cleansing money is going. Working backward from that end state helps demonstrate the value of data cleansing. Start with the problems you expect Big Data to solve and the benefits of gaining the rapid responses and refinements characteristic of Big Data, then compare the cost of cleaning data repeatedly against the cost of biting the bullet and doing it right the first time.
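
The comparison between repeated cleaning and a one-time fix comes down to simple break-even arithmetic. The sketch below is illustrative only: the function name, its parameters, and the dollar figures in the usage note are assumptions, not numbers from any real project.

```python
import math


def break_even_runs(one_time_cost, per_run_after_fix, per_run_ad_hoc):
    """Return the number of analysis runs after which a one-time cleanup
    becomes cheaper than cleaning the data by hand for every run.

    All three inputs are hypothetical estimates you would supply yourself.
    Returns None when the cleanup never pays for itself.
    """
    savings_per_run = per_run_ad_hoc - per_run_after_fix
    if savings_per_run <= 0:
        return None  # ad hoc cleaning is no more expensive per run
    # Smallest whole number of runs strictly greater than cost / savings.
    return math.floor(one_time_cost / savings_per_run) + 1
```

With an assumed one-time cleanup cost of 200,000, a per-run cost of 5,000 afterward, and 30,000 per run for ad hoc cleaning, `break_even_runs(200000, 5000, 30000)` returns 9: the two approaches cost the same after eight runs, and the one-time fix wins from the ninth onward.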

Then, find your data

With a well-articulated problem that Big Data can solve, your next major effort should be locating the data required to solve that problem. While this may sound straightforward, the average company has many sources of “the truth” for each business process, along with bits and pieces of data representing a single business event scattered across the enterprise. It’s perfectly acceptable to pare back the scale of your Big Data plans to accommodate the current state of your data, especially if your company is new to the game. A few early, small successes are far better than getting caught in the weeds of trying to solve all your data problems at once and never actually delivering any value.

Build clean data into the process

For your initial analytical efforts, creating what amounts to a “data middleware” that processes a batch of data from multiple systems into a usable format may be an attractive option. However, you’ll quickly end up building increasingly elaborate conversion programs, and eventually create what amounts to a standalone IT system designed merely to make up for the deficiencies of your other IT systems. As Big Data demonstrates its value, you’ll quickly see the weak points in your data acquisition systems. While data warehouses are no longer in vogue, they might be the solution to gathering clean and consistent data from a multitude of systems.
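
To make the “data middleware” idea concrete, here is a minimal sketch of a batch normalizer. Everything specific in it is an assumption for illustration: the two source systems (`crm` and `billing`), their field names, and their date formats are hypothetical.

```python
from datetime import datetime

# Hypothetical mappings: source field name -> common field name.
FIELD_MAPS = {
    "crm": {"CustName": "customer_name", "Created": "created_on", "Email": "email"},
    "billing": {"customer": "customer_name", "open_date": "created_on", "email_addr": "email"},
}

# Hypothetical date format used by each source system.
DATE_FORMATS = {"crm": "%m/%d/%Y", "billing": "%Y-%m-%d"}


def normalize_record(source, record):
    """Convert one raw record from a named source into the common format."""
    clean = {}
    for raw_name, common_name in FIELD_MAPS[source].items():
        value = record.get(raw_name)
        if isinstance(value, str):
            value = value.strip()
        clean[common_name] = value or None  # treat empty strings as missing

    if clean["created_on"]:
        parsed = datetime.strptime(clean["created_on"], DATE_FORMATS[source])
        clean["created_on"] = parsed.date().isoformat()
    if clean["email"]:
        clean["email"] = clean["email"].lower()
    if clean["customer_name"]:
        # Collapse repeated whitespace and standardize capitalization.
        clean["customer_name"] = " ".join(clean["customer_name"].split()).title()
    return clean


def normalize_batch(batches):
    """Normalize a dict of {source name: list of raw records} into one list."""
    return [
        normalize_record(source, record)
        for source, records in batches.items()
        for record in records
    ]
```

Note that every additional source system means another entry in the mapping tables and usually another special case in the cleanup logic, which is how a convenient script can grow into the standalone conversion system described above.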


Another flaw in many IT environments that Big Data rapidly identifies is unnecessary complexity in data acquisition systems. For mature companies, decades of fields and flags whose meaning has long been forgotten, and that users have long since learned to skip over or fill with gibberish, are the bane of rapid analytics. As business counterparts show frustration with the cost and complexity of cleaning data, remind them that much of that complexity might be self-inflicted. While it may not be a popular message, there’s a cost to each and every field and function larded into software when it comes to rapid and timely analysis.
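
One rough way to find the fields users skip or fill with gibberish is to profile how often each field holds a meaningful value. The sketch below is only a starting point, and its specifics are assumptions: the placeholder list, the 50 percent threshold, and the function names are all hypothetical and would need tuning for real data.

```python
# Values treated as "not really filled in"; this list is an assumption.
PLACEHOLDERS = {"", "n/a", "na", "none", "x", "xxx", "tbd", "."}


def profile_fields(records):
    """Return, for each field, its fill rate and count of distinct real values."""
    total = len(records)
    field_names = sorted({name for record in records for name in record})
    report = {}
    for name in field_names:
        values = []
        for record in records:
            raw = record.get(name)
            values.append("" if raw is None else str(raw).strip().lower())
        meaningful = [v for v in values if v not in PLACEHOLDERS]
        report[name] = {
            "fill_rate": len(meaningful) / total if total else 0.0,
            "distinct": len(set(meaningful)),
        }
    return report


def suspect_fields(report, min_fill_rate=0.5):
    """List fields whose fill rate falls below the chosen threshold."""
    return [name for name, stats in report.items() if stats["fill_rate"] < min_fill_rate]
```

A field that is rarely filled in, or that holds only a handful of placeholder values, is a candidate for retirement or at least for a conversation with the business about what it was ever meant to capture.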

While data cleansing is unlikely to grab headlines in the major business newspapers the way Big Data has, it’s a critical price of admission for even the most rudimentary data analysis. Disjointed systems and an unkempt data warehouse might have been fine for a handful of custom reports that took IT several months to produce, but expectations are shifting toward near real-time analysis of massive amounts of data. With interest in the analysis side of data at an all-time high, it’s not a bad time to suggest efforts to clean up the most critical aspect of any Big Data project: the data itself.