Extract, transform, load (ETL) is the most talked-about strategy for pulling data out of multiple sources and systems and recombining it into new datasets for query purposes, and enterprises have used it for years to populate data marts and warehouses. But ETL alone won't link data into associations close enough to eliminate the extraneous "data fishing" that occurs when analytic queries chase multiple forms of the same data item and try to assimilate them into an answer. For that, you need data normalization.
What is data normalization, and how does it work?
For relational datasets, normalization minimizes data redundancy so that there is only one way to know a fact. The goal matters for big data analytics because the factual confusion created by disparate variations of the same data can impair the ability to arrive at accurate information. If the pathway to meaningful results from big data queries is clouded by these inconsistencies, enterprises begin to lose on their big data investments, because the answers their analytics produce are diluted by the inaccurate data feeding them.
Data normalization is a technical database operation, typically performed by a database analyst with the assistance of normalization tools; the goal is to collapse similar forms of the same data item into a single data form. Relational designs are commonly refined through first, second, and third normal forms, with each successive form imposing stricter rules for removing redundancy; third normal form, the usual target, leaves only one place to record each fact.
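To make "only one way to know a fact" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (a flat orders table versus a suppliers/orders pair) are illustrative, not taken from any particular system; the point is that a redundant column invites conflicting copies of the same fact, while the normalized design stores it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: the supplier's city is repeated on every order row,
# so there are many places to "know" the same fact -- and they can disagree.
cur.execute("""CREATE TABLE orders_flat (
    order_id INTEGER, supplier_name TEXT, supplier_city TEXT, amount REAL)""")
cur.executemany("INSERT INTO orders_flat VALUES (?, ?, ?, ?)", [
    (1, "Acme", "Boston", 100.0),
    (2, "Acme", "Boston", 250.0),
    (3, "Acme", "Bostn", 75.0),   # a typo creates a second version of the fact
])

# Normalized: the city is stored once, keyed by supplier -- one way to know it.
cur.execute(
    "CREATE TABLE suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    supplier_id INTEGER REFERENCES suppliers, amount REAL)""")
cur.execute("INSERT INTO suppliers VALUES (1, 'Acme', 'Boston')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 100.0), (2, 1, 250.0), (3, 1, 75.0)])

# The flat table splits Acme's spend across two "cities"; the normalized
# join reports a single, consistent answer.
flat = cur.execute(
    "SELECT supplier_city, SUM(amount) FROM orders_flat "
    "GROUP BY supplier_city").fetchall()
norm = cur.execute(
    "SELECT s.city, SUM(o.amount) FROM orders o "
    "JOIN suppliers s ON o.supplier_id = s.supplier_id "
    "GROUP BY s.city").fetchall()
print(flat)   # [('Bostn', 75.0), ('Boston', 350.0)]
print(norm)   # [('Boston', 425.0)]
```

The trade-off the article goes on to describe is visible even here: the normalized version needs an extra table and a join to answer the same question.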
Enterprises have normalized the data in their online transaction systems for years, but here is the quandary: these same enterprises have often chosen not to normalize the data in the marts and warehouses they use for queries.
One reason normalization has not been used with data marts and warehouses is the painstaking, time-consuming work of building out the multitude of database tables needed to establish all of the relationships between "like" data items so they can appear as a single item when processed. The propagation of multiple tables (where there used to be one) can also slow down processing. Consequently, it was an easy choice for many IT departments to simply extract data from multiple sources and pile it into a data mart or warehouse for general query processing. It didn't seem to matter much if the data got "old," because it could be refreshed nightly or weekly, and the reports run against it were generally good enough for the ad hoc requests they served.
However, this isn't the case with big data, which enterprises are adopting with enormous expectations. Most have already written goals into their big data roadmaps that include running real-time analytics against well-orchestrated data for instantaneous, actionable information, and expectations for batch processing are no lower.
Normalization + ETL
Decision makers want immediately actionable data. To fit the bill, this data must provide a "single version of the truth" that decision makers can trust, and the logical way to reach that degree of precision is data normalization in conjunction with ETL, not ETL alone.
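One way to picture "normalization in conjunction with ETL" is a transform step that collapses variant spellings of the same data item to a single canonical form before the load. The sketch below is a hedged illustration: the records, the vendor names, and the alias map are invented for the example, not drawn from any real pipeline.

```python
# Illustrative alias map: three extracted variants of one vendor.
# In a real pipeline this mapping would come from a curated reference table.
CANONICAL = {
    "ibm": "IBM",
    "i.b.m.": "IBM",
    "international business machines": "IBM",
}

def normalize(record: dict) -> dict:
    """Transform step: collapse vendor-name variants to one canonical form."""
    key = record["vendor"].strip().lower()
    return {**record, "vendor": CANONICAL.get(key, record["vendor"].strip())}

# Extracted rows as they might arrive from three source systems.
extracted = [
    {"vendor": "IBM", "spend": 100},
    {"vendor": "I.B.M.", "spend": 50},
    {"vendor": "International Business Machines ", "spend": 25},
]

# Transform before load, so the warehouse holds one version of the truth.
loaded = [normalize(r) for r in extracted]

# An analytic query now sees one vendor, not three partial ones.
total = {}
for r in loaded:
    total[r["vendor"]] = total.get(r["vendor"], 0) + r["spend"]
print(total)  # {'IBM': 175}
```

Without the normalize step, the same aggregation would report three separate vendors and three partial totals, which is exactly the "factual confusion" the article describes.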