Extract, transform, load (ETL) is the most talked-about strategy
for extracting data from multiple sources and systems and then recombining it
into new datasets for query purposes.
ETL has been used in enterprises to populate data marts and warehouses for
years. But ETL alone won't link data into the close associations needed to
normalize it, and so eliminate the extraneous “data fishing” that occurs when
analytic queries chase multiple forms of the same data item in an attempt to
assimilate that data into answers. For this, you need data normalization.

What is data normalization, and how does it work?

For relational
datasets, normalization is applied to minimize data redundancy so there is only
one way to know a fact. The goal is relevant to big data analytics because the
factual confusion generated by disparate variations of the same data can impair
one's ability to arrive at accurate information. If the pathway to meaningful
results from big data queries is clouded by these inaccuracies, enterprises
begin to lose on their big data investments, because the results they get from
analytics are diluted by the inaccurate data the analytics consume.

Data normalization is a technical database operation,
typically performed by a database analyst with the assistance of normalization tools, whose goal is to consolidate the variant forms of the same data item into a single data form. Depending on how tightly the relationships among
data items are constrained, a dataset can be normalized to first,
second, or third normal form, with third normal form imposing the
strictest constraints of the three.
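As a rough illustration of the idea that normalization leaves only one way to know a fact, the sketch below (plain Python, with hypothetical table and field names) decomposes a denormalized order table, in which each row repeats the customer's city, into separate customer and order tables so that the city is stored exactly once per customer.

```python
# Denormalized rows: the customer's city is repeated on every order,
# so the same fact can be recorded inconsistently ("New York" vs "NYC").
orders_raw = [
    {"order_id": 1, "customer": "Acme",   "city": "New York", "total": 250},
    {"order_id": 2, "customer": "Acme",   "city": "NYC",      "total": 100},
    {"order_id": 3, "customer": "Globex", "city": "Boston",   "total": 300},
]

def normalize(rows):
    """Split rows into a customer table and an order table (3NF-style):
    each customer's city is stored once, keyed by customer name."""
    customers, orders = {}, []
    for row in rows:
        # First occurrence wins; later variants would be flagged for cleansing.
        customers.setdefault(row["customer"], {"city": row["city"]})
        orders.append({"order_id": row["order_id"],
                       "customer": row["customer"],
                       "total": row["total"]})
    return customers, orders

customers, orders = normalize(orders_raw)
# There is now only one place to look up Acme's city.
```

After the split, a query about a customer's city has a single authoritative source, which is exactly the property the analytics queries described above depend on.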

Data normalization has been used by enterprises for years
in their online transaction systems, yet, in an interesting quandary, these
same enterprises have often opted not to normalize data for the data marts
and warehouses that they use for queries.

One reason normalization has
not been used with data marts and warehouses is the painstaking, time-consuming
work that can be involved in building out a multitude of database tables
to establish all of the relationships between “like” data items so they can appear
as a single data item when they are processed. The propagation of multiple
tables (where there used to be one) can also slow down processing. Consequently, it was an easy choice for many
IT departments to just extract data from multiple sources and pile it into a
data mart or warehouse for general query processing. It
didn’t seem to matter much if the data got “old,” because it could always
be refreshed on a nightly or weekly basis, and the reports run against the data were
generally good enough to meet the objectives of the ad hoc reporting requests for which they were used.
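The extract-and-pile approach described above can be sketched in a few lines of Python (the source lists and field names here are hypothetical): rows are pulled from several systems and appended, unreconciled, into one repository that a nightly job simply rebuilds.

```python
# Hypothetical extracts from three source systems; note that the same
# customer appears under different spellings and is never reconciled.
crm_rows     = [{"customer": "Acme Corp",        "revenue": 500}]
billing_rows = [{"customer": "ACME CORPORATION", "revenue": 750}]
web_rows     = [{"customer": "Acme",             "revenue": 125}]

def nightly_refresh(*sources):
    """Rebuild the repository from scratch: extract from every source
    and pile the rows together, with no normalization step."""
    repository = []
    for source in sources:
        repository.extend(source)
    return repository

warehouse = nightly_refresh(crm_rows, billing_rows, web_rows)
# A query for Acme's revenue now sees three apparently different customers.
```

For periodic ad hoc reporting this was tolerable; the cost only becomes visible when queries must reconcile those variant spellings on the fly.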

However, this isn’t the case with big
data, which is arriving in enterprises carrying huge expectations. Most enterprises have already written
goals into their big data roadmaps that include running real-time analytics
against well-orchestrated data for instantaneous, actionable information. For
batch processing, the expectations are no lower.

Normalization + ETL

Decision makers want immediately actionable
data. To fit the bill, this data must provide a “single version of the truth” that decision makers can trust. The logical way to reach this degree
of precision is to use data normalization in conjunction with ETL rather than ETL alone.
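As a minimal sketch of that combination (the mapping table and all names are hypothetical), the transform step below folds each incoming variant of a customer name into one canonical form before loading, so downstream queries see a single version of the truth.

```python
# Hypothetical mapping from known variants to one canonical name; this is
# the kind of association a normalization tool would help an analyst build.
CANONICAL = {
    "acme": "Acme Corp",
    "acme corp": "Acme Corp",
    "acme corporation": "Acme Corp",
}

def transform(row):
    """Normalize the customer field during the T step of ETL."""
    key = row["customer"].strip().lower()
    return {**row, "customer": CANONICAL.get(key, row["customer"])}

def etl(*sources):
    """Extract from every source, normalize each row, load into one store."""
    return [transform(row) for source in sources for row in source]

warehouse = etl(
    [{"customer": "Acme Corp",        "revenue": 500}],
    [{"customer": "ACME CORPORATION", "revenue": 750}],
    [{"customer": "acme",             "revenue": 125}],
)
# All three rows now refer to a single customer, "Acme Corp", so a revenue
# query returns one trustworthy total instead of three partial ones.
```

Unmapped names pass through unchanged, which is where the analyst's ongoing curation of the mapping, the "painstaking" work described earlier, comes in.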