Big data refineries: Where the rubber meets the road

Discover why some businesses are struggling with their big data refineries, and the key to the data distillation process that you should keep in mind.


The Merriam-Webster dictionary defines refinery as a "place where the unwanted substances in something (such as oil or sugar) are removed: a place where something is refined." So it's no accident that the process of sifting through avalanches of big data for a few golden nuggets that can revolutionize business insights is known in present IT vernacular as the big data refinery. Unfortunately, many enterprises are struggling with their big data refineries.

The first part of the operation -- plugging into mountains of data and sluicing through raw material in the form of structured and unstructured data -- is as straightforward as lighting a stick of dynamite. Big data vendors and even internal enterprise application programming interfaces (APIs) easily enable sites to plug into Internet of Things (IoT) raw data, corporate website and social media raw data, and data from a bevy of systems of record that report on inventory consumption, sales volume, factory productivity, product rework, delivery schedules, and customer service.

The second stage of the refinery is where the rubber really meets the road; this is where this raw data must be distilled into something that is greater than the mere sum of its parts.

During the data distillation process, organizations articulate their business goals and rules for big data. This includes IT and end users meeting to define big data retention policies and to decide which data is needed for long-term trend analytics vs. which is comparatively short-lived and needed for real-time analytics and response. Data in this phase is aggregated and normalized into forms far more refined than the raw data it started as. To arrive at the right distillation of raw data, IT must understand what the business wants to get out of the data and how it can best be summarized to support these business "needs to know." IT then normalizes the data and applies appropriate algorithms to shape it for the most effective analytics possible.
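To make the aggregate-and-normalize step concrete, here is a minimal sketch in plain Python. The scenario, field names, and units are illustrative assumptions (a hypothetical IoT temperature feed), not a real schema: raw readings are normalized into a canonical form (Celsius, parsed timestamps) and then rolled up into per-sensor hourly averages of the kind an analytics layer would consume.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records, as they might arrive from an IoT feed.
# Field names and units are illustrative assumptions, not a real schema.
raw_readings = [
    {"sensor": "line-1", "ts": "2024-03-01T08:05:00", "temp_f": 98.6},
    {"sensor": "line-1", "ts": "2024-03-01T08:40:00", "temp_f": 100.4},
    {"sensor": "line-2", "ts": "2024-03-01T08:15:00", "temp_f": 95.0},
]

def normalize(record):
    """Convert a raw reading into canonical form: Celsius, hour-bucketed timestamp."""
    return {
        "sensor": record["sensor"],
        "hour": datetime.fromisoformat(record["ts"]).replace(minute=0, second=0),
        "temp_c": round((record["temp_f"] - 32) * 5 / 9, 2),
    }

def aggregate_hourly(records):
    """Roll normalized readings up into per-sensor hourly averages."""
    buckets = defaultdict(list)
    for r in map(normalize, records):
        buckets[(r["sensor"], r["hour"])].append(r["temp_c"])
    return {key: round(sum(vals) / len(vals), 2) for key, vals in buckets.items()}

summary = aggregate_hourly(raw_readings)
```

The point of the sketch is the shape of the work, not the arithmetic: the normalization rules (which units, which time buckets) are exactly the business decisions IT and end users settle during distillation.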

In a nice overview of the data refinery process, Hortonworks, which provides an enterprise Hadoop platform, talks about how companies "can enhance their ability to more accurately understand the customer behaviours that led to the transactions." With well-distilled data flowing into their analytics, companies can glean more incisive insights into virtually every aspect of the business.

This brings us to the other side of the distillation process that organizations must consider, which is how the results of analytics on distilled data will subsequently be distributed and used.

As in traditional batch reporting, there are generally two usage modes for distilled data: offloading the data to localized data marts, where para-IT users in different business areas can query it with their own analytics reporting tools; and distributing finished reports and dashboards built from this data to executives, line managers, and others who need the critical, actionable business information they contain.
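The first of those modes, offloading to a data mart, can be sketched in a few lines. This is a toy illustration under stated assumptions: the table, column names, and values are hypothetical, and SQLite stands in for whatever relational store a real mart would use. The idea is simply that distilled results land in a queryable table that business-area users can hit with their own tools.

```python
import sqlite3

# Distilled results ready for offload; rows are (sensor, day, avg_temp_c).
# All names and values here are illustrative, not a real mart schema.
distilled = [
    ("line-1", "2024-03-01", 37.5),
    ("line-2", "2024-03-01", 35.0),
]

# SQLite stands in for the localized data mart.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mart_hourly_temps (sensor TEXT, day TEXT, avg_temp_c REAL)"
)
conn.executemany("INSERT INTO mart_hourly_temps VALUES (?, ?, ?)", distilled)

# A para-IT analyst could then run ad hoc queries against the mart:
rows = conn.execute(
    "SELECT sensor, avg_temp_c FROM mart_hourly_temps WHERE avg_temp_c > 36"
).fetchall()
```

The second mode, finished reports and dashboards, is just a packaged, scheduled version of queries like this one, with the results pushed to the people who need them rather than pulled on demand.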

How well these reports and data marts hit their mark in delivering critical business information will be the yardstick that determines whether further adjustments are made to data refining and distillation. Even if the process works well, IT can almost count on ongoing enhancements to the data refinery, because the business and what it needs to know are always changing. The key is keeping the final products of data distillation (and their fit with the business) in mind as much as the raw data that enters the process.