Assuming you have defined the mission-critical areas of the enterprise where Big Data is needed, the next step is developing an end-to-end workflow for Big Data that gets it from the point of collection to the point where it can be actively queried for answers to the questions the business is asking.
The job is easier said than done, because many enterprises begin with Big Data modeling and application development as departmental functions. However, as Big Data begins to get “spun out” throughout the enterprise, corporate managers are sensing a need to centralize the resource. For most, this centralization means hosting Big Data in the data center, with IT assuming both a custodial and a management role for systems.
The transition to Big Data centralization in the data center is just beginning to happen, which is why it is imperative for IT to start thinking now about how the workflow of Big Data is going to fit with other data center operations.
In essence, there are three primary workflow stages for Big Data that IT must address.
The first stage involves collecting all of the Big Data that you’re going to need for a particular application. This can be a daunting task because Big Data can literally come from anywhere in the enterprise.
For example, several years ago, a large enterprise with exceptionally stringent security wanted to create a “dossier” on each of its thousands of employees. What the company wanted was a “headshot” of each employee, a fingerprint for each employee, and also text-based information about each employee, such as the employee’s current and past job functions and achievements within the company. The result was a mix of structured and unstructured data that originated from several different locations within the enterprise, all of which was brought together into a single, semi-structured employee “container” labeled “John Jones,” “Mary White,” and so on.
Today, this task of “containerizing” data under a single label (like “employee”) is even more complex, because companies also want to add social media data that originates outside of the enterprise. The key for the data analyst is to clearly define with participating business units just what they want in their Big Data “containers,” and then to devise a strategy to locate and collect the data.
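The “container” idea above can be sketched as a single semi-structured document that holds structured HR fields, references to binary artifacts, and external social data side by side. This is a minimal illustration, not a production design; the function, field names, and paths are all hypothetical.

```python
import json

def build_employee_container(employee_id, hr_record, headshot_path,
                             fingerprint_path, social_posts):
    """Assemble one semi-structured 'container' per employee.

    Structured fields (name, job history), unstructured binaries stored
    by reference (headshot, fingerprint), and external social media data
    all live together in one JSON document keyed to the employee.
    """
    return {
        "employee_id": employee_id,
        "name": hr_record["name"],
        "job_history": hr_record["job_history"],   # structured, from HR system
        "headshot": headshot_path,                 # unstructured binary, by reference
        "fingerprint": fingerprint_path,           # unstructured binary, by reference
        "social_media": social_posts,              # external, unstructured
    }

container = build_employee_container(
    "E1001",
    {"name": "John Jones",
     "job_history": ["Analyst (2015-2018)", "Manager (2018-)"]},
    "/blobs/headshots/E1001.jpg",
    "/blobs/fingerprints/E1001.dat",
    ["Spoke at industry conference"],
)
print(json.dumps(container, indent=2))
```

Storing the binaries by reference rather than inline keeps the container small enough to index and query, which matters once there are thousands of them.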
The next workflow stage involves preparing Big Data so it can be readily analyzed by the applications and queries that will be run against it. Often, this job involves considerable integration with other systems. Let’s say, for instance, that a retailer wants to analyze its biggest “social influencers” over social media and also study those influencers’ personal buying histories. The query will likely require a cross-examination of both Big Data coming in from social media and transaction data that could come from an order system on a mainframe or a SQL server. It will be left to IT to determine the system integration points and to effect the integration, and it will be IT that determines when data can be pulled from these different systems into a “composite” record in a data mart or warehouse that applications and queries can be run against. This will require revisions to existing data center workflows that address scheduling and how data in warehouse repositories and transaction systems are updated and synchronized.
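The “composite record” step can be sketched as a join across the two hypothetical sources: social influence scores keyed by customer ID, and order history from the transaction system. All identifiers and figures here are invented for illustration.

```python
# Hypothetical source 1: influence scores derived from a social media feed
influencers = {
    "C042": {"handle": "@deal_hunter", "influence_score": 87},
    "C099": {"handle": "@style_maven", "influence_score": 92},
}

# Hypothetical source 2: transaction records pulled from an order system
orders = [
    {"customer_id": "C042", "sku": "SKU-1", "amount": 59.99},
    {"customer_id": "C042", "sku": "SKU-7", "amount": 120.00},
    {"customer_id": "C099", "sku": "SKU-3", "amount": 35.50},
]

def build_composite(influencers, orders):
    """Join the two feeds on customer ID into composite mart records."""
    composite = {}
    for cid, info in influencers.items():
        history = [o for o in orders if o["customer_id"] == cid]
        composite[cid] = {
            **info,
            "orders": history,
            "lifetime_spend": round(sum(o["amount"] for o in history), 2),
        }
    return composite

mart = build_composite(influencers, orders)
print(mart["C042"]["lifetime_spend"])  # 179.99
```

In practice the scheduling question the paragraph raises (when each feed is pulled and how often the composite is refreshed) is usually harder than the join itself.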
The third workflow step is mining the data for information with applications and queries. In most cases, these requests can be run independently and on demand by end users, but it will be up to IT to tune systems for performance and service levels.
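Once composite records land in a mart, the on-demand query and the tuning responsibility split roughly as follows; this sketch uses an in-memory SQLite table with made-up data, where the index creation stands in for the kind of performance work that falls to IT.

```python
import sqlite3

# Hypothetical mart table of composite influencer/sales records
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE influencer_sales (
    customer_id TEXT, influence_score INTEGER, lifetime_spend REAL)""")
conn.executemany(
    "INSERT INTO influencer_sales VALUES (?, ?, ?)",
    [("C042", 87, 179.99), ("C099", 92, 35.50), ("C123", 40, 310.00)],
)

# IT's tuning side: an index on the column end users filter on
conn.execute("CREATE INDEX idx_score ON influencer_sales(influence_score)")

# The end user's side: an ad-hoc, on-demand query
rows = conn.execute(
    """SELECT customer_id, lifetime_spend
       FROM influencer_sales
       WHERE influence_score >= 80
       ORDER BY lifetime_spend DESC"""
).fetchall()
print(rows)  # [('C042', 179.99), ('C099', 35.5)]
```

The point of the split is that end users never see the index; they only see whether their query comes back within the agreed service level.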
The bottom line for IT is that adding Big Data to the data center will not be a “turnkey” event. There could be major revisions to existing data center workflows, new integration requirements between systems, and even new service level expectations from end users, because we are entering an era where Big Data analytics will be expected to operate in the same real-time environment that transaction processing does.