Big data, which comes into the enterprise unstructured and unorganized, first needs to be "prepped" so that it can be processed by a business analytics program. Here's what you need to do.
Now that business analytics are here and enterprises are grappling with their own "big data," it's time to set some technical strategies in motion to harness these assets. Fortunately, data center solutions that can deliver both high-performance computing (HPC) and big data analytics are becoming increasingly scalable and affordable, even for medium-sized businesses.
The main initial challenge is getting your big data ready for analytics computing. Big data, which comes into the enterprise unstructured and unorganized, first needs to be "prepped" so that it can be processed by a business analytics program. This is no small task, because "cleaning up" big data goes through several phases. These phases include:
- An automated process of data deduplication, where duplicate data records are removed;
- An additional data cleanup process (which may need to be manual!) where erroneous information is removed and/or corrected in the data;
- A revisitation of data retention policies with end business users, so there is mutual agreement as to how long data in specific data sets is going to be maintained;
- A review of your data retention policies with outside auditors or examiners, to ensure that your data retention policies meet the compliance standards for your industry.
Depending upon the volume of unstructured data in your data center, these technical tasks can be prodigious, even if you have a set of automated tools to address them. Nevertheless, sanitizing your data so it can be readied for high-quality business analytics is a necessary first step, and one for which you need business user support, since it requires time and investment.
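The first two cleanup phases can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any particular product: the function names (`normalize`, `fingerprint`, `deduplicate`, `flag_for_review`), the record layout, and the rule that a record missing a required field needs manual review are all assumptions made for the example.

```python
import hashlib

def normalize(record: dict) -> dict:
    """Trim and lowercase string values so trivially different copies match."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def fingerprint(record: dict) -> str:
    """Hash the normalized, key-sorted fields to get a stable duplicate key."""
    canonical = repr(sorted(normalize(record).items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Automated phase: keep the first occurrence of each record."""
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def flag_for_review(records, required_fields):
    """Cleanup phase: separate clean records from ones a person must correct.

    Assumption: a record with an empty or missing required field is suspect.
    """
    clean, suspect = [], []
    for record in records:
        missing = [f for f in required_fields if not record.get(f)]
        (suspect if missing else clean).append(record)
    return clean, suspect

# Hypothetical sample data
records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": " ann lee ", "email": "ANN@example.com"},  # duplicate after normalizing
    {"name": "Bob", "email": ""},                        # needs manual correction
]
unique = deduplicate(records)
clean, suspect = flag_for_review(unique, ["name", "email"])
```

Hashing a normalized form is one simple way to catch near-identical copies. Real data sets usually need fuzzier matching, which is part of why the second phase may stay manual.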
A second area of big data/analytics preparation is ensuring that you have the right servers and software in place to run business analytics with big data. Traditional transaction servers can't be redeployed for this work, because they were designed to serially process transaction records of fixed record lengths. In contrast, big data analytics requires simultaneous, parallel processing of data that does not have fixed record lengths. If you incorporate big data analytics processing in your data center, you will need specialized server clusters to do it. The good news is that many of these big data/analytics processing solutions are now scalable in both server capacity and price. They also come with software automation that can run these big data/analytics workloads without a great deal of manual intervention from IT. System automation is a real benefit if you are implementing your first business analytics with parallel cluster computing, because this is an area where most data center staffs don't yet have much experience.
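To make the contrast with serial, fixed-length processing concrete, here is a minimal sketch of the split, process, and merge pattern that parallel analytics relies on. It is an illustration under stated assumptions: the JSON record layout and the `products` field are hypothetical, and a local thread pool stands in for a server cluster, so it shows the shape of the work rather than real cluster performance.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_products(lines):
    """Map step: tally product mentions in one chunk of variable-length records."""
    tally = Counter()
    for line in lines:
        record = json.loads(line)
        # Records need not share a schema; a missing field is simply skipped.
        for product in record.get("products", []):
            tally[product] += 1
    return tally

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_count(lines, workers=4, chunk_size=2):
    """Process each chunk concurrently, then merge the partial tallies."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_products, chunked(lines, chunk_size)):
            total.update(partial)  # Counter.update adds counts together
    return total

# Hypothetical input: records of differing length and structure
lines = [
    '{"user": "a", "products": ["x", "y"]}',
    '{"note": "free text with no products field"}',
    '{"products": ["x"], "tags": ["t1", "t2", "t3"]}',
]
totals = parallel_count(lines)
```

Because each chunk is tallied independently and the results are merged afterward, the same work can be spread across as many workers as are available, which is the property that fixed-record serial processing lacks.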
The third cornerstone of a technical IT big data strategy is revising the workloads and operations of the data center itself. Most data centers prioritize work around transaction processing. Other computing, such as batch work and report generation, is run at a lower priority. But with today's demands for real-time or near-real-time business analytics, all of this changes. If the end business wants to see emerging buying trends at the same time that sales are being made, both transactional and analytic workloads will require top priority. This reprioritization will significantly affect how data centers organize their operations and workloads, and will likely require an interdisciplinary IT project team to come up with a plan. At the same time, IT infrastructure management software should be evaluated to confirm that it can handle "mixed" workloads of transactions and analytics that are running at high priorities on a mix of transaction and analytics servers.
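The reprioritization described above can be pictured with a small priority queue. This is a toy sketch, not how any specific infrastructure management product schedules work: the `WorkloadQueue` class, the workload names, and the two priority levels are assumptions chosen for illustration.

```python
import heapq
import itertools

# Assumed policy: transactions and analytics share top priority (0);
# batch work and report generation run at a lower priority (1).
MIXED_PRIORITIES = {"transaction": 0, "analytics": 0, "batch": 1, "report": 1}

class WorkloadQueue:
    """Dispatch jobs by priority, then by submission order within a priority."""

    def __init__(self, priorities=MIXED_PRIORITIES):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self._priorities = priorities

    def submit(self, kind, name):
        entry = (self._priorities[kind], next(self._order), kind, name)
        heapq.heappush(self._heap, entry)

    def next_job(self):
        _, _, kind, name = heapq.heappop(self._heap)
        return kind, name

queue = WorkloadQueue()
queue.submit("batch", "nightly-rollup")
queue.submit("analytics", "buying-trends")
queue.submit("transaction", "sale-1")

dispatched = [queue.next_job() for _ in range(3)]
```

Under this policy the analytics job is dispatched ahead of the earlier batch job and alongside the transaction, which is the behavior a "mixed" workload plan would need to guarantee.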