Get expert advice about managing data quality, public clouds, governance standards, and much more in your big data projects.
Big data projects are well underway in most companies, but we are only now beginning to distill a set of IT management best practices for big data that can be plugged into IT playbooks. Here is what we've learned so far, and how companies are incorporating big data projects into well-managed, capably executed IT efforts.
1: Big data projects are iterative in nature and require agile, prototype-driven approaches
Not only do new sources of big data constantly emerge, but the questions that business decision makers want to ask of this data are continuously evolving. This is why sandbox environments that enable data analysts and scientists to quickly query big data and then publish the results are critical to the big data value process.
The data that these queries operate on is not neatly structured into the fixed-length records of systems of record (SOR), where you know that the end product will be an order, a customer, or a part record. You therefore need an iterative process that can operate in this unpredictable data environment.
2: Consider a productive role for the cloud in your big data strategy
Large enterprises in particular tend to shy away from the use of public clouds because they are nervous about security and governance. However, in many cases public clouds are ideal environments for rapid big data analytics prototyping, as long as you move your prototypes off the public cloud as soon as you are through running them.
Public clouds can also be economical places to stash and to archive your raw big data. Public clouds, and how you choose to use them, should be clearly articulated in your IT policy.
3: Use your SOR data as a matrix for big data
One of the greatest challenges in big data projects is finding ways to organize the data for best results. Many companies have discovered that they already have an organizational framework for their big data in their SOR data. For this reason, many companies use the data vectors from their SOR data and simply overlay these organizational frameworks on their big data.
Customer data is a prime example. Within the SOR customer master file record you already have the customer's name, address, and possibly other demographics. If you later choose to add web storefront usage patterns and propensities from this customer, you can append the web-based big data to the SOR data for a more complete picture of the customer and how that customer is interacting with your company.
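As a rough illustration of this overlay, the sketch below appends summarized web usage to customer master records, using the customer ID from the SOR as the organizing key. The field names, IDs, and the `enrich_customers` helper are hypothetical, invented for the example rather than taken from any particular system.

```python
from collections import defaultdict

# Hypothetical SOR customer master records, keyed by customer ID.
sor_customers = {
    "C001": {"name": "Ada Park", "address": "12 Elm St"},
    "C002": {"name": "Ben Ortiz", "address": "9 Oak Ave"},
}

# Hypothetical web storefront events (the "big data" side).
web_events = [
    {"customer_id": "C001", "page": "/shoes", "seconds": 40},
    {"customer_id": "C001", "page": "/checkout", "seconds": 15},
    {"customer_id": "C999", "page": "/home", "seconds": 5},  # no SOR match
]


def enrich_customers(customers, events):
    """Append summarized web usage to each SOR customer record."""
    usage = defaultdict(lambda: {"page_views": 0, "total_seconds": 0})
    for event in events:
        cid = event["customer_id"]
        if cid not in customers:
            continue  # the SOR defines which customers we organize around
        usage[cid]["page_views"] += 1
        usage[cid]["total_seconds"] += event["seconds"]

    enriched = {}
    for cid, record in customers.items():
        # Customers with no web activity get zeroed usage fields.
        web = usage.get(cid, {"page_views": 0, "total_seconds": 0})
        enriched[cid] = {**record, **web}
    return enriched


profiles = enrich_customers(sor_customers, web_events)
```

In this sketch, events that cannot be matched to an SOR customer are simply skipped; a real project would decide whether to discard, park, or investigate them.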
4: Prune your big data as soon as possible
There is a tendency for companies to maintain all of their incoming big data in raw form, even though much of it may never be used. The concern is that future queries might require big data that is not being used today, so IT is playing it safe by just keeping all of the data.
However, there is an equally strong argument for sizing down the amount of big data that you accumulate. Some of this data, such as jitter from network and machine handoffs, is likely never to be used. The same goes for much of the overhead data generated by website interactions.
Developing criteria and methods for stripping away data that you consider strategically unimportant, in both the short and the long term, is one way to control the data deluge and the cost of storing it all.
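One simple way to picture such criteria is as an explicit discard list applied when data is ingested. The sketch below assumes each incoming record carries a `type` field; the type names and the `prune` helper are hypothetical and would need to reflect your own retention decisions.

```python
# Hypothetical criteria: record types judged strategically unimportant.
DISCARD_TYPES = {"network_jitter", "session_heartbeat"}


def prune(records, discard_types=DISCARD_TYPES):
    """Return the records worth keeping and a count of those dropped."""
    kept = [r for r in records if r.get("type") not in discard_types]
    return kept, len(records) - len(kept)


incoming = [
    {"type": "purchase", "amount": 42.0},
    {"type": "network_jitter", "ms": 3},
    {"type": "page_view", "page": "/shoes"},
    {"type": "session_heartbeat"},
]

kept, dropped = prune(incoming)
```

Reporting the dropped count alongside the kept records makes it easier to review how much data the criteria are removing over time.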
5: Establish governance standards for big data
Governance standards can be established based upon big data business use cases. Who should have access to the data, and how much access should various individuals have? Are there data privacy issues involved? What other governance issues should you be concerned with? Is this data (and any resulting big data application prototyping) acceptable in a public cloud environment? Defining governance guidelines should be an upfront task in every big data project.
6: Understand the data quality tradeoffs between in-stream, real-time, and batch analytics
Real-time and in-stream big data analytics must deliver results as the data arrives, so there is limited opportunity to clean the data or to perform data quality checks. Consequently, those making decisions based upon this data should be apprised of its potential to mislead because of possible data quality issues. This is a good reason to vet real-time big data analytics requests for the data quality risks they introduce. Most companies are running batch analytics on big data, and in these cases data quality risks are lessened because there is time to clean the data and verify its quality.
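The tradeoff can be sketched in a few lines. In the hypothetical example below, the batch path has time to drop duplicates and invalid readings, while the real-time path only attaches a warning flag so that whoever consumes the result knows the data was not cleaned. The record layout, the 0 to 100 valid range, and the function names are all assumptions made for illustration.

```python
def is_valid(reading):
    """Hypothetical rule: a value must be numeric and between 0 and 100."""
    value = reading.get("value")
    return isinstance(value, (int, float)) and 0 <= value <= 100


def clean_batch(readings):
    """Batch path: there is time to drop duplicates and invalid readings."""
    seen = set()
    clean = []
    for r in readings:
        key = (r.get("sensor"), r.get("ts"))
        if key in seen or not is_valid(r):
            continue
        seen.add(key)
        clean.append(r)
    return clean


def tag_stream_reading(reading):
    """Real-time path: pass the reading through, but flag its quality."""
    return {**reading, "cleaned": False, "suspect": not is_valid(reading)}


readings = [
    {"sensor": "a", "ts": 1, "value": 20.5},
    {"sensor": "a", "ts": 1, "value": 20.5},   # duplicate
    {"sensor": "b", "ts": 1, "value": None},   # missing value
    {"sensor": "a", "ts": 2, "value": 250},    # out of range
]

batch_result = clean_batch(readings)
stream_result = [tag_stream_reading(r) for r in readings]
```

Here the batch result contains only the one trustworthy reading, while the stream result keeps all four and marks two of them as suspect. Note that the duplicate is not caught on the real-time path at all, which is exactly the kind of risk decision makers should be told about.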
7: Plan for disruption
You never know which new big data sources and methodologies are around the corner. Every IT manager engaged with big data should accept big data technology and business disruptions as a way of life, and should enact strategies and architectures that are sufficiently malleable to accommodate these new developments.