Project failures are more likely when there is no preparation. Here's what to consider when preparing for big data projects.
I recently tried a new chicken teriyaki recipe and had great results. I'd like to take credit as the cook, but I give myself a B in cuisine, at best. The trick was the overnight marinade, as prescribed in the recipe and dutifully followed.
Data is like that, too: If you don't prepare data in advance for optimal performance, it isn't going to please those who consume it. In fact, poor data preparation is a leading cause of big data project failures, a risk that those managing such projects can't afford to ignore. For this reason alone, it is critical for organizations to have a big data preparation strategy and methodology, and to faithfully execute it.
A data preparation strategy should contain the following elements:
- A thorough understanding of present and future business questions the data is expected to yield answers for. Knowing the areas of the business where big data analytics are to be applied establishes a business context for the data and helps to shape the data-gathering and execution strategy. The objective in this phase is to identify which data in your enterprise are relevant to key business questions, and which aren't. You can also expand the business questions and the data you seek as business needs change, but initially it is best to keep the data focus tight.
- Data centralization. Data must be normalized so it is consistent and everyone throughout the enterprise uses the same data. This makes it essential to house all data for analytics in a centralized repository maintained by IT, even though you may choose to populate different subsets of this master data for specific business areas.
- Identification of data sources that must feed into the central analytics information repository. Once business cases and questions are defined, identify the datasets and sources that, taken together, can answer the burning questions of the business. These data sources can come from inside or outside the enterprise.
- Identification of future data sources that are likely to become relevant. At the same time, it isn't too soon to begin identifying additional data sets or sources that might be needed by the business in the future. These data sources will not initially have data prepared, but their identification will provide a roadmap for future data preparation.
- Defined data preparation methodology. Moving clean data into a central repository involves three fundamental steps. First, data is extracted from its source. Next, it is transformed into a format compatible with its destination. Finally, it is loaded into the destination repository. The transformation is the critical part: if the same data field flows into a destination whose format differs from the original's, the data must be converted to the new format so it remains usable and consistent once it lands. This step is tedious when done by hand, so automation tools are needed.
- Selected effective data preparation tools. There are myriad data preparation tools on the market, so companies are advised to try them and work with vendors that offer strong support and training. The goals should be to prepare your data so it is of the highest quality, and to choose tools that are easy to use and that provide a means of automating data preparation steps.
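To make the extract-transform-load steps above concrete, here is a minimal sketch in Python. The field names, date formats, and the use of an in-memory CSV and SQLite database are all illustrative assumptions, not a prescription; in practice this logic lives inside whatever data preparation tool you select.

```python
# Minimal ETL sketch (hypothetical schema): extract rows from a source CSV,
# transform each row into the destination's expected formats, and load the
# result into a SQLite table standing in for the central analytics repository.
import csv
import io
import sqlite3
from datetime import datetime

def transform(row):
    """Normalize one source row to the destination schema (assumed formats)."""
    return {
        "customer_id": row["cust_id"].strip(),
        # Source dates arrive as MM/DD/YYYY; the repository expects ISO 8601.
        "order_date": datetime.strptime(row["date"], "%m/%d/%Y").strftime("%Y-%m-%d"),
        "amount": round(float(row["amount"]), 2),
    }

def load(rows, conn):
    """Load transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (customer_id TEXT, order_date TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:customer_id, :order_date, :amount)", rows
    )
    conn.commit()

# Extract: a small in-memory CSV stands in for the real source system.
source = io.StringIO("cust_id,date,amount\n A-17,03/05/2021,19.990\n")
rows = [transform(r) for r in csv.DictReader(source)]

conn = sqlite3.connect(":memory:")
load(rows, conn)
print(conn.execute("SELECT * FROM orders").fetchall())
```

Note that the transform function carries all the format knowledge; when a destination's format changes, only that one step needs updating, which is exactly what automated data preparation tools manage for you at scale.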