There can be many steps to data preparation when it comes to big data, but some of the principal processes include collecting the data, cleaning it for abnormalities, filtering it so you can discard data that you know you won't need, and normalizing the data so it is congruent and can integrate with the rest of the big data you have in your data repositories.
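To make those stages concrete, here is a minimal sketch of cleaning, filtering, and normalizing a handful of already-collected records. Everything in it is an assumption made for illustration: the sample records, the field names, the `MAX_PLAUSIBLE_AMOUNT` threshold, and the `UNNEEDED_SOURCES` set are hypothetical and do not come from any vendor's product or a real data set.

```python
# Hypothetical raw records, as if already collected from several source systems.
RAW_RECORDS = [
    {"id": "1", "region": " us-east ", "amount": "120.50", "source": "crm"},
    {"id": "2", "region": "US-East", "amount": "n/a", "source": "crm"},
    {"id": "3", "region": "eu-west", "amount": "87", "source": "test"},
    {"id": "4", "region": "EU-WEST", "amount": "9999999", "source": "erp"},
    {"id": "5", "region": "eu-west", "amount": "95.25", "source": "erp"},
]

# Assumed business rules (illustrative only).
MAX_PLAUSIBLE_AMOUNT = 100_000.0   # anything above this is treated as abnormal
UNNEEDED_SOURCES = {"test"}        # data we already know we will not need


def clean(records):
    """Drop records whose amount is unparseable or outside the plausible range."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # e.g. "n/a"
        if not 0 <= amount <= MAX_PLAUSIBLE_AMOUNT:
            continue  # abnormal value (also rejects NaN)
        cleaned.append({**record, "amount": amount})
    return cleaned


def filter_unneeded(records):
    """Discard records from sources we know we do not need."""
    return [r for r in records if r["source"] not in UNNEEDED_SOURCES]


def normalize(records):
    """Bring every record to one consistent shape so it can be merged later."""
    return [
        {
            "id": int(r["id"]),
            "region": r["region"].strip().lower(),
            "amount": round(r["amount"], 2),
            "source": r["source"],
        }
        for r in records
    ]


def prepare(raw_records):
    """Run the clean, filter, and normalize stages in order."""
    return normalize(filter_unneeded(clean(raw_records)))


if __name__ == "__main__":
    for row in prepare(RAW_RECORDS):
        print(row)
```

In this toy run, only records 1 and 5 survive. Real pipelines involve far more rules and far more data, which is part of why the work consumes so much analyst time.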
According to a poll by data mining and crowdsourcing firm CrowdFlower, data scientists and analysts can spend as much as 80% of their time cleaning and preparing data, which is often a painstaking and laborious process.
For this reason alone, the idea of outsourcing the data cleaning and preparation process is beginning to gain traction within organizations.
"There are mountains of data that come into organizations from diverse sources, but there is also data locked up in internal company systems themselves," said Rob Consoli, chief revenue officer of Liaison Technologies, which provides data preparation and integration solutions. "Companies can only begin to unlock the informational value of this data if they can find a way to aggregate and query data from disparate systems in innovative ways."
Amazon is one example. It is transforming retail with its recent acquisition of Whole Foods and, now, its deal with Kohl's to handle merchandise returns in several cities. But it also relies on an IT strategy that can quickly amalgamate big data from new and diverse sources, so it can automate business operations like inventory control and gauge customer demand while it expands its business.
Companies understand the importance of being able to exploit all of the data they collect. They also understand that before data can be quickly aggregated and blended, it has to be prepared so that it can all work together. When this process is tedious and ties up key people, a company cannot compete with more "fleet of foot" rivals that have mastered time to market with their data preparation and analytics.
Liaison and other vendors offer cloud-based services to assist companies with data preparation efforts, and the idea is gaining traction, but there are also concerns.
High on the list is governance. Just last year, Ponemon Institute surveyed 1,864 IT practitioners and 70% reported that they felt managing privacy and data protection in the cloud was more difficult than managing it within their own network.
A second area of concern is data safekeeping in a multi-tenant cloud solution. How do you ensure that your data won't be shared with others?
Consoli says that his company addresses these concerns by ensuring that it is certified to all of the key industry standards, such as HIPAA and PCI. His company's preferred practice is also not to keep the data it prepares, but to deliver it back to companies in a clean form that they can curate in their own data repositories.
Outsourcing, of course, is not a strategy that works in every situation—but if more companies find that they can save time and money by outsourcing data prep instead of doing it themselves, more will make the move.
Meanwhile, IT leaders contemplating outsourcing all or some of their big data preparation can begin by considering these three things:
- The ROI the project can return.
- Concern about job loss due to outsourcing. If your company already has internal data managers, cross-train them proactively to allay their fears of joblessness.
- Whether your governance and data security standards will be met if you use a cloud-based vendor.
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.