In the big data context, data harvesting can have different definitions and applications. Some practitioners define it as scraping data from a variety of web-based sources for aggregation and analysis; in other cases, an organization harvests its own internal data, drawn from various systems. In both cases, the goal is to identify and separate the relevant data items from a large body of data so those items can be used in analytics queries.
Data harvesting is sometimes compared to the oil refining process. But while the process of extracting crude oil from the ground and then refining it has evolved into a fine science over many decades, data harvesting and refining is still a work in progress.
There are tools that allow you to extract, transform, and load (ETL) data into smaller data marts that end users can use for their analytics. There are also tools for intelligently analyzing data to determine which data really matters in a given business context, as well as tools that enable you to aggregate dissimilar data types to come up with new data models that will hopefully yield breakthrough answers.
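A minimal sketch of such an ETL step, using hypothetical raw records and an in-memory SQLite "data mart" (the field names and sample data here are illustrative assumptions, not taken from any particular tool):

```python
import sqlite3

# Extract: raw records harvested from assorted sources (hypothetical sample data)
raw_records = [
    {"source": "web", "region": "Northeast", "amount": "120.50"},
    {"source": "crm", "region": "West", "amount": "n/a"},  # malformed record
    {"source": "web", "region": "Northeast", "amount": "75.00"},
]

def transform(records):
    """Keep only records with a parsable amount, normalizing types."""
    clean = []
    for r in records:
        try:
            clean.append((r["region"], float(r["amount"])))
        except ValueError:
            continue  # discard records that cannot be refined
    return clean

def load(rows, conn):
    """Load the refined rows into a small data mart table for analytics."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_mart (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_mart VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(raw_records), conn)
total = conn.execute(
    "SELECT region, SUM(amount) FROM sales_mart GROUP BY region"
).fetchall()
print(total)  # the aggregated view end users actually query
```

In practice the extract step would pull from live systems and the mart would be a real warehouse table, but the shape of the pipeline (extract, clean, load, then query the smaller refined set) is the same.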
Nevertheless, automation and tooling can only take you so far. At some point an organization must apply business insight to its data so it can use its data and data tools to best advantage.
For an organization to optimize its data use, IT leaders must apply business insight to determine the best way to harvest data and get the most out of it during data preparation. Companies shouldn't be content to simply set up report queries against end-product data and do nothing more. Here are three best practices that some organizations use to improve data refining and harvest yields.
1: Define your business cases
Companies need to define the business cases for which they want their data to yield answers. In this way, data harvesting, refining, and discovery is given a mission, and data can be extracted and sent to data marts to facilitate queries that yield answers. In these applications, organizations can better understand when customers are most likely to buy a product, or why there is a greater incidence of a particular ailment in a certain geographic area.
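One of the business cases above, "when are customers most likely to buy," can be answered with a simple query against harvested purchase data. A minimal sketch, assuming a hypothetical list of purchase hours pulled from the data mart:

```python
from collections import Counter

# Hypothetical purchase timestamps from the data mart, reduced to hour of day
purchase_hours = [9, 12, 12, 13, 12, 18, 9, 12]

# Business-case query: at what hour are customers most likely to buy?
by_hour = Counter(purchase_hours)
peak_hour, peak_count = by_hour.most_common(1)[0]
print(f"Peak buying hour: {peak_hour}:00 ({peak_count} purchases)")
```

The point is that the query only exists because the business case was defined first; the harvesting and refining steps were aimed at producing exactly the data this question needs.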
2: Come up with a strategy for your leftover data
Unfortunately, after addressing first-tier concerns (e.g., why and when customers make purchases, or why certain locations are hit hard by a particular medical condition), some companies still leave 80% of their harvested data on the table. In the oil refining analogy, this is the point where the mid-grade gasoline and diesel fuels have been sorted out and the remainder must be reevaluated for other uses.
Companies have three choices for their data leftovers: throw them away; keep them indefinitely as storage costs soar; or further refine and explore them to see what else they can yield. Most organizations opt to keep their data, which has forced them to revisit their storage strategies and decide whether to move some seldom-used data to the cloud.
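A leftover-data strategy like the one above can be expressed as a simple tiering policy. The thresholds below are illustrative assumptions, not industry standards:

```python
from datetime import date, timedelta

# Hypothetical policy thresholds for leftover data
ARCHIVE_AFTER = timedelta(days=90)    # seldom-used -> cheaper cloud storage
DISCARD_AFTER = timedelta(days=730)   # past retention -> candidate to delete

def tier(last_accessed, today):
    """Classify a dataset by how recently it was used."""
    age = today - last_accessed
    if age > DISCARD_AFTER:
        return "discard"
    if age > ARCHIVE_AFTER:
        return "cloud-archive"
    return "hot"

today = date(2024, 6, 1)
datasets = {
    "q1_2021_web_logs": date(2021, 1, 15),
    "last_quarter_sales": date(2024, 1, 10),
    "current_campaign": date(2024, 5, 20),
}
decisions = {name: tier(d, today) for name, d in datasets.items()}
print(decisions)
```

Making the policy explicit forces the decision the article describes: nothing lingers in expensive storage by default, and "keep it" becomes a deliberate choice with a cost attached.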
3: Globalize your data
Companies that choose to further refine and/or explore their data should globalize it. For instance, Sales wants data that tells it which items are selling the most and where, so a business case is created, and data is collected and directed to answer Sales' questions.
After Sales has no further use for it, the data can be discarded or stored, but what if other business areas within the company were made aware of what the raw data contains, and could find another use for it? Perhaps the original data showed that sales were down in the Northeast, and Sales focuses their reps in that region on customer relationship building—but that's all they do with the data.
When a customer service manager sees this data, she discovers that product warranty claims are highest in the Northeast. With this information, she can take corrective actions in service, and forward new information to Engineering and Manufacturing and possibly other departments.
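The cross-department insight in the example above comes from joining two datasets that no single department looks at together. A minimal sketch, using hypothetical regional summaries from Sales and Customer Service:

```python
# Hypothetical regional summaries produced by two different departments
sales_by_region = {"Northeast": 410, "West": 980, "South": 870}
warranty_claims_by_region = {"Northeast": 57, "West": 12, "South": 15}

# Claims per unit sold: a signal neither team sees in its own data alone
claim_rate = {
    region: warranty_claims_by_region[region] / units
    for region, units in sales_by_region.items()
}
worst_region = max(claim_rate, key=claim_rate.get)
print(f"Highest warranty-claim rate: {worst_region} "
      f"({claim_rate[worst_region]:.1%})")
```

Sales sees weak Northeast numbers; Customer Service sees high Northeast claims; only the joined view suggests that product quality, not rep effort, may be the root cause.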
In short, companies should be asking: Who else could use this data that we are harvesting? The end goal is to leverage the value of data throughout the company as much as possible by making it universally accessible.
These three seemingly small steps in data utilization remain largely unexploited in companies. That is why fully exploiting data in harvesting, preparation, and refining should go hand in hand with simply querying and reporting on it, in order to maximize data yields.
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.