How to tackle data discovery

Just because it's a big, ominous task doesn't mean you can ignore it. Get control of your data so you can learn from it.



By its nature, data discovery is tedious, cumbersome and confusing. With data pouring in from everywhere, business objectives continuously being fine-tuned, and staff at a loss as to where to start, assessing data and data combinations for value and then extracting insights from them can be formidable tasks.


Business analytics provider MicroStrategy defines data discovery as: "The collection and analysis of data from various sources to gain insight from hidden patterns and trends. Through the data discovery process, data is gathered, combined, and analyzed in a sequence of steps. The goal is to make messy and scattered data clean, understandable, and user-friendly."

To maximize value from data of all types, organizations have to do data discovery.

Here are some steps that organizations can take to make data discovery easier and more valuable to the company.

Define a set of repeatable data cleaning processes, and operationalize them

Data, like gold or silver, comes embedded in dirt and rocks. You have to remove what's irrelevant to get to the relevant. You can't guarantee business value from data until you know that your company is working with clean and accurate data.


Standard processes should be in place at every point where data enters your company to ensure that data is coming from vetted sources and that it conforms to your corporate governance standards. Erroneous and duplicate data must be identified and eliminated. In other cases, data must be normalized so that different data names referring to the same data item are standardized to a single data name. If you use third-party sources for data, their data cleaning techniques should also be vetted. 
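As a rough illustration of what a repeatable cleaning step might look like, here is a minimal Python sketch that standardizes field names, drops incomplete records, and removes duplicates. The field names, the alias map, and the rule that a record is a duplicate when its customer ID and email match are all assumptions made for the example, not a prescribed standard; real governance rules will differ.

```python
# Hypothetical alias map: every source-specific field name on the left
# is standardized to the single canonical name on the right.
FIELD_ALIASES = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "e-mail": "email",
    "Email": "email",
}


def normalize_record(record):
    """Rename aliased fields and trim/lowercase string values."""
    cleaned = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key, key)
        if isinstance(value, str):
            value = value.strip().lower()
        cleaned[canonical] = value
    return cleaned


def clean_records(records, required=("customer_id", "email")):
    """Normalize records, drop incomplete ones, and remove duplicates."""
    seen = set()
    result = []
    for record in records:
        cleaned = normalize_record(record)
        # Treat a record as erroneous if a required field is missing or empty.
        if any(not cleaned.get(field) for field in required):
            continue
        # Assumption: two records are duplicates if all required fields match.
        key = tuple(cleaned[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
    return result
```

Running the same function at every entry point is what makes the process repeatable: each source can use its own field names, but everything downstream sees one consistent shape.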

Fortunately, tools and automation are available for many of these data cleaning tasks. It's often hard to justify the ROI of investing in these tools, but, like corporate security, they are a necessary investment that guards against faulty business decisions based on poor data.

Keep your data fresh

Like yesterday's news, data ages quickly. Operational processes should be in place to refresh data at regular intervals, whether those intervals are real-time, daily, weekly, or monthly. Business units' data needs also change so quickly that data that is useful today might not be useful six months from now. To avoid storing and continuing to process data that is no longer relevant, IT should meet with business units at least annually to determine which data is still relevant and which no longer needs to be retained. This helps tamp down the volume of data you're storing as well as your storage costs.
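One way to operationalize a refresh schedule is a small check that compares each dataset's last refresh time against its agreed interval. The sketch below assumes a hypothetical table of dataset names and intervals; the names and values are invented for illustration. Datasets with no agreed interval are flagged too, on the assumption that they are candidates for the retention review with the business units.

```python
from datetime import datetime, timedelta

# Hypothetical refresh intervals agreed with each business unit.
REFRESH_INTERVALS = {
    "sales_orders": timedelta(days=1),
    "customer_survey": timedelta(weeks=1),
    "market_report": timedelta(days=30),
}


def find_stale(last_refreshed, now):
    """Return the names of datasets that are overdue or have no agreed interval."""
    stale = []
    for name, refreshed_at in last_refreshed.items():
        interval = REFRESH_INTERVALS.get(name)
        if interval is None:
            # No agreed interval: flag it so someone decides whether to keep it.
            stale.append(name)
        elif now - refreshed_at > interval:
            stale.append(name)
    return sorted(stale)
```

A scheduled job could call `find_stale` with the current time and route the result to whoever owns the refresh process.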


Use machine learning for pattern recognition

Machine learning, a subset of AI, has a place in data discovery: it can surface hidden patterns in the data that a human-developed algorithm or human observation might miss. This makes your data discovery process all the more powerful because it broadens the range of insights you are likely to find.
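To make the idea concrete, the sketch below uses k-means clustering, one common unsupervised technique chosen here purely for illustration, to group records without being told in advance what the groups are. It is a bare-bones version in plain Python that assumes numeric data and a known number of clusters; production work would more likely rely on an established machine learning library.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, labels)."""
    # Deterministic start: use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Made-up sample: (monthly visits, monthly spend in hundreds) per customer.
customers = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (1.1, 1.0), (9.1, 8.9)]
centroids, labels = kmeans(customers, k=2)
```

Here the algorithm separates occasional, low-spend customers from frequent, high-spend ones without that distinction being coded in by hand, which is the kind of pattern a manual rule might not anticipate.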

Don't forget about dark data

There are troves of dark, unstructured data in the form of photos, videos, and paper-based documents that are cached away in corporate storerooms and closets. As part of their digitization efforts, companies should review this dark data and determine which should be digitized and linked into data repositories and which should be discarded.
