Big data analysis can offer a clear path to critical business insights. New tools and platforms have made the process of understanding trends within a business much faster and, in some cases, exponentially cheaper.

The early days of big data usually saw analysis completed on data from a single asset, but modern data capture and processing mechanisms have companies routinely dealing with a number of seemingly disparate data sets. A myriad of assets potentially means a cornucopia of new insights to be discovered, but these sources are often difficult to coalesce.

ClearStory Data, a startup based in Menlo Park, California, is tackling the problem of multi-source data analysis using what it calls data harmonization.

“Data harmonization is about identifying information and data relationships between the sources, automatically during that matching, to be able to drive a holistic visual insight,” said Sharmila Mulligan, CEO and founder of ClearStory Data.

The company offers three main value points:

  1. Harmonizing and blending the data
  2. Accelerating data discovery
  3. Increasing collaboration on the dashboard to identify new insights

The early days

Mulligan got her start in big data early, working at Cloudera when it had only a few employees. This is where she was exposed to the insights that companies were gleaning through tools such as Hadoop. After that, she worked alongside fellow ClearStory founder Vaibhav Nivargi at a company called Aster Data Systems.

At Aster Data, on the platform side, they saw that companies were funneling data in from existing internal repositories and private sources, as well as external data sources, using the new big data platforms as a cheaper data hub.

To analyze the data, they were taking pre-existing, off-the-shelf business intelligence (BI) tools and slapping them onto their big data platforms. “That’s where everything would grind to a halt,” Mulligan said.

The problem is that those companies would end up with what Mulligan called an “impedance mismatch” between the data in storage, usually from multiple sources and diverse in structure, and the visualization layer. Traditional BI is built for relational repositories, data of known size and origin, and things such as KPI-based reporting. Those architectures weren’t built to handle the diverse sources you get with big data.

Mulligan and Nivargi figured that they could build a better tool to unify and analyze those data assets, so they started ClearStory.

Creating harmony

Mulligan points out that almost everyone in big data is doing multi-source analysis now, but many are perpetuating what she referred to as “dumb blending.” That is where the user has to figure out the data relationships and determine how to make it all work together on their own.

This is where the concept of data blending comes through as a true value add. Just as data is, itself, an asset, data variety is an asset as well. Most of the companies that ClearStory works with have at least 6 data sources. Sometimes it's 9, 12, or even 14 separate data sources.

Quite a few problems come up when you are dealing with multiple data sources. Data sets can be in different formats, at different levels of granularity, and can include different specific data points.
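To make the granularity problem concrete, here is a minimal Python sketch. It is not ClearStory's implementation; the sample data and the function names `roll_up_to_month` and `join_on_month` are hypothetical. It assumes one source records sales per day and another records ad spend per month, so the daily rows must be rolled up before the two can be compared.

```python
from collections import defaultdict

# Hypothetical sample data: one source is daily, the other monthly.
daily_sales = [
    {"date": "2014-01-03", "units": 10},
    {"date": "2014-01-17", "units": 5},
    {"date": "2014-02-02", "units": 8},
]
monthly_ad_spend = {"2014-01": 1200.0, "2014-02": 900.0}


def roll_up_to_month(rows):
    """Sum daily unit counts into YYYY-MM buckets."""
    totals = defaultdict(int)
    for row in rows:
        month = row["date"][:7]  # "2014-01-03" -> "2014-01"
        totals[month] += row["units"]
    return dict(totals)


def join_on_month(sales_by_month, spend_by_month):
    """Keep only the months present in both sources."""
    shared = sorted(set(sales_by_month) & set(spend_by_month))
    return [
        {"month": m, "units": sales_by_month[m], "ad_spend": spend_by_month[m]}
        for m in shared
    ]


blended = join_on_month(roll_up_to_month(daily_sales), monthly_ad_spend)
```

Doing this by hand for every pair of sources is the kind of manual work the article describes as "dumb blending."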

When a user begins an analysis in ClearStory, they can drag and drop data sets into the tool and it will rank the potential data relationships from 0 to 10 to determine how many aspects they share or how many potential insights they contain. To speed its data processing, ClearStory uses the Apache Spark query engine.
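The article does not say how ClearStory computes its 0-to-10 ranking. As a rough illustration only, the sketch below assumes a simple stand-in: the score is the Jaccard overlap between the values of two columns, scaled to 10. The function names and sample data are hypothetical.

```python
def relationship_score(values_a, values_b):
    """Return a 0-10 score from the Jaccard overlap of two value collections."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    if not union:
        return 0
    return round(10 * len(set_a & set_b) / len(union))


def rank_relationships(dataset_a, dataset_b):
    """Score every column pair across two data sets, highest score first."""
    scored = [
        (relationship_score(col_a, col_b), name_a, name_b)
        for name_a, col_a in dataset_a.items()
        for name_b, col_b in dataset_b.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical data sets, each a mapping of column name to values.
sales = {"store_zip": ["94025", "94301", "10001"], "sku": ["A1", "B2", "C3"]}
weather = {"zip": ["94025", "94301", "60601"], "condition": ["sun", "rain", "sun"]}

ranking = rank_relationships(sales, weather)
```

Here the `store_zip` and `zip` columns share two of four distinct values, so that pair ranks first with a score of 5, while unrelated pairs score 0. A production system would presumably weigh many more signals than raw value overlap.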

Nivargi said that they start with data at the source, and glean the surrounding metadata as well. For example, if a Microsoft Excel file is brought in, they might look at the concepts of the data embedded in the file and build a metadata model around that.
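As a loose illustration of building a metadata model from a file's contents, the Python sketch below guesses whether each column holds time, numeric, or categorical data. It is an assumption-laden toy, not ClearStory's method: the names `infer_kind` and `build_metadata`, the single accepted date format, and the sample sheet are all hypothetical.

```python
from datetime import datetime


def infer_kind(values):
    """Guess whether a column of strings holds time, numeric, or categorical data."""

    def is_date(text):
        try:
            datetime.strptime(text, "%Y-%m-%d")  # assumes ISO-style dates only
            return True
        except ValueError:
            return False

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    if values and all(is_date(v) for v in values):
        return "time"
    if values and all(is_number(v) for v in values):
        return "numeric"
    return "categorical"


def build_metadata(table):
    """Map each column name to its inferred kind."""
    return {name: infer_kind(column) for name, column in table.items()}


# Hypothetical spreadsheet contents, read in as strings.
sheet = {
    "order_date": ["2014-03-01", "2014-03-02"],
    "revenue": ["199.99", "42"],
    "region": ["West", "East"],
}
metadata = build_metadata(sheet)
```

Re-running the inference as new rows arrive from the same source is one simple way such a model could grow more reliable over time.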

“As we see more and more of this data, as we periodically get more data from that source, our understanding of what the data represents gets stronger and stronger,” Nivargi said.

An analysis in ClearStory is called a data story to represent the idea that it is living and evolving. As the software understands the metadata, such as geographical data, time data, or categorical data, ClearStory can make recommendations to the user on other data points that may be relevant. For example, if a certain product has been selling well for no apparent reason, the tool will recommend looking at data on the time and geography of sales and the mentions of the product on social media.

In addition to data and metadata, Nivargi explained that all data sources have signals as well. This could be how your peers or competitors are using the data, which of the data sets have been used in the past, or if two parties are selling the same external data. All of these signals go into the brain of the tool to help determine the best way to blend the data.

Crafting collaboration

After the data has been processed and the analysis has been completed, the results are visualized on a dashboard where employees who have access can collaborate and discuss the findings.

“Collaboration is extremely critical because, traditionally, what’s happening in data analytics, or data science if you will, is it’s been an individual sport,” Nivargi said.

According to Nivargi, finding insights is easier when you involve people with different domain expertise. Before, the medium for discussion was emails and shared screenshots. With ClearStory, users can actively participate in the analysis with on-screen discussions and the ability to annotate visualizations. For example, if you happen to notice a disparity in a data visualization, you can mark that point on the graph or chart and add a comment with your theory on its origin.

To date, ClearStory has raised $31.5 million in venture capital funding. Investors include Kleiner Perkins Caufield & Byers, Andreessen Horowitz, Google Ventures, DAG Ventures, and Khosla Ventures.