How query speed can solve data silo issues

Amid the onslaught of big data, companies still struggle to aggregate their data in ways that support better queries.


The promise of big data has always been that it would make more information available to combine with what companies already knew from their transactions, with the end result being greater data insights and major business breakthroughs.

Unfortunately, the big data deluge has also produced myriad data lakes, and individual departments still work off their own data, so the data silo problem remains.


"This causes significant pain for organizations because the data is so distributed across the company that it can't be brought together so it can be queried," said Tomer Shiran, co-founder and CEO of Dremio, which provides a Data-as-a-Service platform.

Shiran cites the example of a large cruise ship company that wanted to achieve a 360-degree view of its customers.

"The company wanted to understand all of the attributes of its customers," said Shiran. "To get the total customer view, it had to gather all of its customer data across a diverse set of systems, whether it was reservations, or casino activity, or other transactional and big data repositories."

Five years ago, this likely would have been attempted in a Hadoop environment that could process large volumes of data, with the goal of ultimately loading that data into a central repository. This approach is still widely used in companies today.

Shiran and others argue that there is a better way to speed up data queries than waiting for such colossal data consolidations to finish.

"There are really two elements that are needed so companies can perform effective and speedy data queries," said Shiran. "The first requirement is that you have to be able to access and query your data regardless of where it resides. For example, you might need to run a query across data contained in both AWS S3 and in an Oracle database.

"The second requirement is that you want speed of data query. Taking time to consolidate all data into a central data repository by using techniques such as ETL can't provide this—nor can simultaneous access to a diversity of data marts and silos that are distributed across the company. What you need is a way to accelerate your data queries."

So how do you accelerate your data queries without having to perform lengthy data ETLs and data consolidations?

"A sound data query acceleration technique is used in Google Search," said Shiran. "When you ask Google a question, it goes out and accesses data from web servers that are all over the world."

The process is fast because both structured and unstructured data are accessed through inverted indexes. An inverted index stores a mapping from content, such as words or numbers, to the documents and web pages in which that content appears, so a query can go straight to the relevant sources.
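To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how Google or Dremio implement indexing; the function names and the sample documents are invented for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the documents for the first word, then narrow the set down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, invented for illustration.
documents = {
    "doc1": "cruise reservations for the summer season",
    "doc2": "casino activity on the cruise ship",
    "doc3": "summer casino promotions",
}

index = build_inverted_index(documents)
matches = search(index, "cruise casino")
print(matches)
```

The lookup never scans the document text at query time. It only intersects the precomputed sets of document ids, which is why this kind of structure keeps queries fast as the number of documents grows.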

Consequently, what you get from your Google query is an aggregation of information from top web sources, but not necessarily from every source on the web. This shortens your query time because you are accessing a predefined aggregation of data drawn from a subset of sources, rather than chugging through every single data source that could be tapped for your query.

"What you're doing is creating smaller subsets of data that we call 'data reflections,'" said Shiran. "This enables you to quickly process your query and get the results. The user can also set the interval at which he or she wants to see the data refreshed."

Companies like Dremio create the initial system data aggregations, but DBAs can modify these aggregations to fine-tune them to specific business needs.
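The general pattern of a precomputed subset with a user-set refresh interval can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Dremio's implementation: the `Reflection` class, its methods, and the sample rows are all hypothetical names made up for this example.

```python
import time


class Reflection:
    """A precomputed aggregate that is rebuilt only after refresh_seconds."""

    def __init__(self, load_rows, refresh_seconds, clock=time.monotonic):
        self._load_rows = load_rows          # callable returning (customer, amount) rows
        self._refresh_seconds = refresh_seconds
        self._clock = clock                  # injectable so the example can fake time
        self._totals = None
        self._built_at = None
        self.rebuild_count = 0

    def _rebuild(self):
        # Scan the slow source once and store per-customer totals.
        totals = {}
        for customer, amount in self._load_rows():
            totals[customer] = totals.get(customer, 0) + amount
        self._totals = totals
        self._built_at = self._clock()
        self.rebuild_count += 1

    def total_for(self, customer):
        # Rebuild only if nothing is stored yet or the interval has elapsed.
        stale = (
            self._built_at is None
            or self._clock() - self._built_at >= self._refresh_seconds
        )
        if stale:
            self._rebuild()
        return self._totals.get(customer, 0)


# Hypothetical source data and a fake clock, invented for illustration.
source_rows = [("alice", 120), ("bob", 40), ("alice", 30)]
now = [0.0]
reflection = Reflection(lambda: list(source_rows), refresh_seconds=60,
                        clock=lambda: now[0])

first = reflection.total_for("alice")    # builds the aggregate
source_rows.append(("alice", 50))        # the source changes
second = reflection.total_for("alice")   # served from the stored aggregate
now[0] = 61.0                            # the refresh interval elapses
third = reflection.total_for("alice")    # rebuilt from the source
print(first, second, third)
```

The trade-off is visible in the second lookup: it returns quickly from the stored aggregate but does not yet reflect the new row. The refresh interval is how the user chooses between speed and freshness.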

Shiran recommends that companies start small when they begin using data query accelerators, and then extend the accelerators to more use cases and business areas as users and IT gain familiarity.

Shiran cautions that no system can do the job alone: "For every application and the data that it processes, there is already a small number of subject matter experts in the company who understand the data, and how it can most effectively be used," he said. "These are the people who ultimately know the data patterns and what can be learned from them."


By Mary Shacklett

Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President o...