
Democratize big data by using distributed data lakes

When formulating your big data attack plan, consider putting preparation and analytics tasks into the hands of end users who use their own data lakes.

A democratization of the big data and analytics process can't come soon enough for many organizations.

This point was made clear during a talk last week with Michele Goetz, principal analyst for Forrester, and Ben Szekely, vice president and founding engineer for solutions and pre-sales at Cambridge Semantics, a provider of big data analytics tools for end users.

Goetz shared findings from Forrester survey research: 67% of companies can't access big data, 59% can't integrate it, and 56% say that the update process for big data is very slow. This isn't good news for businesses that have spent several years investing in big data and that are undoubtedly expecting more aggressive returns on those investments. Meanwhile, Forrester projects that, worldwide, the amount of data under management will soar tenfold, from 4.4 zettabytes to 44 zettabytes.

"There is a need to introduce intelligence into the process of interpreting all of this data so the consumers of this data can be empowered to use it through self-service data access," said Goetz.

Many companies are coming to this conclusion: 59% of the respondents in the Forrester survey said that they will either expand existing data preparation capabilities or implement new ones within the next 12 months so they can take better advantage of the data they are collecting.

However, big data preparation has been anything but fast. Research reveals that data scientists can spend 50% to 80% of their time collecting, cleaning, and preparing big data, which comes in all sizes and formats. If your plan is to democratize and distribute these already burdensome data preparation tasks, tools have to be built for the job.

Szekely discussed an attack plan for big data organized around data lakes distributed throughout the enterprise, with each data lake worked by a different end user department. There is merit to the idea. When IT staff or data scientists clean and prepare data, they do the job clinically, abiding by classic data normalization and cleansing rules. However, when business users with specialized expertise in sales, marketing, manufacturing, purchasing, finance, customer service, and HR get involved, they can not only check the data but also enrich it with business value that is based on their experience.
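To make the distinction concrete, here is a minimal Python sketch of the two kinds of work. It is an illustration only, not how Cambridge Semantics or any other product does it: the `Record` class, the `clean` and `enrich_for_sales` functions, the field names, and the renewal rule are all hypothetical. The first step is generic cleansing that IT could apply without business context; the second adds a value that only a department with domain knowledge would know to add, and each step is logged so the result stays traceable.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One row in a departmental data lake (hypothetical schema)."""
    values: dict
    lineage: list = field(default_factory=list)  # audit trail of applied steps


def clean(record: Record) -> Record:
    """Generic cleansing: normalize field names and trim stray whitespace."""
    cleaned = {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.values.items()
    }
    return Record(cleaned, record.lineage + ["it:clean"])


def enrich_for_sales(record: Record) -> Record:
    """Enrichment based on a sales team's own knowledge (invented rule)."""
    enriched = dict(record.values)
    # Assumption for illustration: accounts in these regions renew in Q4.
    in_q4_region = enriched.get("region") in {"EMEA", "APAC"}
    enriched["renewal_quarter"] = "Q4" if in_q4_region else "unknown"
    return Record(enriched, record.lineage + ["sales:enrich"])


raw = Record({" Region ": " EMEA ", "Account": "Acme"})
result = enrich_for_sales(clean(raw))
print(result.values)
print(result.lineage)
```

Neither function modifies its input, so the raw record stays available for other departments to clean and enrich in their own way.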

"What we want to do is to drive transparency into the process," said Szekely. "We want to turn tribal data knowledge into an entire data asset.... Companies can help to facilitate this by adopting a big data architecture where the big data sandboxes throughout the organization are turned into product zones."

Companies like Cambridge Semantics use graph-based data discovery and analytics to create big data preparation tools that end users without an IT background can put to use. "The goal is to create a self-service analytics approach for end users that enables these users to visualize and discover data and to contextualize it in the business on their own," said Szekely. "The data can also be prepared so it conforms to corporate governance requirements and so it is traceable."

If data preparation and analytics tasks can be placed into the hands of end users who use their own data lakes, this distributed process (and workload) could not only speed time to decision but also deliver an innovative means of managing the daily data deluge that most organizations face.

About

Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President o...
