Data lakes are an epic fail, but this open source project might change that

Data lakes are problematic because they centralize data but don't make it accessible. Open source project Dremio may have a solution.

Data lakes have become a big deal. It's too bad they don't work. Not most of the time, anyway.

While there is a variety of reasons for data lake failure, perhaps the biggest isn't really about the "lake" part but rather the "data." That is, most employees in any given enterprise aren't qualified to cull and cleanse their data, making it impossible for the people best-suited to use the data to actually make use of it.

Not that this is stopping enterprises from throwing money at data lakes.

Magic beans

In fact, interest in data lakes continues to grow. Ironically, enterprises seem to think they can just buy a data lake "solution" off the shelf, leading Gartner analyst Nick Heudecker to quip: "The lack of sophistication and rigor in data lake RFPs is staggering. Save time and just write, 'Please sell me magic beans' on an index card. Then go work on your resume, which you'll need in 6 months." A bit harsh? Perhaps. But it's also true that enterprises have been far too trusting of the hype that vendors keep shoveling around data lakes.

Data lakes promised to be the next generation of data warehouses, a central place to dump all of a company's data. Unlike the warehouse, however, data lakes allow companies to dump data into the lake without ordering it beforehand. The problem with this approach is that it simply delays the inevitable need to make sense of that data.

One of the reasons that no vendor can magically transform any particular enterprise with a data lake is that the messiness of a company's data evolves from organizational chaos, which software doesn't easily tame. In a research report, Gartner noted:

CDOs, often new to the role, are eager to claim ownership of data lakes because they're hyped as an innovation center and these new executives want to put their stamp on a cool new thing. The goals are typically vague, like "democratizing data access" or creating some single version of the truth by combining systems of record in the lake. These CDOs approach the challenge as a purely technical one, forgetting that data silos are a reflection of the organization that created them.

Data analysis tools assume the data sits in what amounts to a unified database. This simply isn't the case.

There is, however, an even more fundamental problem. Namely, no matter what process created that data and stuck it in different silos, accessing the data remains a task that only a data engineer can do effectively. Not a data scientist. Not a business analyst. No, it's the data engineer, and they're in short supply, with perhaps 100 data scientists or business analysts in a given company for every data engineer.

Good luck getting time on her calendar.

Data for the 99%

If Dremio has its way, you won't have to. In a conversation with Dremio CMO Kelly Stirman, he acknowledged that the "data lake has been a sinking ship for some time." Why? Most data lake projects never fulfill their promise because a data lake is a deeply technical product that only code can unlock, and that code is written by the 1%, as it were: the data engineer. The Dremio open source project seeks to solve this.

By making it easy to build Data-as-a-Service, Dremio hopes to make infrastructure and data accessible to and usable by a much broader range of users than data lakes can. Amazon Web Services (AWS), GitHub, and other tools all have ensured you don't have to worry about the underlying infrastructure to get work done. Data lacks this kind of "as-a-Service" approach today. Dremio seeks to resolve this by offering a catalogue, a bit similar to Google's index, that allows someone to search for relevant data sets.

Dremio also gives the enterprise tools to organize data using a visual interface without having to resort to a command line. Also, while that data sits in silos throughout that supposedly organized "lake," Dremio virtualizes access to all of an enterprise's different data sources. This makes it appear that the data resides in the same place, like tables in a relational database, which is how BI tools think about data.
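
To make the "tables in a relational database" idea concrete, here is a minimal sketch in Python. It is not Dremio's API or implementation: the two "silos" (a CSV export and a JSON feed), the table names, and the function names are all hypothetical, and the sketch copies the data into an in-memory SQLite database, whereas a virtualization layer would query the sources in place. It only illustrates what the end user sees: differently shaped sources that can be joined with one ordinary SQL query.

```python
import csv
import io
import json
import sqlite3

# Hypothetical silo 1: a CSV export from a CRM system.
CRM_CSV = "customer_id,name\n1,Acme\n2,Globex\n"

# Hypothetical silo 2: JSON records from an order service.
ORDERS_JSON = (
    '[{"customer_id": 1, "total": 120.0},'
    ' {"customer_id": 1, "total": 30.0},'
    ' {"customer_id": 2, "total": 75.5}]'
)


def build_unified_view() -> sqlite3.Connection:
    """Expose both silos as plain tables in one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")

    # Parse the CSV silo and load it as rows.
    reader = csv.DictReader(io.StringIO(CRM_CSV))
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(int(row["customer_id"]), row["name"]) for row in reader],
    )

    # Parse the JSON silo; named placeholders map onto the dict keys.
    conn.executemany(
        "INSERT INTO orders VALUES (:customer_id, :total)",
        json.loads(ORDERS_JSON),
    )
    return conn


def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """Join across the two former silos as if they were one database."""
    query = """
        SELECT c.name, SUM(o.total)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, total in revenue_by_customer(build_unified_view()):
        print(name, total)
```

The analyst writing the final query never touches CSV parsing or JSON decoding. That is the point of the approach described above: the plumbing sits behind something that looks like a relational table.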

In other words, that data lake investment just might pay off, but only if we use tooling like Dremio to liberate data for non-data engineers.

Image: iStockphoto/ConceptCafe

About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.
