Data lakes: The smart person's guide

A data lake is a set of unstructured information that you assemble for analysis. Deciding which information to put in the lake, how to store it, and what to make of it are the hard parts.

The concept of a data lake is perhaps the most challenging aspect of information management to understand. A data lake is best thought of not as something you buy, but as something you do. "Data lake" sounds like a noun, but it works like a verb. This guide is an entry-level summary of data lakes.

Executive summary

  • What it is: A data lake is a set of unstructured information that you assemble for analysis.
  • Why it matters: Analyzing structured information (that which fits neatly into a database's rows, columns, and tables) is relatively straightforward; analyzing unstructured information is hard. Data lakes, most commonly built on the open-source Apache Hadoop framework and its file system, aim to make that process simple and affordable. Thus, your business can unlock and exploit previously disorganized information.
  • Who this affects: At first glance, a company might assign data lake projects to a database administrator or a storage manager, though the best practice is to hire experienced Hadoop experts. Hadoop is not required; you could use other file systems, but that is the exception, not the norm.
  • When this is happening: Now. Data lakes are becoming a mature concept with service offerings from companies that are household names.
  • How to get it: There are four parts of a data lake: unstructured data sources, storage where the information resides, the file system, and people/tools to analyze it. You'll need all four parts to turn your lake into cleanly bottled water.

What is a data lake?

James Dixon, chief technology officer of Hitachi-owned Pentaho, is credited with coining the term data lake in 2008. Dixon said he was looking for a way to explain unstructured data.

Data mart and data warehouse were existing terms; the former is generally defined as a department-level concept where information is actually used, and the latter is more of a storage concept. He began to think about metaphors with water: thirsty people get bottles from a mart, the mart gets cases from a warehouse, and the warehouse obtains and bottles it from the wild source — the lake.

Why do data lakes matter?

Data lakes matter because the dark side of big data is that someone's got to analyze it. Consider some modern data sources: the mess on each of your users' PC hard drives, social networking, the Internet of Things, mobile devices, rogue networks, and who-knows-what data in the Indiana Jones vault that you call tape backups.

Lakes (by any other name) have always existed, Enterprise Strategy Group analyst Nik Rouda explained. Accessing your lake used to mean spending a lot of money. Normally, the more the data grew, the more you'd have to spend.

Something funny happened on the way to the future: IT departments now have ready access to inexpensive mass storage, such as through commodity hardware or on a cloud, along with the open-source Hadoop file system, which scales in ways that previous unstructured data arrangements couldn't.

Dixon cited a customer that used ad-hoc data lakes, Hadoop, and data analysis services to uncover hacking in financial markets. He said another customer used this approach in determining when to clean barnacles off ships, thereby saving money on fuel because of less drag in the ocean. Not all cases are so sexy. Typically, data lake analysis can be used to instruct information management software on how to slim down your company's storage costs and uncover unknown or lost intelligence.

Who does this affect?

Rouda said the most common mistake in data lake projects is that companies don't have the right people to manage them. Database administrators may not understand how to apply their knowledge to unstructured information, while storage managers typically focus on nuts and bolts. The people most affected by a data lake are probably those who hold the purse strings, because a company will need to budget for hiring analytic experts or outsourcing that job to a professional services organization.

When is this happening?

Data lakes are becoming a mature concept. Federal intelligence agencies are using data lakes to hunt crooks, fraudsters, and terrorists. Companies are following suit and beginning to use data lakes for critical projects, not just for science experiments.

An evolving factor is security. Players in the data lake niche are starting to realize that security is vital because making a lake means pulling data away from its normal home and often entrusting it to outside vendors.

How do I get it?

Once you've identified your sources of unstructured data, you need to put the data someplace. That can be what storage managers call JBOD ("just a bunch of disks") or a RAID setup, or it can be on a SAN if you've got the space and budget to spare. It can also be on a cloud. Amazon Web Services and Microsoft Azure are common choices. Next, pick a file system: Apache Hadoop, with its Hadoop Distributed File System (HDFS), is the overwhelming choice.
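To make the storage-then-analysis idea concrete, here is a minimal Python sketch, not a production design. It assumes a local temporary directory standing in for Hadoop or cloud storage, and the source names, file names, sample contents, and helper functions (land_raw, count_files_by_source, find_mentions, build_demo_lake) are all hypothetical. Raw items are stored unchanged, grouped only by where they came from, and structure is applied later, at analysis time.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged under <lake_root>/<source>/<name>."""
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)
    return target


def count_files_by_source(lake_root: Path) -> Counter:
    """Count how many files each source directory has contributed."""
    counts = Counter()
    for path in lake_root.rglob("*"):
        if path.is_file():
            counts[path.parent.name] += 1
    return counts


def find_mentions(lake_root: Path, term: str) -> list[str]:
    """Return lake-relative paths of files whose text mentions the term."""
    hits = []
    for path in sorted(lake_root.rglob("*")):
        if path.is_file():
            # Decode leniently: the lake makes no promise about file formats.
            text = path.read_bytes().decode("utf-8", errors="ignore")
            if term.lower() in text.lower():
                hits.append(path.relative_to(lake_root).as_posix())
    return hits


def build_demo_lake(lake_root: Path) -> None:
    """Land three made-up items from three hypothetical sources."""
    land_raw(lake_root, "email", "msg-001.txt",
             b"Hull inspection shows barnacle growth.")
    sensor = {"ship": 42, "note": "drag rising, barnacle check due"}
    land_raw(lake_root, "sensors", "ship-42.json",
             json.dumps(sensor).encode("utf-8"))
    land_raw(lake_root, "social", "post-7.txt", b"Great day at sea!")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_lake(root)
        print(count_files_by_source(root))
        print(find_mentions(root, "barnacle"))
```

The point of the sketch is the ordering: nothing is parsed or cleaned on the way in, so the same stored files can answer questions nobody had thought of when the data was collected.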

The hardest part is figuring out what to actually do with your lake. Professional service providers such as Accenture, Capgemini, and Deloitte could all be of assistance. Service wings of IT companies such as EMC (soon to be Dell), Hewlett Packard Enterprise, and IBM are also in the mix. Pentaho and other smaller companies can lend a hand. The unicorn, Dixon joked, would be finding an affordable expert to bring on your full-time staff.

About

Evan Koblentz began covering enterprise IT during the dot-com boom times of the late 1990s. He recently published a book, "Abacus to smartphone: The evolution of mobile and portable computers".
