A data lake is a set of unstructured information that you assemble for analysis. Deciding which information to put in the lake, how to store it, and what to make of it are the hard parts.

The concept of a data lake is perhaps the most challenging aspect of information management to understand. A data lake can be thought of not as something you buy, but as something you do. “Data lake” sounds like a noun, but it works like a verb. This guide is an entry-level summary of data lakes.
James Dixon, chief technology officer of Hitachi-owned Pentaho, is credited with coining the term data lake in 2008. Dixon said he was looking for a way to explain unstructured data.
Data mart and data warehouse were existing terms; the former is generally defined as a department-level concept where information is actually used, and the latter is more of a storage concept. He began to think about metaphors with water: thirsty people get bottles from a mart, the mart gets cases from a warehouse, and the warehouse obtains and bottles it from the wild source — the lake.
Data lakes matter because the dark side of big data is that someone’s got to analyze it. Consider some modern data sources: the mess on each of your users’ PC hard drives, social networking, the Internet of Things, mobile devices, rogue networks, and who-knows-what data in the Indiana Jones vault that you call tape backups.
Lakes (by any other name) have always existed, Enterprise Strategy Group analyst Nik Rouda explained. Accessing your lake used to mean spending a lot of money. Normally, the more the data grew, the more you’d have to spend.
Something funny happened on the way to the future: IT departments now have ready access to inexpensive mass storage, such as through commodity hardware or on a cloud, along with the open-source Hadoop file system, which scales in ways that previous unstructured data arrangements couldn’t.
Dixon cited a customer that used ad-hoc data lakes, Hadoop, and data analysis services to uncover hacking in financial markets. He said another customer used this approach in determining when to clean barnacles off ships, thereby saving money on fuel because of less drag in the ocean. Not all cases are so sexy. Typically, data lake analysis can be used to instruct information management software on how to slim down your company’s storage costs and uncover unknown or lost intelligence.
Nik Rouda said the most common mistake in data lake projects is that companies don’t have the right people to manage them. Database administrators may not understand how to apply their knowledge to unstructured information, while storage managers typically focus on nuts and bolts. The people most affected by a data lake are probably those who hold the purse strings, because a company will need to budget for hiring analytic experts or outsourcing that job to a professional services organization.
Data lakes are becoming a mature concept. Federal intelligence agencies are using data lakes to hunt crooks, fraudsters, and terrorists. Companies are following suit and beginning to use data lakes for critical projects, not just for science experiments.
An evolving factor is security. Players in the data lake niche are starting to realize that security is vital because making a lake means pulling data away from its normal home and often entrusting it to outside vendors.
Once you’ve identified your sources of unstructured data, you need to put the data someplace. That can be what storage managers call JBOD (“just a bunch of disks”), a RAID array, or a SAN if you’ve got the space and budget to spare. It can also be on a cloud; Amazon Web Services and Microsoft Azure are common choices. Next, pick a file system: the Hadoop Distributed File System (HDFS), part of Apache Hadoop, is the overwhelming choice.
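To make the "put it someplace" step concrete, here is a minimal sketch in Python of landing one raw file in a lake without changing it. Everything in it is an assumption for illustration: the `ingest` function, the `raw/<source>/<date>/` folder layout, and the `manifest.jsonl` catalog file are hypothetical names, and a local directory stands in for HDFS or cloud storage.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def ingest(src: Path, lake_root: Path, source_name: str, day: date) -> dict:
    """Copy one raw file into the lake unchanged and log it in a manifest."""
    # Hypothetical layout: partition by source system and arrival date.
    target_dir = lake_root / "raw" / source_name / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copy2(src, target)  # store the bytes as-is; no schema is imposed

    entry = {
        "path": target.relative_to(lake_root).as_posix(),
        "source": source_name,
        "ingested": day.isoformat(),
        "bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
    }
    # Append-only manifest so analysts can later find what the lake holds.
    with (lake_root / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        sample = tmp_path / "sensor.log"
        sample.write_text("hull_drag=0.42\n", encoding="utf-8")
        print(ingest(sample, tmp_path / "lake", "ship_sensors", date(2016, 1, 1)))
```

The point of the sketch is that the file goes in untouched and only a small catalog entry is written alongside it; deciding how to interpret the contents is deferred until someone analyzes them.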
The hardest part is figuring out what to actually do with your lake. Professional service providers such as Accenture, Capgemini, and Deloitte could all be of assistance. Service wings of IT companies such as EMC (soon to be Dell), HP Enterprise, and IBM are also in the mix. Pentaho and other smaller companies can lend a hand. The unicorn, Dixon joked, would be finding an affordable expert to bring on your full-time staff.
Evan became a technology reporter during the dot-com boom of the late 1990s. He published a book, "Abacus to smartphone: The evolution of mobile and portable computers," in 2015 and is executive director of Vintage Computer Federation, a 501(c)(3) non-profit organization. His vices include running and Springsteen.