
Data lakes vs. data streams: Know the difference to save on storage costs

Even if you can store all of your big data, it doesn't mean you should. Get expert advice on how to determine the best data selection strategy for incoming data streams.


The sheer volume of big data is forcing most organizations to focus on their storage costs. This contrasts sharply with what many organizations were saying just months ago: that they should "store everything" in case there's a need for legal information mandated by eDiscovery, or for a long-term trend investigation that reaches back into organizational, market, or industry data collected over several decades.

Ben Parker, Principal Technologist of Product Strategy at the big data analytics solutions company Guavus, says this quandary comes down to knowing the difference between "data at rest" and "data streams." Data at rest lives in "data lakes" that are ultimately written to disk. Data streams consist of real-time data in motion: the data is collected in transit, analyzed, and then reduced in mass by a front-end engine that ferrets out the important data for processing and commitment to storage, and discards the rest.

Parker talks about the vastness of big data, which makes it virtually cost-prohibitive to store entirely on disk. "If you can deploy a means of reducing the size of this data by 100 or 1,000 to one in a data stream evaluation that precedes committing the data to disk, you can reduce your storage and arrive at a cleaner set of data for analytics," said Parker. "In this way, you are shifting your data to compute and not storage resources as you evaluate it for its inherent usefulness."
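To make the idea of evaluating a stream before committing it to disk concrete, here is a minimal Python sketch. It is not Guavus's method or any vendor's API. The event fields (`type`, `status`), the `is_relevant` rule, and the sample events are all hypothetical, chosen only to illustrate the pattern: keep the records worth storing, and retain just a small tally of what was discarded.

```python
from collections import Counter


def reduce_stream(events, is_relevant, discarded):
    """Yield only the events worth storing; tally the rest by type.

    `events` can be any iterable, so the filter works on data in motion
    without first loading the whole stream into memory.
    """
    for event in events:
        if is_relevant(event):
            yield event
        else:
            # Keep a tiny summary instead of the full discarded record.
            discarded[event["type"]] += 1


def is_relevant(event):
    # Hypothetical business rule: keep purchases and server errors only.
    return event["type"] == "purchase" or event.get("status", 200) >= 500


# Hypothetical sample of incoming events (a real stream would be unbounded).
incoming = [
    {"type": "page_view", "status": 200},
    {"type": "page_view", "status": 200},
    {"type": "purchase", "status": 200, "amount": 42.0},
    {"type": "page_view", "status": 503},
    {"type": "heartbeat"},
    {"type": "page_view", "status": 200},
]

discarded = Counter()
stored = list(reduce_stream(incoming, is_relevant, discarded))

# Reduction ratio: events seen per event committed to storage.
ratio = len(incoming) / len(stored)
print(f"stored {len(stored)} of {len(incoming)} events ({ratio:.0f} to 1)")
print("discarded by type:", dict(discarded))
```

In this toy sample the reduction is small. The 100-to-1 or 1,000-to-1 ratios Parker describes would depend entirely on how selective the real relevance rule is and on the makeup of the actual stream.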

Parker's point about using "up-front" data evaluation forensics as streams of data enter the enterprise is well taken, since compute is cheaper than storage. If you look at web traffic alone, much of what big data brings in is header-level data that rapidly loses value after several hours. Even if storage were an unlimited resource and an organization decided to store all incoming data, the data "pile" would only end up being relocated internally, where someone would still have to painfully navigate through it to weed out the truly important gems of information.

"Organizations are starting to understand the relevance of being able to define how they want to use big data in practice, and which big data should be at rest in storage-based 'data lakes' for analytics and which is best analyzed in real-time 'data streams,' where irrelevant data is taken out so that a more refined big data source for analytics is produced before it is written to disk," said Parker.

"For instance, the tier-one companies now have a pretty strong vision of what they want to do with their big data, and are using business intelligence analysts or data scientists to assist them with their big data analytics. It's the smaller and midsized organizations that are still exploring the value of big data, because they recognize that no one has the funding to perform random data analytics on unlimited data. You have to be able to show an expected return on investment (ROI) from your big data analytics efforts."

Part of this ROI hinges on the dollars you save, and economizing on storage is central to that.

"The value of differentiating between data lake and data stream processing for big data is that it steers you away from the potential pitfall of storing everything just because you can," said Parker.

It's a familiar theme that enterprises and small and medium-size businesses have long grappled with, and every CIO who has battled with line-of-business staff to define data retention policies knows the challenge. Nevertheless, the stakes of failing to decide which data to store are far greater with big data, because there is so much more of it.

This is why the companies that correctly figure out their data selection strategies for incoming data streams, in addition to applying data retention policies, will be in the best spot to "choose their data" and prove out their ROIs.


About Mary Shacklett

Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President o...
