“Over the past few years, organizations have set up multiple data lakes throughout the enterprise. One of the reasons is that it’s easy to set up a data lake quickly,” said Ken Tsai, head of cloud platform and data management for SAP.
As a result, data lakes have proliferated, user department by user department, often with very little data curation or integration.
In many cases, the result has been a series of “standing pools” of data that are growing murky with an unknown number of data types that can’t readily integrate with each other to produce insights. “We call this phenomenon ‘data dissonance’ because the data can’t be brought into a harmonic and compatible state without preparing it so it can work with other types of data,” said Tsai.
The objective, then, is to clean up the data lake. Unfortunately, cleaning up a polluted data lake is no trivial task.
One reason is that companies, in their initial haste to create data lakes, end up throwing all kinds of raw data into these lakes without working out governance, security, or the zoning of data into classifications like raw, transient and trusted.
“There are other problem areas as well,” said Tsai. “Since these data elements are both structured and unstructured and are thrown into data lakes in their raw states, without any initial work being done to ensure that they could be adequately described by metadata and integrated into a relational or other database that would enable flexible combinations of the data with other data, data dissonance occurs. As this lake of dissonant data builds, companies eventually lose the use of the data because all of the disconnects between various pieces of data no longer enable the full data lake to function well.”
Traceability of the data is also missing. Traceability depends on metadata that describes the characteristics of the data, such as when it was last accessed and by whom.
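To make the idea of access metadata concrete, here is a minimal sketch in Python of a catalog entry that records who touched a dataset and when. The `DatasetRecord` class, its field names, and the example dataset are hypothetical illustrations, not part of any SAP product or specific catalog tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry describing one dataset in the lake."""
    name: str
    zone: str    # assumed zones: "raw", "transient", or "trusted"
    source: str  # where the data came from
    access_log: List[Tuple[datetime, str]] = field(default_factory=list)

    def record_access(self, user: str, when: Optional[datetime] = None) -> None:
        """Append who accessed the dataset and when (defaults to now, UTC)."""
        self.access_log.append((when or datetime.now(timezone.utc), user))

    def last_access(self) -> Optional[Tuple[datetime, str]]:
        """Return the most recent (timestamp, user) pair, or None if never accessed."""
        if not self.access_log:
            return None
        return max(self.access_log, key=lambda entry: entry[0])


# Example: log two accesses, then ask who touched the data last.
record = DatasetRecord(name="card_transactions", zone="raw", source="pos_feed")
record.record_access("alice", datetime(2024, 1, 5, tzinfo=timezone.utc))
record.record_access("bob", datetime(2024, 3, 9, tzinfo=timezone.utc))
print(record.last_access())
```

Even a record this small answers the two questions raised above: when the data was last accessed, and by whom. A real catalog would likely also capture schema, ownership and lineage.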
Finally, there is the issue of data retention as these data lakes continue to expand.
“Data retention contributes to data lake pollution because many organizations fail to address it in a comprehensive way,” said Tsai.
Instead, there is an inclination to just store all of the data forever–in case an auditor or a company trends analyst ever wants to access it.
Tsai recommends instead that companies take an active role in data retention.
“The big question is, are you retaining data just to retain it for some purpose, or are you retaining it so it can remain in a query-able mode?” said Tsai. “As an example, if you are a financial institution and you want to perform fraud detection, you are going to want to be able to go back in time with your data and to trace what could be an initial fraudulent pattern that is emerging to when it first happened or even to who the initial user was. If your data lake is full of data dissonance, you’re going to lose track of this data lineage.”
This brings us to the data architect or the database administrator, who must make heads or tails of this data and construct an architecture in which all types of data, structured and unstructured, can work together.
Inevitably, he or she finds data disconnects and data dissonance because the data hasn’t been cleaned, classified, organized or adequately described with metadata. In short, the data lake is polluted, possibly to the point where the data is no longer trusted and the company is limited in the value it can derive from it. The lake stands by itself as a stagnant, toxic pool where no one can see the bottom.
The good news for companies is that they don’t have to be caretakers of toxic data lakes.
Here are four steps that companies can take to avoid or to reverse data lake pollution:
1. If you haven’t deployed data lakes yet, look for an integrator and/or data lake tools company that can assist you in the process.
“A project like tapping data from IoT machines and putting this data into an enterprise system isn’t trivial,” said Tsai. “How will you construct your data pipelines and integration? Early in the process, it is very important to find the right set of tools to do the job.”
2. Make a plan for data retention
How are you planning to retain your data? Will some of this data strictly be archived off to storage because your legal and/or compliance department requires it? Will you also have older data that must remain active for purposes of historical trend queries or system pattern usage? Is there some data that you can simply throw away after a certain time point?
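The questions above map onto three outcomes for any piece of data: keep it queryable, archive it, or delete it. A minimal sketch in Python of such a retention rule follows. The data classes, the day counts in `POLICY`, and the function name `retention_action` are all assumptions made for illustration; real retention periods come from your legal and compliance requirements.

```python
from datetime import date
from enum import Enum


class Action(Enum):
    KEEP_ACTIVE = "keep queryable"
    ARCHIVE = "move to archive storage"
    DELETE = "delete"


# Hypothetical policy: (days kept queryable, days kept in total) per data class.
# A total of None means the data is never deleted.
POLICY = {
    "transactions": (365 * 2, 365 * 7),
    "sensor_readings": (90, 365),
    "legal_hold": (0, None),
}


def retention_action(data_class: str, created: date, today: date) -> Action:
    """Decide what to do with data of the given class and creation date."""
    active_days, total_days = POLICY[data_class]
    age = (today - created).days
    if age < active_days:
        return Action.KEEP_ACTIVE  # still needed for trend or pattern queries
    if total_days is None or age < total_days:
        return Action.ARCHIVE      # retained for legal or compliance reasons
    return Action.DELETE           # past every retention requirement


# Example: a sensor reading that is roughly seven months old.
print(retention_action("sensor_readings", date(2023, 6, 1), date(2024, 1, 1)))
```

Keeping the policy in one table, rather than scattering it across jobs, makes it easier for legal, compliance and data teams to review and change it together.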
3. Think about how data lakes will be orchestrated
Do you have a company-wide architecture that governs and coordinates data lake creation and synchronization with all other data lakes? How are you ensuring that the data in these lakes is both integrated and consistent?
4. Consider your ultimate goal with the data
In most cases, companies want to be able to share both data and results, so data can be leveraged for decision making throughout the organization. “If a company’s present set of data lakes can’t enable this, through appropriate tools, architecting and cleanup efforts, data needs to be brought to this level so that it is integratable and trusted,” said Tsai.
The data lakes market is expected to grow to $8.81 billion by 2021, at a compound annual growth rate of 28.3%. Clearly, companies are making the move into data lakes as part of their overall big data and analytics strategies. They can stay ahead of the curve by viewing these data lakes like the clear-water lakes that they visit on summer holidays. To remain functional, lakes must be renewable resources, and they can only be renewable and teeming with abundance when proper stewardship is exercised.