Without good cleaning and archiving practices, data lakes can become dense, murky places. Here are some tips on preventing that.
The four Vs of big data are volume, variety, velocity and veracity of data. All are important factors that data architects are cognizant of as they develop big data management strategies.
But as troves of big data continue to grow exponentially in companies, they begin to devolve into stagnant and even toxic data lakes and repositories because so much data is thrown and mixed into these data retention ponds. In extreme cases where every new bit of big data is simply thrown into a data retention area with scant review, gaining visibility into this data and deriving value from it become nearly impossible. The "water" in these data lakes clouds, and data architects and developers see that it is getting harder to work with the data in agile ways.
Collectively, these pockets of polluted data lakes give rise to a fifth V that I believe it is time to add to big data: viscosity.
In common usage, viscosity describes the thickness of a liquid. For example, honey has a much higher viscosity than water.
The link to data lakes is easy to see: the lakes begin to deform as pollution builds up from poor data cleaning and archiving practices. The data gets muddy and "congeals" to the point where it can no longer be navigated.
Here are some steps data architects can take to clean up this data so it can be made usable again, and how CIOs can help them.
1. A business case must be built.
Cleaning up data, or finding ways to reclassify and rehabilitate it, is a background task that does not immediately tie to reductions in operating expenses or increases in revenues. Consequently, a project like this, which can take many hours of a highly paid employee's time, is not going to be popular with executives who don't necessarily understand or appreciate IT.
Nevertheless, CIOs must sell it.
The business benefits are:
- your time to market for business analytics will improve if your data is clean and agile
- well-stewarded data improves regulatory compliance and governance
- data security and safekeeping will improve because by straightening out the data, you can also review access permissions and data storage security guidelines
- cost savings could factor in if you define your data retention rules and discard useless data that contributes to in-house or cloud storage costs.
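To make the retention point concrete, here is a minimal sketch of how a retention rule might flag data for disposal and estimate the storage it would free. The `DataSet` record, the subject names and the retention periods are all hypothetical; a real catalog would supply its own metadata and policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DataSet:
    """Hypothetical catalog entry for one data set in a lake."""
    name: str
    subject: str
    last_accessed: date
    size_gb: float


# Assumed retention periods per subject area, in days (illustrative values only).
RETENTION_DAYS = {"sales": 730, "clickstream": 90}
DEFAULT_RETENTION_DAYS = 365


def expired(datasets, today):
    """Return the data sets not accessed within their retention period."""
    result = []
    for ds in datasets:
        days = RETENTION_DAYS.get(ds.subject, DEFAULT_RETENTION_DAYS)
        if today - ds.last_accessed > timedelta(days=days):
            result.append(ds)
    return result


def reclaimable_gb(datasets, today):
    """Total storage that discarding the expired data sets would free."""
    return sum(ds.size_gb for ds in expired(datasets, today))
```

A report built from `reclaimable_gb` gives the CIO a storage figure to put in the business case, while `expired` gives the data architect a review list rather than an automatic delete.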
2. Data architects should consider building a chain of lakes.
Separate data lakes are helpful when organized by subject area. For instance, there might be one data lake for sales and marketing. A second data lake might be utilized by manufacturing and engineering. A third might be for finance, and so on.
As needs emerge to aggregate data from these different data sources, separate "build" pools of data can be created by aggregating from these originating data lakes, but the integrity of the originating data lakes would be maintained.
The distributed data architecture could be implemented on a single server by setting up multiple databases and/or system partitions, or it could be spread across multiple servers. Either way, there is probably more processing overhead to keep data segregated in originating subject-area lakes, but this cost is repaid by the data agility and organization that you gain.
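The idea of a "build" pool can be sketched in a few lines. In this illustration the lakes are plain in-memory lists of records and the field names are invented; the point is only that the pool is built from copies, so the originating lakes are never modified.

```python
import copy

# Hypothetical originating lakes, one per subject area.
lakes = {
    "sales": [{"customer_id": 1, "order_total": 250.0}],
    "finance": [{"customer_id": 1, "invoice_paid": True}],
}


def build_pool(lakes, subjects):
    """Copy records from the selected lakes into a new pool.

    Each copied record is tagged with the lake it came from, so the
    aggregate can always be traced back to its origin.
    """
    pool = []
    for subject in subjects:
        for record in lakes[subject]:
            row = copy.deepcopy(record)  # never hand out the original record
            row["source_lake"] = subject
            pool.append(row)
    return pool
```

Because every row in the pool is a deep copy, analysts can reshape or discard the pool freely, and the `source_lake` tag keeps the lineage visible.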
3. Data shared by these data lakes must be normalized.
If there is data overlap, data architects must have ways to resolve issues like two different terms from two different systems describing the same piece of data, or data elements that contain different values.
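Both problems can be illustrated with a small sketch: map each system's field names onto one shared vocabulary, then compare the normalized records to surface values that disagree. The system names (`crm`, `erp`) and the field names are assumptions made up for the example.

```python
# Hypothetical mapping from each system's field names to a shared canonical name.
FIELD_ALIASES = {
    "crm": {"cust_no": "customer_id", "postcode": "postal_code"},
    "erp": {"customer_number": "customer_id", "zip": "postal_code"},
}


def normalize(system, record):
    """Rename a record's fields to their canonical names; unknown fields pass through."""
    aliases = FIELD_ALIASES[system]
    return {aliases.get(key, key): value for key, value in record.items()}


def find_conflicts(a, b):
    """Return the canonical fields that both records hold with different values."""
    return {key: (a[key], b[key]) for key in a.keys() & b.keys() if a[key] != b[key]}


crm_record = normalize("crm", {"cust_no": 7, "postcode": "10001"})
erp_record = normalize("erp", {"customer_number": 7, "zip": "10002"})
conflicts = find_conflicts(crm_record, erp_record)
```

Here `conflicts` contains only `postal_code`, because the two systems agree on the customer but not on the postal code. Deciding which value wins is a stewardship decision that the code deliberately leaves open.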
In the end, the goal is to have clean data that is well-organized and stewarded. When the data is organized, stewarded and easily aggregated with data from other clean data lakes for analytics queries that span multiple areas of subject matter, the applications using this data become more agile because you're not feeding them muddy data anymore.
Best of all, you've positioned your company to move forward in analytics, because the quality of your data is no longer holding you back.