How to supercharge your data lakes

Learn how to improve the performance of your organization's data lakes and analytics.

Image: iStock/iSergey

Streamlio CEO Karthik Ramasamy asked in March 2019 whether it was time to drain the data lakes. In his DATAVERSITY post, Ramasamy wrote that the problems with data lakes include process complexity, sluggishness in obtaining data, and demands on IT talent that pull staff away from other important projects. All of these factors combine to turn many data lakes into "data swamps"--disorganized stores of information that companies fail to mine for insights.

While articles like Ramasamy's aren't enough to dissuade organizations from using data lakes in analytics, they do raise key issues that organizations continue to face as they strive to get the most out of their data lakes and analytics. 

SEE: 60 ways to get the most value from your big data initiatives (free PDF) (TechRepublic)

Companies want data lakes that contain fresh data, require less money and fewer resources to build, deliver analytics and business insights to market faster, and enable everyone--not just data scientists--to query and obtain value from the data. All of these goals are still works in progress for most organizations.

"The work involved in creating a data lake can be complex and time- and resource-intensive," said Tomer Shiran, CEO and founder of Dremio, which provides a data lake engine solution. "Often IT must create data cubes and data warehouses for data that is extracted for the purpose of creating data lake repositories. This process can consist of multiple steps and can become highly complex because of that. Along the way, there are also potential data governance problems."

The problem is exacerbated because semi-structured or unstructured data must be maintained and refreshed in these data lakes.

Shiran sees part of the solution in building data lakes of both structured and unstructured data directly on cloud storage services such as Amazon S3 and Microsoft Azure.

"The cloud is scalable, and it allows you to increase or decrease your compute and your server clusters as needed, which tamps down costs," said Shiran.

This is an architectural concept that companies like Dremio rely on. These companies furnish connectors to different clouds and query engines that enable organizations to go directly to the cloud for their data lakes--without the need to create separate data cubes and data warehouses. 

So, how does this work? Software that ships with a complete set of connectors to commercial cloud platforms, databases, data warehouses, and common data sources and tools--such as Snowflake, Salesforce, and standard SQL--lets organizations bypass the tedium of developing those interfaces themselves, along with their own data cubes and data warehouses. Instead, organizations can go natively to the cloud, let the software do the work, and deliver data query services faster.
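The idea of a uniform connector layer can be sketched in a few lines of Python. This is an illustrative pattern only--the class names, the canned result, and the bucket name are hypothetical, not any vendor's actual API: each source implements one common `query` interface, and a registry routes SQL to the right source so applications never deal with source-specific plumbing.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Uniform query interface over heterogeneous data sources (hypothetical)."""

    @abstractmethod
    def query(self, sql: str) -> list:
        ...


class S3ParquetConnector(Connector):
    """Stand-in for a connector that queries files in cloud object storage."""

    def __init__(self, bucket: str):
        self.bucket = bucket

    def query(self, sql: str) -> list:
        # A real engine would push the SQL down to files in the bucket;
        # here we return a canned result just to illustrate the shape.
        return [{"source": f"s3://{self.bucket}", "sql": sql}]


class ConnectorRegistry:
    """Routes queries to registered sources by name."""

    def __init__(self):
        self._connectors = {}

    def register(self, name: str, connector: Connector) -> None:
        self._connectors[name] = connector

    def query(self, name: str, sql: str) -> list:
        return self._connectors[name].query(sql)


registry = ConnectorRegistry()
registry.register("lake", S3ParquetConnector("analytics-lake"))
result = registry.query("lake", "SELECT count(*) FROM events")
print(result[0]["source"])  # the registry resolved the source for us
```

The point of the pattern is that adding a new source--say, a Snowflake or Salesforce connector--means writing one class, not reworking every query path.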

"In essence, you have a tool bag of pre-developed multiple connectors into databases, query tools, and clouds such as AWS and Azure that enable you to take advantage of the cloud's scalable costs and resources, and that can also conserve your own IT resources and budget because you don't have to perform all of the intermediate setup costs for queries and data lake connections yourself," said Shiran.

These toolsets can also optimize memory placement: built-in predictive intelligence tracks which data is accessed most often and keeps that data in the fastest memory tier, which speeds retrieval and shortens the time to business insights.
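The frequency-based tiering described above can be sketched as a small cache that promotes hot datasets to a fast tier. This is a minimal illustration of the concept, not how any particular engine implements it; the dataset names and capacity are made up:

```python
from collections import Counter


class TieredCache:
    """Keeps the most frequently accessed keys in a fast in-memory tier.

    Illustrative sketch: `backing` stands in for slow storage (e.g. cloud
    objects) and `fast` for a local RAM/NVMe tier of limited capacity.
    """

    def __init__(self, backing: dict, fast_capacity: int = 2):
        self.backing = backing
        self.fast = {}
        self.hits = Counter()
        self.capacity = fast_capacity

    def get(self, key):
        self.hits[key] += 1
        if key in self.fast:
            return self.fast[key]          # fast-tier hit
        value = self.backing[key]          # slow-tier fetch
        self._maybe_promote(key, value)
        return value

    def _maybe_promote(self, key, value):
        # Promote if there is room, or if this key is now hotter
        # than the coldest key currently in the fast tier.
        if len(self.fast) < self.capacity:
            self.fast[key] = value
            return
        coldest = min(self.fast, key=lambda k: self.hits[k])
        if self.hits[key] > self.hits[coldest]:
            del self.fast[coldest]
            self.fast[key] = value


cache = TieredCache(
    {"sales": "sales-data", "logs": "log-data", "hr": "hr-data"},
    fast_capacity=1,
)
for _ in range(3):
    cache.get("sales")   # "sales" becomes the hot dataset
cache.get("logs")        # accessed once; not hot enough to displace "sales"
```

After this sequence, "sales" occupies the fast tier and "logs" is still served from the slow tier--the same behavior, at toy scale, as keeping frequently queried data in rapid memory.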

"The other element we add is semantic," said Shiran. "In other words, we create user interfaces that make it easy for everyday users who want to run data queries to do these queries easily—without the need to ask a data scientist for help."

Can approaches like this assist organizations in optimizing their data lakes? The potential is there, as long as organizations also do these two things.

  1. Assess existing data lakes for effectiveness: This could involve determining which data lakes are working and which are stagnant. For data lakes that are stagnant or nearing the point of no return on investment, decisions should be made as to whether to renovate them or to simply sunset them and start over.
  2. Evaluate your cloud and in-house data architecture: Connector and data lake optimization tools are only as effective as your ability to understand your data lake and query needs and how they link to your onsite and cloud-based data. Once you understand how data must be linked and where it resides, you can seek out connector tools that help to eliminate the manual work.
