Microsoft is only a few weeks away from making its cloud-based big data analytics Azure Data Lake generally available.

Azure Data Lake, described by Microsoft as a “hyperscale repository for big data analytic workloads in the cloud”, has three main components, Azure Data Lake Analytics, Azure Data Lake Store and U-SQL.

Data is held in the Hadoop Distributed File System-compatible Azure Data Lake Store and processed using Azure Data Lake Analytics, an Apache YARN-based service that lets users query data using U-SQL, which combines aspects of C# and SQL and is designed to be relatively easy to use.

Joseph Sirosh, corporate VP for Microsoft’s data group, said the service, available in public beta, would be made generally available “in a small number of weeks”.

Azure Data Lake is suited to complex batch analytics of large amounts of different structured and unstructured data, with Sirosh saying examples include bulk speech recognition, captioning images and deriving meaning from text.

Sirosh sees the service as potentially being a wholesale replacement for running Hadoop in-house, and says one of Azure Data Lake’s advantages is its relative ease of use.

SEE: Job description: Big data modeler (Tech Pro Research)

“Azure Data Lake is one of our fastest growing data services now,” he said.

“That is because it’s extremely simple to approach for a lot of people who are already familiar with SQL-like languages.

“Whereas the traditional Hadoop systems required developers with a lot of coding experience, with Azure Data Lake you can put unstructured and structured data in a massive data lake store and with Azure Data Lake Analytics, you can operate on it with SQL-like constructs.”

Despite cloud services typically costing more than in-house infrastructure for predictable, long-term jobs, Sirosh believes the pay per query pricing model of Azure Data Lake should make it attractive.

“With a Hadoop cluster you’re buying the cluster and managing it. Most people ignore the cost of the infrastructure management and the people required to operate that,” he said, adding that Hadoop clusters were also prone to running for long periods when they weren’t being utilised.

“The human investment to manage these systems, especially at scale, is a significant cost.”

Azure Data Lake is compatible with Azure HDInsight, Microsoft’s service for deploying Hadoop, Spark, R, HBase and Storm clusters in the cloud.

In September, Microsoft confirmed the technological underpinnings of the Data Lake service are based on Microsoft’s internal Cosmos big-data storage and analytics service.

Cosmos is Microsoft’s massively parallel storage and computation service that handles data from Azure, Bing, AdCenter, MSN, Skype and Windows Live. Queries on this data can run on anywhere from one to 40,000 machines in parallel.

Also see