Intel and Cloudera: Why we're better together for Hadoop

Cloudera's CEO and Intel's GM of datacentre software explain what Intel's $740m investment in Cloudera means for the future of the big-data analytics platform.

Intel has backed some notable companies over the years, including Red Hat and VMware, two firms that helped effect major shifts in the IT industry.

The chipmaker is hoping Cloudera will generate similar momentum in the field of big-data analytics, and in doing so open new avenues for growth in a stagnant enterprise IT market. To this end Intel has invested $740m in Cloudera, giving it an 18 percent stake in the company.

Cloudera builds and supports tools to run on top of Apache Hadoop, the open-source software framework that allows data to be processed by clusters of commodity hardware for data warehousing and big-data analytics.

Cloudera's distribution of Hadoop (CDH) and its subscription offering, Cloudera Enterprise, include various integrated tools to help businesses store and analyse data in Hadoop clusters, offering improved security and availability. Cloudera provides software to support real-time SQL and search-engine queries, machine learning, security, and stream and batch data processing, as well as to manage Hadoop clusters.

The firm is one of several competing to offer the Hadoop distribution of choice for businesses. Each of the companies behind major Hadoop distributions - Hortonworks, IBM, MapR and Pivotal - provides different tools to manage, secure and exploit data stored on Hadoop clusters. But usage figures indicate that Cloudera's distribution is the most popular.

Intel had released its own distribution of Hadoop but this will now be withdrawn. Intel engineers will instead work on Cloudera's distro, which will be enhanced with features from Intel's platform.

While analysts estimate that Cloudera's paying user base may be tiny at present - about 350-strong and growing at about 50 new customers per quarter - Intel said it is buying into future potential.

"It's not really a technology play but it really is about overall business value. If you look at Intel's datacentre business over the past few years, the cloud service provider segment, the telecommunications and even the high-performance computing segments have all grown quite handsomely. But the enterprise segment has been a little bit stagnant," Boyd Davis, general manager of Intel's datacentre software division, said.

"What you see with big data is a different phenomenon occurring. It's injecting more investment into the IT world because there's such huge business value that gets derived from it, and that's the way I expect to see dramatic growth in our business."

But why did Intel decide against exploiting that growth with its own Hadoop distribution and instead choose to back Cloudera?

Davis said Intel wanted to boost Cloudera's already strong standing in the Hadoop market and reassure companies unsure which distribution to deploy that Cloudera will be a good long-term investment.

"The Hadoop ecosystem is still relatively nascent, when you compare it with the $100bn data-management market, and it's really important for us to take the risk out for customers," he said.

"Enterprises like to know this is the right path, so they don't have to sit on the sidelines and wait to see how the market plays out. That was important for us as well because we want to see this market grow."

Intel is now Cloudera's largest strategic investor, a category Cloudera defines as investors where there is "alignment between corporate initiatives".

The $740m investment by Intel was preceded by a cash injection of $160m into Cloudera by a variety of firms, including Google's investment arm. About 60 percent of the combined $900m investment will end up in Cloudera's pockets, according to Cloudera CEO Tom Reilly, as some of the money will go to existing investors in Cloudera.
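As a rough check on those figures, the sketch below assumes Reilly's "about 60 percent" applies to the full combined sum and treats it as exactly 60 percent. The variable names are illustrative, and amounts are in millions of US dollars.

```python
# Funding figures reported in the article, in millions of US dollars.
intel_investment = 740
earlier_round = 160

combined = intel_investment + earlier_round

# Assumption: "about 60 percent" is treated as exactly 60 for this estimate.
share_to_cloudera_pct = 60
to_cloudera = combined * share_to_cloudera_pct // 100
to_existing_investors = combined - to_cloudera

print(f"Combined investment: ${combined}m")
print(f"Approximate amount reaching Cloudera: ${to_cloudera}m")
print(f"Approximate amount to existing investors: ${to_existing_investors}m")
```

On these assumptions, roughly $540m of the $900m reaches Cloudera itself, with the remaining $360m or so going to existing investors.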

"We've raised more than half a billion dollars that goes into Cloudera," Reilly said.

Initially, Cloudera will use the funding to help organisations move from Intel's Hadoop distribution to its own.

"We're hiring up engineers on our side to interface and integrate with Intel's engineering team, so we have the staff to support the partnership on the technical side of things," Reilly said.

"We're going to be transitioning all Intel customers to our new distribution, which combines the best of our distributions."

Reilly sees the partnership with Intel as a springboard to accelerate Cloudera's ambitions for global expansion.

"Intel has a tremendous presence in China and India. The next thing we're going to do is to staff up and build up resources in those geographies to support the customers and continue to grow those big markets."

Both firms plan to increase their contributions to open-source projects related to Hadoop, with Reilly expressing interest in projects focused on in-memory processing, such as Apache Spark, and security.

Finally, Cloudera will also use the money to acquire companies, "to accelerate our growth", according to Reilly.

The company still plans to go public but Reilly said it is not "setting an expectation" as to when an IPO might occur.

Competing architectures

Unsurprisingly, Intel's investment will result in engineers from both companies focusing on optimising Cloudera's toolset, as well as the core open-source Hadoop platform, to run on Intel's 64-bit x86 chip architecture.

"Hadoop will continue to work on all platforms, but the optimisations will occur on Intel sooner and faster," Reilly said.

"Intel has 94 percent market share in the datacentre. We believe the Intel platform is going to outperform other platforms."

The stance is something of a departure from last year, when a Cloudera co-founder publicly praised low-power ARM chips for being more efficient than competing silicon from other companies.

In a discussion about ARM-based processors at the time, Cloudera co-founder and CTO Amr Awadallah was reported as saying: "Cores from other vendors - without saying their name - consume significantly more power in the idle state, hence we're relieved that ARM is moving into this space."

Technical benefits

Intel and Cloudera have a "multi-year roadmap" of features in Intel hardware that will be exploited by Cloudera's distribution of Hadoop, and Intel's Davis said the first fruits of this collaboration are likely to be revealed in the near future.

"A really good example of one of the areas where we are collaborating that will show up in Cloudera products very soon is around hardware-accelerated security," he said.

"In our own distribution we took advantage of instructions in the Xeon chip that accelerate encryption, so that customers could encrypt the data in a Hadoop environment without necessarily having the performance overhead of many of the solutions out there.

"We had that intimate knowledge of the instructions that could accelerate the security algorithms. We built that into our distribution and are actively working to get that into Cloudera's product as quickly as we can."

When Intel launched its own Hadoop distribution last year, it promised that extensions to instruction sets in its chips would boost performance in various ways: improving data encryption speed via AES-NI and compression using AVX and SSE 4.2.

Various optimisations from Intel's Hadoop distribution will begin to be incorporated into CDH, following the release of version 3.1 of the Intel distro, the final outing for the platform.

Reilly said the firms' engineering collaboration and the absorption of Intel's distribution into Cloudera's platform will yield enhancements to Cloudera's offering "not just five years from now but in the coming months".

Davis expects the bulk of the collaboration between the companies will be on improvements to the open-source, core Hadoop platform, but added they will also work to improve Cloudera's proprietary tools on top of Hadoop.

"It's one of our fundamental objectives to maintain an open ecosystem, and Intel's going to continue to do engineering work and contribute to the open-source community," Davis said.

"We'll also continue to innovate in some of the areas around Hadoop that are not open source, on things like the management and data governance that are around Hadoop but not in the core platform. A lot of people have unique technologies there, and we will work with Cloudera on those."

On rare occasions there may also be other considerations that prevent their combined engineering teams from open-sourcing technologies, he said.

"There are certain cases where open source has some downsides. Security is an example. I don't have a specific example but sometimes you want to do something to take advantage of security capabilities in the chip that if you were to make open source would actually open up security holes. But the vast majority of the innovations that we drive are going to end up in open source."

Reilly said there was a natural crossover between the capabilities of the Hadoop platform to handle large volumes of data and Intel's investment in the internet of things, which is expected to fuel an explosion in data collection and analytics.