Hadoop and cloud computing: Collision course or happy symbiosis?

According to Forrester, two of the industry's hottest trends -- cloud computing and Hadoop -- may not work well together. This theory, however, doesn't seem to be supported by the facts.

Two of the biggest trends in technology might be on a collision course. According to Forrester, Hadoop — often considered the heart of big data — is not a natural fit for the cloud, where we increasingly want to run our apps. But with more data generated in the cloud and stored in Hadoop, it's more likely that the "collision course" envisaged by Forrester is actually a happy symbiosis.

Hadoop: An earthbound misfit?

Hadoop is one of the hottest trends in technology, according to Indeed.com job trends, among other sources. So is cloud computing, with more enterprises turning to the cloud to accelerate innovation, as a RightScale user survey uncovers (Figure A):

Figure A: RightScale user survey.

The two megatrends, however, may not mesh. At least, not according to Forrester analyst Richard Fichera, who notes that the very nature of cloud computing militates against it being a welcome home for Hadoop clusters.

To support his argument, Fichera offers three reasons why Hadoop belongs in an enterprise data center rather than in a cloud computing environment:

  • Heavy and increasing workloads favor on-premises Hadoop. Hadoop clusters tend to be heavily utilized, with capacity being added as resources get scarce, rather than being massively overprovisioned. In other words, whether slow and steady or fast and steady, Hadoop clusters get fed data in a mostly predictable fashion, without the peaks and valleys that normally lend themselves to an elastic cloud deployment.
  • Cloud storage is both slower and more expensive for data sets that just keep growing. Cloud storage may have "unacceptably long access times," and cost comparisons don't indicate it's inherently cheaper anyway. In addition, "Hadoop tends to collect 10 times or more data than legacy transactional environments do, plus data scientists and their customer-focused business stakeholders will almost never want to discard Hadoop data, and the access requirements are unpredictable — all of which favors on-premises storage."
  • Data sources and locality make a big difference for performance. While running Hadoop clusters in the cloud may make sense where the data itself is generated in the cloud (e.g., analysis of Twitter), "for real-time customer-facing systems with data coming from multiple venues, Operations will likely need to build Hadoop out in a physical facility with the right (deterministic bandwidth and latency) network interconnects to minimize the end-to-end latency of the application."
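
Fichera's first point is ultimately arithmetic about utilization, and a rough sketch in Python can make it concrete. Everything here is an assumption for illustration: the per-node-hour prices, the demand figures, and the function names are hypothetical and do not come from Forrester or any vendor's pricing. The sketch compares owning enough nodes to cover peak demand with renting exactly what each hour requires.

```python
# Hypothetical cost units per node-hour; assumed values, not real pricing.
ON_PREM_RATE = 1.0   # owned hardware, paid for whether busy or idle
CLOUD_RATE = 2.0     # rented capacity, assumed pricier per node-hour


def fixed_cost(demand, rate):
    """Cost of provisioning for peak demand during every hour."""
    return max(demand) * len(demand) * rate


def elastic_cost(demand, rate):
    """Cost of paying only for the nodes actually needed each hour."""
    return sum(demand) * rate


def utilization(demand):
    """Fraction of peak-sized capacity that is actually used."""
    return sum(demand) / (max(demand) * len(demand))


def cheaper_option(demand):
    """Return which model costs less under the assumed rates."""
    on_prem = fixed_cost(demand, ON_PREM_RATE)
    cloud = elastic_cost(demand, CLOUD_RATE)
    return "on-premises" if on_prem < cloud else "cloud"


# Nodes needed per hour (made-up workloads).
steady = [90, 95, 100, 100]   # heavily utilized, predictable growth
bursty = [10, 10, 100, 10]    # mostly idle with one sharp peak

for name, demand in (("steady", steady), ("bursty", bursty)):
    print(name, round(utilization(demand), 2), cheaper_option(demand))
```

Under these assumed rates, the steady workload favors fixed capacity and the bursty one favors elastic capacity. Change the rates or the demand curve and the answer can flip, which is why the utilization claim matters so much to the argument.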

Cloudy Hadoop for cloudy data

Maybe. Maybe not. After all, it's the very "data gravity" argument that Forrester seems to downplay that's most likely to lead to more Hadoop-in-the-cloud deployments. It's early to be making long-term projections as to where data will sit, as Mike Olson, co-founder and Chief Strategy Officer at Hadoop vendor Cloudera, told me over Gtalk:

"Hadoop gets installed where the data already is. Cloud deployment makes sense when you already have a bunch of data in S3 buckets. There just hasn't been enough history for an analysis of long-term trends."

Even so, it's unwise to imagine that Hadoop will remain bound to the data center. Marten Mickos, CEO of hybrid cloud vendor Eucalyptus, told me over email that it's more likely we'll see Hadoop going everywhere:

"What people often forget is that we will have data EVERYWHERE. Data exerts gravity. But when data is everywhere, so will Hadoop workloads be. Don't be surprised if we start seeing Hadoop workloads on wireless base stations, in vehicles, or in other edges of the IT infrastructure."

Because of this multi-headed data beast, it's unlikely that Hadoop workloads will remain entrenched in the data center. Nor is it likely that every Hadoop cluster will run in the cloud.

It's closer to the truth that Hadoop's future lies both in the data center and in the cloud, something Shaun Connolly, vice president of Strategy at Hortonworks, a leading Hadoop vendor, told me over Skype:

"I believe there will be multiple centers of data gravity, one of which is on-premises. But I am convinced Hadoop in the cloud plays a significant role in the broader architecture as the Hadoop market continues to mature.

"Moreover, for a certain portion of data, the economics of cloud storage will be compelling for older, historical data that you still want accessible for historical reporting. Cloud storage can play a role that tape has historically played, but with significantly better accessibility. This is why [having] Linux and Windows [available] both on-premises and cloud (a la Azure, Amazon, Rackspace, etc) is so important."

The only loser in this split between data centers and the public cloud, according to Mickos, is "dedicated bare-metal provisioning."

ThoughtWorks lead consultant Hemanth Yamijala gives six other reasons to believe that Hadoop is a natural fit for cloud environments:

  • Lowering the cost of innovation
  • Procuring large scale resources quickly
  • Handling batch workloads efficiently
  • Handling variable resource requirements
  • Running closer to the data
  • Simplifying Hadoop operations

His second point is particularly instructive as a counter to Forrester's argument. It may make more sense on paper to throw internal hardware at a Hadoop problem, but the reality of most IT departments is very different. It's easier to say "I need 50 additional servers" than it is to actually procure them, given internal politics or procurement policies.

For these and other reasons, the theory of Hadoop in the data center is far rosier than its reality. Whether Hadoop is a perfect fit for cloud infrastructure is a very different question than whether Hadoop adoption patterns tend to favor the cloud.

This revolution may not run in your data center

All of which is reason to believe that while Forrester may have nailed its theory of Hadoop deployments, it seems to have missed the reality of where enterprise data will increasingly live and how easily IT will be able to provision hardware to meet the growing Hadoop demand. As more data moves to the cloud, enterprises will have more reason and need to run Hadoop there, too.

But there's more.

As RedMonk analyst James Governor points out, the missing but essential component in Forrester's calculus is convenience:

"Hadoop is sophisticated technology, which requires skill and experience to deploy, configure, scale and manage. The enterprise choice seems to be work with an existing supplier to integrate Hadoop into its existing systems, or try something that will radically change and improve how it works today. The cloud is where that difference will be realised."

Convenience trumps most every other consideration, including the rational if unrealistic reasons offered by Forrester.

What are your thoughts about the union of Hadoop and cloud computing? Let us know in the discussion thread below.

About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.
