Hadoop complexity is part of the master plan, says Cloudera exec

Hadoop will never be a discrete project that is easily understood, and that's part of its charm.


One of Hadoop's biggest problems is that no one knows what the heck it means anymore. Hadoop used to mean something pretty specific: "an open-source software framework... for distributed storage and distributed processing of very large data sets." This basically meant Hadoop Distributed File System (HDFS) and MapReduce.

Over time, however, "Hadoop" has gobbled up all sorts of projects and subprojects that were nonexistent when Doug Cutting first developed Hadoop. This, in turn, has created all sorts of complexity.

This is a good thing, as Cloudera's vice president of Products, Charles Zedlewski, told me in an interview.

Hadoop was never meant to be a static platform, according to Zedlewski, but rather a living platform that incorporates a variety of modular components. This willingness to embrace rival components, he concludes, makes it a better big data platform because "the substitution of one component in the stack does not obviate the overall platform."

TechRepublic: Hadoop has become the poster child for big data, but as it's grown in adoption, it has also grown in complexity, scaring off some newbies. What's the right way to think about Hadoop?

Zedlewski: Apache Hadoop is best understood as a collection of affiliated, interoperable open-source components that are combined together to make up a complete big data platform. Components (and their corresponding projects) are regularly added to the ecosystem. In many situations, there are several components that perform essentially the same function.

Change has always been a part of the Apache Hadoop ecosystem. The rate of change stems from the fact that the Apache Hadoop ecosystem embraces modularity and continuous competition for membership in the overall platform. This steady rate of change comes with its challenges, but the benefits far outweigh the drawbacks.

These benefits are shared among customers, vendors, and the community alike.

Nor is Apache Hadoop unique in embracing modularity and continuous competition. Many other open-source platform technologies take the same approach.

TechRepublic: Like what? What are some good examples?

Zedlewski: Both Linux and OpenMPI provide good examples.

Red Hat Enterprise Linux, SuSE Linux, and Ubuntu Linux are each composed of more than 100 open-source projects. [Note: As a basis of comparison, Cloudera's Hadoop platform currently includes approximately 25.] Over the years, Linux distributions have regularly added components that extend their functionality (web servers, developer libraries, browsers, databases, graphical user interfaces, etc.) and introduced components that compete with preexisting ones (e.g., Red Hat ships multiple competing filesystems and schedulers in its distribution).

Similarly, the OpenMPI ecosystem supports an array of competing schedulers and filesystems in its architecture. Increasingly, the container ecosystem seems to be heading down the same modular path: there are already several competing container orchestrators, schedulers, and operating systems.

This model stands in contrast to other open-source projects that take more of an all-in-one approach.

The functionality of open-source databases such as Postgres, MySQL, and MongoDB is largely housed within a single open-source project. The same can be said for browsers like Chrome and Firefox and CMS tools like Drupal and Alfresco.

Development frameworks such as Spring, Java, and Ruby on Rails typically have modules—but in those particular cases, they are all housed under the same umbrella project, and there aren't different open-source communities developing competing implementations, so they are effectively all-in-one projects as well.

Both open-source models can clearly work, but over time the open-source ecosystems that embrace modularity and continuous competition have seen some of the greatest success.

TechRepublic: Why do you think that's the case?

Zedlewski: There are at least three reasons. First, the modular approach is more future-proof.

For users, platform technologies are a multi-decade bet, so there is a big concern about premature obsolescence. All-in-one projects get off to fast starts because of their greater usability and coherence, but they are often quickly supplanted by an alternative project that improves upon the original idea.

Consider developer frameworks, which can fall quickly into and out of fashion, as an example. Similarly, browser usage share has changed rapidly over the course of just five years.

Modular open-source communities tend to be better incorporators of good new ideas, because the substitution of one component in the stack does not obviate the overall platform. There have been many open-source projects in the Apache Hadoop ecosystem that were obvious improvements over prior-generation technologies, with Apache Spark improving upon MapReduce and Apache Parquet improving upon RCFile as just two recent examples.

The Apache Hadoop ecosystem has seen these as enhancements rather than threats.

The second reason for the success of Hadoop's modular approach is that it's more functionally competitive.

Over time, all systems expand in functionality. But it's a common pattern for new capabilities of a popular system to fall short of what's best in class.

In all-in-one technologies, new functionality gets adopted because it's tightly conjoined with a popular technology, not because it's great functionality in and of itself. Over time, more and more features fall short of best in class, and the system as a whole begins to atrophy.

The core Apache Hadoop project itself used to include multiple contribs and sub-projects of highly variable quality, whose primary virtue was that they were packaged with core Hadoop. They were eventually stripped out or spun off as independent projects; several have thrived, while others have been allowed to fail.

Lastly, Hadoop's modular model is more attractive for contributors. It's easier to build strong communities of developers when they can form tight, coherent teams that have the freedom to make their own technical decisions and form their own norms.

Not that everything is rosy with modular platforms. There are drawbacks to a modular ecosystem in a state of continuous competition.

One drawback can be complexity; the overall platform feels less like a coherent single-author effort. The other drawback can be confusion; customers, press, and analysts find it challenging to discuss a technology whose boundaries are constantly changing.

But these drawbacks can clearly be mitigated by vendors like Cloudera and our various competitors. We would far prefer to iteratively simplify and improve the usability of a highly capable and competitive platform than have to turn around a platform that missed an important turn in the road.

About Matt Asay

Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.
