Yahoo! may not have the same cachet today as Google, Facebook, and Twitter, but it has something none of them do: bragging rights to the world's largest Hadoop cluster. How big? Well, according to the Apache Hadoop website, Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes. All told, Yahoo! stores 455 petabytes of data in Hadoop.
That's big: approximately four times the size of Facebook's beefiest Hadoop cluster.
For a big data geek, it's like dying and going to heaven. Or, in the case of Peter Cnudde (@pcnudde), one of Hadoop's rock stars and now Yahoo!'s vice president of Engineering, it's a serious reason to return to Yahoo! after years away.
I talked with Cnudde this week to better understand the future of Hadoop at Yahoo! and the traditional enterprise. Below are some excerpts from our interview.
TechRepublic: Given the widespread interest in Hadoop and big data and the difficulty of finding quality talent, why return to Yahoo!?
Cnudde: The job I have here is the best job in the world. After all, we still run the largest multi-tenant Hadoop installation in the world, with a very broad set of use cases. We run over 850,000 Hadoop jobs every day. So that is interesting.
In fact, we've always been at the forefront of Hadoop. We originated it. We were the first to run YARN (next-generation MapReduce) at scale. We're pushing the cutting edge with Storm for real-time distributed data processing. We're also doing really interesting work on the machine learning side of things.
It's the combination of scale and a variety of workloads that makes Yahoo!'s Hadoop engineering incredibly interesting. Most users of Hadoop are nowhere near the scale that we're at.
TechRepublic: As you noted, no one else runs Hadoop at the scale you do. So, what is it about Hadoop that should make it interesting to mainstream enterprises?
Cnudde: Large enterprises have a lot of data but, just as important, that data is siloed. Hadoop gives organizations the ability to share data. This is important. Hadoop enables companies to bring all their data together. In a large organization, you can actually combine all of that data.
Some use "data lake" as a marketing term, but the marketing isn't important. The importance lies in that ability to keep your data in one place. You can then use YARN to run a whole range of jobs against the data. Some of those jobs require massive MapReduce and a lot of servers. For example, Yahoo! has 32,000 nodes within 16 clusters running YARN.
But you don't have to think about the overall scale to be productive. YARN allows a new employee to get started immediately, working with, for example, a 100-node Spark cluster within that larger YARN deployment. The flexibility that YARN gives is pretty important to us.
While web companies have always been very well instrumented in the sense that we mine data on page views, clickstreams, etc., sensors and the Internet of Things (IoT) will mean that data will become core to most businesses, if not all. These non-web companies can learn from our example that it's possible to build large-scale, multi-tenant systems on which all engineers in a company can work together in a secure way.
TechRepublic: Are there obvious limits to Hadoop? Or is it the "operating system" that will power all data-related applications going forward?
Cnudde: To a large extent, this is all a question of nomenclature. Is HBase part of Hadoop or not? What about Pig? Hive? These are all components of the larger Hadoop ecosystem, yet can also be thought of as distinct systems.
The open-source Apache model has been very successful in big data. For example, we did much of the early work with HDFS but have done relatively little with HBase, yet we use it extensively now. We are both contributors to Hadoop and beneficiaries of others' contributions.
The ecosystem around Hadoop will continue to evolve and take on new capabilities.
TechRepublic: So, given Hadoop's flexibility, and its constant evolution beyond HDFS, will Hadoop obviate the need for traditional enterprise data warehouses and other legacy data infrastructure?
Cnudde: This depends on the applications and constraints that might exist within an enterprise, as well as on the scale.
For web companies like Yahoo!, Hadoop is a core part of how we manage data. Things like click logs live in Hadoop. But we also use non-Hadoop systems for some of our analytics. It's a centerpiece, but it won't replace everything.
We do on occasion copy data. For example, we move email into Hadoop systems so that we can analyze huge volumes of email for anti-spam purposes. But we don't use Hadoop to serve our email.
Another example is Flickr photos. All photos are in Hadoop so we can run image recognition processes, but the main source of truth for photo serving is not in Hadoop.
So, we should expect to see Hadoop and its ecosystem continue to grow and take on new roles even as other systems fill important roles.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.