SHARE

Apache Hadoop: A cheat sheet

Hadoop is a popular open-source distributed storage and processing framework. This primer about the framework covers commercial solutions, Hadoop on the public cloud, and why it matters for business.

Written By

James Sanders

Jul 11, 2017

We may earn from vendors via affiliate links or sponsorships. This might affect product placement on our site, but not the content of our reviews. See our Terms of Use for details.

As organizations amass ever-expanding pools of data, the task of processing this data requires substantial consideration. Hadoop, a big data storage and processing framework, was originally based on two technical white papers from Google, and has grown to be the industry standard solution for the most data-intensive organizations, including Adobe and Twitter.

TechRepublic’s cheat sheet to Hadoop is a quick introduction to the popular open-source distributed storage and processing framework. This resource will be updated periodically when there are new developments to the Hadoop ecosystem.

SEE: All of TechRepublic’s cheat sheets and smart person’s guides

Executive summary
What is Hadoop?
Why does Hadoop matter?
Who does Hadoop affect?
When is Hadoop available?
How do I get Hadoop?

Executive summary

What is Hadoop? Hadoop is an open-source framework that is designed for distributed storage and big data processing.
Why does Hadoop matter? For one-time task deployments, as well as for use cases with continuous input, Hadoop can make quick work of your data.
Who does Hadoop affect? Organizations that handle large quantities of data turn to Hadoop as their first choice for efficient storage and processing.
When is Hadoop available? The first version was released in April 2006. Hadoop 2.8.0 is the current stable version. Version 3.0.0-alpha4 was released on July 7, 2017, and version 3.0.0 is expected to reach general availability in October 2017.
How do I get Hadoop? For organizations continuously amassing more data, building your own Hadoop deployment is advisable, though public cloud providers do offer Hadoop services.

SEE: Online course: Introduction to Hadoop (TechRepublic Academy)

What is Hadoop?

Hadoop is an open-source framework developed by the Apache Software Foundation that is designed for distributed storage and big data processing using the MapReduce programming model. Hadoop operates using computer clusters, splitting files into blocks and distributing across nodes in a given cluster. Using Hadoop, MapReduce jobs can be delegated to the particular node where the relevant data is stored, allowing for faster parallel processing of data using simple programming models.

Hadoop is particularly extensible, allowing external services to interact with a Hadoop development. The core Hadoop project includes MapReduce, Hadoop Distributed File System (HDFS), YARN (a framework for scheduling and resource management), and Common (the shared set of utilities that support the use of Hadoop modules).

Other Hadoop-related projects include:

Cassandra, a scalable database with no single point of failure;
HBase, a distributed big data store that supports very large tables;
Spark, a fast general compute engine for data in Hadoop;
Pig, a high-level parallel computation framework for big data;
Hive, a data warehouse system that provides data summarization and ad hoc querying;
Mahout, a machine learning and data mining system; and
Ambari, a web-based management and provisioning tool for Hadoop clusters, which includes support for some non-core plugins.

Additional resources

Hadoop creator Doug Cutting on the near-future tech that will unlock big data (ZDNet)
The secret ingredients for making Hadoop an enterprise tool (TechRepublic)
The meteoric rise of Spark and the evolution of Hadoop (Tech Pro Research)
Cloudera’s new data science tool aims to boost big data and machine learning for businesses (TechRepublic)

Why does Hadoop matter?

Between user-generated data, and the logging of user activities and the necessary task of generating metrics based on those logs, many organizations routinely generate absurdly large amounts of data. Deploying a Hadoop cluster is a more efficient means of handling data storage and manipulation than traditional storage and analytics methods. The modular nature of Hadoop provides flexibility in system design, as Spark begins to supplant MapReduce in popularity.

SEE: Ebook–How to build a successful data scientist career (TechRepublic)

Hadoop can also be quite beneficial for organizations with archival data that requires analysis and/or modification. TimesMachine, a service from The New York Times that allows subscribers to read historical issues of the venerable newspaper, was built using Hadoop. By using Amazon EC2, Hadoop, and custom code, 405,000 large TIFF images, 405,000 XML files, and 3.3 million SGML were converted to 810,000 PNG images and 405,000 JavaScript files in less than 36 hours.

Additional resources

Hadoop engine benchmark: How Spark, Impala, Hive, and Presto compare (TechRepublic)
Artificial intelligence on Hadoop: Does it make sense? (ZDNet)
Devs are from Mars, Ops are from Venus: Can analytics bridge the gap? (ZDNet)

Who does Hadoop affect?

Organizations that handle large quantities of data typically turn to Hadoop as their first choice for efficient storage and processing. Perhaps foremost among these is Facebook, which announced in 2012 that its largest cluster was over 100 PB, and growing by over half a PB per day, on which over 60,000 Hive queries per day are performed.

Must-read big data coverage

Yahoo, a long-standing contributor to Hadoop, reports having “100,000 CPUs in >40,000 computers running Hadoop” which are used for supporting research for advertisements and web search. Twitter, another contributor, uses Hadoop “to store and process tweets, log files, and many other types of data.” Rakuten, the Japanese ecommerce giant, uses Hadoop for log analysis for its recommendation system.

Music aggregator Last.fm (TechRepublic and Last.fm are CBS Interactive brands) has a 100-node Hadoop cluster used for charts calculation, royalty reporting, log analysis, A/B testing, and dataset merging, as well as analyzing audio features in millions of music tracks.

Additional resources

The Very Big Data & Apache Hadoop Training Bundle (TechRepublic Academy)
Pig for Wrangling Big Data (TechRepublic Academy)
Why developers are so divided on the tech they love and the tech they hate (TechRepublic)
6 questions every business must ask about big data architecture (TechRepublic)
How to optimize Hadoop performance by getting a handle on processing demands (TechRepublic)
Linux Foundation offers Hadoop training (ZDNet)
Open source big data and DevOps tools: A fast path to analytics applications (Tech Pro Research)

When is Hadoop available?

The first public version of Hadoop, version 0.1.0, was released in April 2006. The next month, Yahoo deployed a 300-machine cluster, increasing to two clusters of 1,000 machines in April 2007. Yahoo moved its search index to Hadoop in February 2008, using a 10,000-core cluster.

The first Hadoop summit was hosted in March 2008 in Sunnyvale, CA. US Hadoop summits have been hosted annually in June in San Jose, CA. Starting in 2014, the European Hadoop summit has been held annually in April.

The commercial Hadoop vendor Cloudera was founded in October 2008. The competitor MapR was founded in July 2009. In June 2011, Hortonworks was founded when 24 engineers at Yahoo left to form their own company.

Hadoop 2.8.0, the current stable version, was released on March 22, 2017. Version 3.0.0-alpha4 was released on July 7, 2017. Version 3.0.0 is expected to reach general availability in October 2017.

Additional resources

Has the Hadoop market turned a corner? (ZDNet)
Big data booming, fueled by Hadoop and NoSQL adoption (TechRepublic)
How to avoid big data project failures: Your 5-step guide (TechRepublic)
Some Hadoop vendors don’t understand who their biggest competitor really is (TechRepublic)
Hadoop vendors are listening: Hortonworks gets pragmatic (ZDNet)

How do I get Hadoop?

Generally, Hadoop is designed for deployments on clusters of hardware in data centers for organizations that have an ongoing need to process and store data continuously. As an open-source project, Hadoop is available freely from the Apache Foundation. Various organizations also provide customized versions of Hadoop with product support, including Hortonworks, Cloudera, and MapR.

For fixed sets of data that require processing (such as the aforementioned New York Times example), Hadoop is available from public cloud providers. Amazon Elastic MapReduce is a customized version of Hadoop that automates file transfer between EC2 and S3, as well as offers support for Hive. Naturally, standard Apache Hadoop can itself be run directly from EC2 and S3. Microsoft Azure HDInsight is a customized Hortonworks HDP deployment. On Google Cloud, Dataproc is a customized Spark and Hadoop service, with support for Hortonworks, Cloudera, and MapR available using bdutil.

Additional resources

Cloudera launches new PaaS offering to boost big data analytics in the cloud (TechRepublic)
Azure HDInsight click-by-click guide: Get cloud-based Hadoop up and running today (ZDNet)
IBM and Hortonworks go steady with OEM deal (ZDNet)
Hortonworks adds GUI for streaming data, “Flex Support” for hybrid cloud (ZDNet)
Projects in Hadoop and Big Data: Learn by Building Apps (TechRepublic Academy)

James Sanders

James Sanders is an analyst for 451 Research. He was formerly a Staff Technology Writer for TechRepublic.