As organizations amass ever-expanding pools of data, the task of processing this data requires substantial consideration. Hadoop, a big data storage and processing framework, was originally based on two technical white papers from Google, and has grown to be the industry standard solution for the most data-intensive organizations, including Adobe and Twitter.

TechRepublic’s cheat sheet to Hadoop is a quick introduction to the popular open-source distributed storage and processing framework. This resource will be updated periodically when there are new developments to the Hadoop ecosystem.

SEE: All of TechRepublic’s cheat sheets and smart person’s guides

Executive summary

  • What is Hadoop? Hadoop is an open-source framework that is designed for distributed storage and big data processing.
  • Why does Hadoop matter? For one-time task deployments, as well as for use cases with continuous input, Hadoop can make quick work of your data.
  • Who does Hadoop affect? Organizations that handle large quantities of data turn to Hadoop as their first choice for efficient storage and processing.
  • When is Hadoop available? The first version was released in April 2006. Hadoop 2.8.0 is the current stable version. Version 3.0.0-alpha4 was released on July 7, 2017, and version 3.0.0 is expected to reach general availability in October 2017.
  • How do I get Hadoop? For organizations continuously amassing more data, building your own Hadoop deployment is advisable, though public cloud providers do offer Hadoop services.

SEE: Online course: Introduction to Hadoop (TechRepublic Academy)

What is Hadoop?

Hadoop is an open-source framework developed by the Apache Software Foundation that is designed for distributed storage and big data processing using the MapReduce programming model. Hadoop operates using computer clusters, splitting files into blocks and distributing across nodes in a given cluster. Using Hadoop, MapReduce jobs can be delegated to the particular node where the relevant data is stored, allowing for faster parallel processing of data using simple programming models.

Hadoop is particularly extensible, allowing external services to interact with a Hadoop development. The core Hadoop project includes MapReduce, Hadoop Distributed File System (HDFS), YARN (a framework for scheduling and resource management), and Common (the shared set of utilities that support the use of Hadoop modules).

Other Hadoop-related projects include:

  • Cassandra, a scalable database with no single point of failure;
  • HBase, a distributed big data store that supports very large tables;
  • Spark, a fast general compute engine for data in Hadoop;
  • Pig, a high-level parallel computation framework for big data;
  • Hive, a data warehouse system that provides data summarization and ad hoc querying;
  • Mahout, a machine learning and data mining system; and
  • Ambari, a web-based management and provisioning tool for Hadoop clusters, which includes support for some non-core plugins.

Additional resources

Why does Hadoop matter?

Between user-generated data, and the logging of user activities and the necessary task of generating metrics based on those logs, many organizations routinely generate absurdly large amounts of data. Deploying a Hadoop cluster is a more efficient means of handling data storage and manipulation than traditional storage and analytics methods. The modular nature of Hadoop provides flexibility in system design, as Spark begins to supplant MapReduce in popularity.

SEE: Ebook–How to build a successful data scientist career (TechRepublic)

Hadoop can also be quite beneficial for organizations with archival data that requires analysis and/or modification. TimesMachine, a service from The New York Times that allows subscribers to read historical issues of the venerable newspaper, was built using Hadoop. By using Amazon EC2, Hadoop, and custom code, 405,000 large TIFF images, 405,000 XML files, and 3.3 million SGML were converted to 810,000 PNG images and 405,000 JavaScript files in less than 36 hours.

Additional resources

Who does Hadoop affect?

Organizations that handle large quantities of data typically turn to Hadoop as their first choice for efficient storage and processing. Perhaps foremost among these is Facebook, which announced in 2012 that its largest cluster was over 100 PB, and growing by over half a PB per day, on which over 60,000 Hive queries per day are performed.

Yahoo, a long-standing contributor to Hadoop, reports having “100,000 CPUs in >40,000 computers running Hadoop” which are used for supporting research for advertisements and web search. Twitter, another contributor, uses Hadoop “to store and process tweets, log files, and many other types of data.” Rakuten, the Japanese ecommerce giant, uses Hadoop for log analysis for its recommendation system.

Music aggregator (TechRepublic and are CBS Interactive brands) has a 100-node Hadoop cluster used for charts calculation, royalty reporting, log analysis, A/B testing, and dataset merging, as well as analyzing audio features in millions of music tracks.

Additional resources

When is Hadoop available?

The first public version of Hadoop, version 0.1.0, was released in April 2006. The next month, Yahoo deployed a 300-machine cluster, increasing to two clusters of 1,000 machines in April 2007. Yahoo moved its search index to Hadoop in February 2008, using a 10,000-core cluster.

The first Hadoop summit was hosted in March 2008 in Sunnyvale, CA. US Hadoop summits have been hosted annually in June in San Jose, CA. Starting in 2014, the European Hadoop summit has been held annually in April.

The commercial Hadoop vendor Cloudera was founded in October 2008. The competitor MapR was founded in July 2009. In June 2011, Hortonworks was founded when 24 engineers at Yahoo left to form their own company.

Hadoop 2.8.0, the current stable version, was released on March 22, 2017. Version 3.0.0-alpha4 was released on July 7, 2017. Version 3.0.0 is expected to reach general availability in October 2017.

Additional resources

How do I get Hadoop?

Generally, Hadoop is designed for deployments on clusters of hardware in data centers for organizations that have an ongoing need to process and store data continuously. As an open-source project, Hadoop is available freely from the Apache Foundation. Various organizations also provide customized versions of Hadoop with product support, including Hortonworks, Cloudera, and MapR.

For fixed sets of data that require processing (such as the aforementioned New York Times example), Hadoop is available from public cloud providers. Amazon Elastic MapReduce is a customized version of Hadoop that automates file transfer between EC2 and S3, as well as offers support for Hive. Naturally, standard Apache Hadoop can itself be run directly from EC2 and S3. Microsoft Azure HDInsight is a customized Hortonworks HDP deployment. On Google Cloud, Dataproc is a customized Spark and Hadoop service, with support for Hortonworks, Cloudera, and MapR available using bdutil.

Additional resources