Java Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload
This paper takes a detailed look at the Garbage Collection (GC) characteristics of the TeraSort workload running on an Apache Hadoop2 framework deployed on a seven-node cluster. Apache Hadoop is a Java-based framework that facilitates distributed computing on commodity hardware. TeraSort is an example MapReduce application that ships with the Apache Hadoop distribution and is intended for sorting terabytes of data. The paper recommends Java Virtual Machine (JVM) flags that can help tune GC behavior on a similar TeraSort setup, and it shows how GC tuning yielded as much as a 7% gain in TeraSort performance on the experimental cluster.