Compressing Intermediate Keys Between Mappers and Reducers in SciHadoop
In Hadoop, mappers send data to reducers in the form of key/value pairs. Hadoop's default process for transmitting this intermediate data can incur very high overhead, especially for scientific data containing multiple variables in a multi-dimensional space. For example, for a 3D scalar field of a variable "windspeed1", the keys were 6.75 times the size of the values. Much of the disk and network bandwidth used in "shuffling" this intermediate data is consumed by repeatedly transmitting the variable name with each value. This significant waste of resources stems from an assumption fundamental to Hadoop's design: that all key/value pairs are independent.
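To make the quoted 6.75 ratio concrete, the following is a minimal sketch of how per-record key overhead can be measured. The byte layout is an assumption for illustration only (a 1-byte name length, the UTF-8 variable name, a 4-byte dimension count, one 4-byte integer per coordinate, and a 4-byte float value); the text does not state the actual serialization format, and the helper names `serialize_key` and `serialize_value` are hypothetical. Under this assumed layout the key for "windspeed1" at a 3D coordinate happens to come out 6.75 times the size of its value.

```python
import struct


def serialize_key(variable: str, coords: tuple) -> bytes:
    """Serialize a key under an ASSUMED layout (not taken from the source):
    1-byte name length, UTF-8 name, 4-byte dimension count,
    then one 4-byte big-endian int per coordinate."""
    name = variable.encode("utf-8")
    if len(name) > 127:
        raise ValueError("this sketch only handles names up to 127 bytes")
    return (
        struct.pack(">B", len(name))
        + name
        + struct.pack(">i", len(coords))
        + struct.pack(f">{len(coords)}i", *coords)
    )


def serialize_value(sample: float) -> bytes:
    """Serialize one scalar sample as a 4-byte big-endian float (assumed)."""
    return struct.pack(">f", sample)


key = serialize_key("windspeed1", (0, 0, 0))
value = serialize_value(12.5)

# Bytes spent on the variable name (length prefix plus characters),
# which are identical in every record of the same variable.
name_bytes = 1 + len("windspeed1".encode("utf-8"))

ratio = len(key) / len(value)
print(f"key={len(key)} B, value={len(value)} B, ratio={ratio}")
print(f"variable name accounts for {name_bytes} of {len(key)} key bytes")
```

In this sketch the variable name alone takes 11 of the 27 key bytes, and it is repeated unchanged for every value of that variable, which is the redundancy that treating each key/value pair as independent cannot remove.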