Performance Evaluation of Hadoop on Virtual Machines

Download Now Date Added: Dec 2009
Format: PDF

MapReduce is a popular programming framework that is intended for automatical paralellization of computation in the cloud. MapReduce deals with data intensive applications; huge amount of data is first loaded from remote DFS, then copied as intermediate results from Mapper to Reducer, and finally written back to DFS. Along with this large amount of data transfer, many I/O operations are incurred across MapReduce instances. MapReduce runtime usually operates in a data center environment, such as Amazon EC2, where virtual machine instances, rather than physical machines, are exposed. Virtual machines have great advantages on isolation of failed instances, server consolation for better power utilization. However, they also hide from upper-level applications the data locality on physical networks, introducing unnecessary I/O operations.