Daniel Vicory

Institution: 
Allan Hancock College
Year: 
2011

Novel Optimization of Task Scheduling Within Mapreduce/Hadoop

MapReduce, an algorithm created by Google, can be used to work with large datasets, such as indexing the web. The simple and distributed approach to problems by MapReduce has afforded its widespread use, with the leading implementation being Hadoop. However, non-uniform data and heterogeneous clusters mean that tasks usually finish executing out of sync, which, due to the nature of MapReduce, can leave computing power untapped. SkewReduce, a project from researchers at the University of Washington, unveiled a method to reduce the skew, or difference in task completion times, through the use of cost analysis functions and sample data. With these two pieces, their framework can calculate how long the algorithm will take on any given computer and partition the dataset optimally so that tasks finish together, reaching theoretical efficiency. However, because of the cost functions, it is not an out of the box solution to skew. Therefore, we propose a novel optimization of task scheduling that doesn’t require these additions. Our algorithm stops tasks that have been executing for too long in comparison to other tasks and redistributes the work to the cluster. It replaces SkewReduce’s optimizer and cost portions, so we can compare to previous research on the performance of SkewReduce and Hadoop’s task scheduler. This project is still in progress, and we do not yet have our task scheduler complete to benchmark with. We are confident we should be able to make modest performance gains against Hadoop’s task scheduler and approach the limit of what is possible without any changes to existing algorithms or knowing the cost.

UC Santa Barbara Center for Science and Engineering Partnerships UCSB California NanoSystems Institute