
Improved MapReduce load balancing through distribution-dependent hash function optimization

  • Stony Brook University
  • University of Maryland, College Park

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers, thereby reducing the overall execution time. In this paper, we propose a lightweight optimization approach for MapReduce systems to minimize the makespan for repetitive tasks involving a typical frequency distribution. Our idea is to analyze the observed frequency distribution for the given task so as to identify an optimal offset parameter c to add in the hash function to minimize makespan. For two different bucketing methods, modulo labeling and consecutive binning, we present efficient algorithms for finding the optimal value of c. Finally, we present simulation results for both bucketing methods. The results vary with the data distribution and the number of reducers, but reduce makespan by 20% on average for power-law distributions. Results are confirmed with experiments on well-known real-world data sets.
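
The idea of tuning an offset c against an observed frequency distribution can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: keys are small non-negative integers, the hash has the form ((key + c) mod P) with a hypothetical prime P, and the search over c is plain brute force rather than the efficient algorithms the abstract refers to. The names `modulo_label`, `consecutive_bin`, `makespan`, `best_offset`, and `zipf_counts` are likewise hypothetical.

```python
# Brute-force sketch: choose an offset c that minimizes the heaviest reducer load.
# ASSUMPTIONS (not from the paper): integer keys, the hash form below, the prime P.

P = 10007  # hypothetical prime modulus, chosen larger than every key


def modulo_label(key, c, m, p=P):
    """Assumed 'modulo labeling': reducer = offset hash value mod m."""
    return ((key + c) % p) % m


def consecutive_bin(key, c, m, p=P):
    """Assumed 'consecutive binning': split the range [0, p) into m contiguous bins."""
    return ((key + c) % p) * m // p


def makespan(freq, c, m, bucket):
    """Heaviest reducer load when keys are assigned with offset c."""
    loads = [0] * m
    for key, count in freq.items():
        loads[bucket(key, c, m)] += count
    return max(loads)


def best_offset(freq, m, bucket, candidates):
    """Try every candidate offset; return the one with the smallest makespan."""
    return min(candidates, key=lambda c: (makespan(freq, c, m, bucket), c))


def zipf_counts(n, scale):
    """Synthetic power-law-like frequencies: key k occurs scale // k times."""
    return {k: scale // k for k in range(1, n + 1)}


if __name__ == "__main__":
    freq = zipf_counts(200, 1000)
    m = 8
    for bucket in (modulo_label, consecutive_bin):
        c = best_offset(freq, m, bucket, range(500))
        print(bucket.__name__, "c =", c,
              "makespan:", makespan(freq, 0, m, bucket), "->",
              makespan(freq, c, m, bucket))
```

A note on the assumed hash form: adding c directly before a reduction mod m would only rotate the reducer labels and leave the makespan unchanged, which is why this sketch applies the offset inside a larger modulus P first. How the paper itself incorporates c may differ.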

Original language: English
Title of host publication: Proceedings - 2020 IEEE 26th International Conference on Parallel and Distributed Systems, ICPADS 2020
Publisher: IEEE Computer Society
Pages: 9-18
Number of pages: 10
ISBN (Electronic): 9781728190747
DOIs
State: Published - Dec 2020
Event: 26th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2020 - Virtual, Hong Kong, Hong Kong
Duration: Dec 2 2020 → Dec 4 2020

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2020-December
ISSN (Print): 1521-9097

Conference

Conference: 26th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2020
Country/Territory: Hong Kong
City: Virtual, Hong Kong
Period: 12/2/20 → 12/4/20

Keywords

  • Apache Spark
  • Cloud Computing
  • Distributed-memory Cluster Computing
  • Hadoop
  • Hashing
  • Load Balancing
  • MapReduce
