Skip to main navigation Skip to search Skip to main content

Timely Reporting of Heavy Hitters using External Memory

  • Prashant Pandey
  • , Shikha Singh
  • , Michael A. Bender
  • , Jonathan W. Berry
  • , Martín Farach-Colton
  • , Rob Johnson
  • , Thomas M. Kroeger
  • , Cynthia A. Phillips
  • Carnegie Mellon University
  • Wellesley College
  • Sandia National Laboratories, New Mexico
  • Rutgers - The State University of New Jersey, New Brunswick
  • Dell
  • Sandia National Laboratories, California

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

10 Scopus citations

Abstract

Given an input stream of size N, a †-heavy hitter is an item that occurs at least † N times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = † N-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ∼100K observations per second.

Original languageEnglish
Title of host publicationSIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages1431-1446
Number of pages16
ISBN (Electronic)9781450367356
DOIs
StatePublished - Jun 14 2020
Event2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020 - Portland, United States
Duration: Jun 14 2020Jun 19 2020

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
Country/TerritoryUnited States
CityPortland
Period06/14/2006/19/20

Keywords

  • dictionary data structure
  • external-memory algorithms
  • streaming algorithms

Fingerprint

Dive into the research topics of 'Timely Reporting of Heavy Hitters using External Memory'. Together they form a unique fingerprint.

Cite this