Skip to main navigation Skip to search Skip to main content

Timely Reporting of Heavy Hitters Using External Memory

  • Shikha Singh
  • , Prashant Pandey
  • , Michael A. Bender
  • , Jonathan W. Berry
  • , Martín Farach-Colton
  • , Rob Johnson
  • , Thomas M. Kroeger
  • , Cynthia A. Phillips
  • Williams College
  • Dell
  • Sandia National Laboratories, New Mexico
  • Rutgers - The State University of New Jersey, New Brunswick

Research output: Contribution to journalArticlepeer-review

Abstract

Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω (N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.

Original languageEnglish
Article number3472392
JournalACM Transactions on Database Systems
Volume46
Issue number4
DOIs
StatePublished - Dec 2021

Keywords

  • Dictionary data structure
  • external-memory algorithms
  • streaming algorithms

Fingerprint

Dive into the research topics of 'Timely Reporting of Heavy Hitters Using External Memory'. Together they form a unique fingerprint.

Cite this