TY - GEN
T1 - An Upper Bound and Linear-Space Queries on the LZ-End Parsing
AU - Kempa, Dominik
AU - Saha, Barna
N1 - Publisher Copyright:
Copyright © 2022 by SIAM.
PY - 2022
Y1 - 2022
N2 - Lempel-Ziv (LZ77) compression is the most commonly used lossless compression algorithm. The basic idea is to greedily break the input string into blocks (called "phrases"), every time forming as a phrase the longest prefix of the unprocessed part that has an earlier occurrence. In 2010, Kreft and Navarro introduced a variant of LZ77 called LZ-End, that additionally requires the previous occurrence of each phrase to end at the boundary of an already existing phrase. Due to its excellent practical performance as a compression algorithm and a compressed index, they conjectured that it achieves a compression that can be provably upper-bounded in terms of the LZ77 size. Despite the recent progress in understanding such relation for other compression algorithms (e.g., the run-length encoded Burrows-Wheeler transform), no such result is known for LZ-End. We prove that for any string of length n, the number ze of phrases in the LZ-End parsing satisfies ze = O(z log2 n), where z is the number of phrases in the LZ77 parsing. This puts LZ-End among the strongest dictionary compressors and solves a decade-old open problem of Kreft and Navarro. Using our techniques we also derive bounds for other variants of LZ-End and with respect to other compression measures. Our second contribution is a data structure that implements random access queries to the text in O(ze) space and O(polylog n) time. This is the first linear-size structure on LZ-End that efficiently implements such queries. All previous data structures either incur a logarithmic penalty in the space or have slow queries. We also show how to extend these techniques to support longest-common-extension (LCE) queries.
AB - Lempel-Ziv (LZ77) compression is the most commonly used lossless compression algorithm. The basic idea is to greedily break the input string into blocks (called "phrases"), every time forming as a phrase the longest prefix of the unprocessed part that has an earlier occurrence. In 2010, Kreft and Navarro introduced a variant of LZ77 called LZ-End, that additionally requires the previous occurrence of each phrase to end at the boundary of an already existing phrase. Due to its excellent practical performance as a compression algorithm and a compressed index, they conjectured that it achieves a compression that can be provably upper-bounded in terms of the LZ77 size. Despite the recent progress in understanding such relation for other compression algorithms (e.g., the run-length encoded Burrows-Wheeler transform), no such result is known for LZ-End. We prove that for any string of length n, the number ze of phrases in the LZ-End parsing satisfies ze = O(z log2 n), where z is the number of phrases in the LZ77 parsing. This puts LZ-End among the strongest dictionary compressors and solves a decade-old open problem of Kreft and Navarro. Using our techniques we also derive bounds for other variants of LZ-End and with respect to other compression measures. Our second contribution is a data structure that implements random access queries to the text in O(ze) space and O(polylog n) time. This is the first linear-size structure on LZ-End that efficiently implements such queries. All previous data structures either incur a logarithmic penalty in the space or have slow queries. We also show how to extend these techniques to support longest-common-extension (LCE) queries.
UR - https://www.scopus.com/pages/publications/85122386379
U2 - 10.1137/1.9781611977073.111
DO - 10.1137/1.9781611977073.111
M3 - Conference contribution
AN - SCOPUS:85122386379
T3 - Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
SP - 2847
EP - 2866
BT - ACM-SIAM Symposium on Discrete Algorithms, SODA 2022
PB - Association for Computing Machinery
T2 - 33rd Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2022
Y2 - 9 January 2022 through 12 January 2022
ER -