TY - JOUR
T1 - Sensitivity of string compressors and repetitiveness measures
AU - Akagi, Tooru
AU - Funakoshi, Mitsuru
AU - Inenaga, Shunsuke
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Numbers JP20J21147 (MF) and JP22H03551 (SI), and by JST PRESTO Grant Number JPMJPR1922 (SI). The authors thank Yuichi Yoshida for his helpful comments. The authors thank anonymous referees for pointing out some errors in the earlier version of this work and for their suggestions to improve the paper.
Publisher Copyright:
© 2022 The Authors
PY - 2023/3
Y1 - 2023/3
N2 - The sensitivity of a string compression algorithm C asks how much the output size C(T) for an input string T can increase when a single character edit operation is performed on T. This notion enables one to measure the robustness of compression algorithms in terms of errors and/or dynamic changes occurring in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, which is defined by $\max_{T \in \Sigma^n} \{C(T')/C(T) : \mathrm{ed}(T, T') = 1\}$, where $\mathrm{ed}(T, T')$ denotes the edit distance between $T$ and $T'$. In particular, for the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is only a small constant (2 or 3, depending on the version of the Lempel-Ziv 77 and the edit operation type), i.e., the size of the Lempel-Ziv 77 factorizations can be larger by only a small constant factor. We strengthen our upper bound results by presenting matching lower bounds on the worst-case sensitivity for all these major versions of the Lempel-Ziv 77 factorizations. We generalize these results to the smallest bidirectional scheme b. In addition, we show that the sensitivity of a grammar-based compressor called GCIS (Grammar Compression by Induced Sorting) is also a small constant. Further, we extend the notion of the worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size γ and the substring complexity δ, and show that the worst-case sensitivity of δ is also a small constant. These results contrast with the previously known related results such that the size $z_{78}$ of the Lempel-Ziv 78 factorization can increase by a factor of $\Omega(n^{1/4})$ (shown by Lagarde and Perifel), and the number r of runs in the Burrows-Wheeler transform can increase by a factor of $\Omega(\log n)$ (shown by Giuliani et al.) when a character is prepended to an input string of length n. By applying our sensitivity bounds of δ or the smallest grammar to known results (cf. Navarro's survey), some non-trivial upper bounds for the sensitivities of important string compressors and repetitiveness measures including γ, r, LZ-End, RePair, LongestMatch, and AVL-grammar are derived. We also exhibit the worst-case additive sensitivity $\max_{T \in \Sigma^n} \{C(T') - C(T) : \mathrm{ed}(T, T') = 1\}$, which allows one to observe more details in the changes of the output sizes.
AB - The sensitivity of a string compression algorithm C asks how much the output size C(T) for an input string T can increase when a single character edit operation is performed on T. This notion enables one to measure the robustness of compression algorithms in terms of errors and/or dynamic changes occurring in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, which is defined by $\max_{T \in \Sigma^n} \{C(T')/C(T) : \mathrm{ed}(T, T') = 1\}$, where $\mathrm{ed}(T, T')$ denotes the edit distance between $T$ and $T'$. In particular, for the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is only a small constant (2 or 3, depending on the version of the Lempel-Ziv 77 and the edit operation type), i.e., the size of the Lempel-Ziv 77 factorizations can be larger by only a small constant factor. We strengthen our upper bound results by presenting matching lower bounds on the worst-case sensitivity for all these major versions of the Lempel-Ziv 77 factorizations. We generalize these results to the smallest bidirectional scheme b. In addition, we show that the sensitivity of a grammar-based compressor called GCIS (Grammar Compression by Induced Sorting) is also a small constant. Further, we extend the notion of the worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size γ and the substring complexity δ, and show that the worst-case sensitivity of δ is also a small constant. These results contrast with the previously known related results such that the size $z_{78}$ of the Lempel-Ziv 78 factorization can increase by a factor of $\Omega(n^{1/4})$ (shown by Lagarde and Perifel), and the number r of runs in the Burrows-Wheeler transform can increase by a factor of $\Omega(\log n)$ (shown by Giuliani et al.) when a character is prepended to an input string of length n. By applying our sensitivity bounds of δ or the smallest grammar to known results (cf. Navarro's survey), some non-trivial upper bounds for the sensitivities of important string compressors and repetitiveness measures including γ, r, LZ-End, RePair, LongestMatch, and AVL-grammar are derived. We also exhibit the worst-case additive sensitivity $\max_{T \in \Sigma^n} \{C(T') - C(T) : \mathrm{ed}(T, T') = 1\}$, which allows one to observe more details in the changes of the output sizes.
UR - http://www.scopus.com/inward/record.url?scp=85146054591&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85146054591&partnerID=8YFLogxK
U2 - 10.1016/j.ic.2022.104999
DO - 10.1016/j.ic.2022.104999
M3 - Article
AN - SCOPUS:85146054591
SN - 0890-5401
VL - 291
JO - Information and Computation
JF - Information and Computation
M1 - 104999
ER -