TY - GEN
T1 - Grammar Index by Induced Suffix Sorting
AU - Akagi, Tooru
AU - Köppl, Dominik
AU - Nakashima, Yuto
AU - Inenaga, Shunsuke
AU - Bannai, Hideo
AU - Takeda, Masayuki
N1 - Funding Information:
Acknowledgements. This work was supported by JSPS KAKENHI grant numbers JP21K17701 (DK), JP21K17705 (YN), JP20H04141 (HB), JP18H04098 (MT), and JST PRESTO grant number JPMJPR1922 (SI).
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - We propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC’18]. We show that this grammar exhibits a locality sensitive parsing property, which allows us to specify, given a pattern P, certain substrings of P, called cores, that are similarly parsed in the text grammar whenever these occurrences are extensible to occurrences of P. Supported by the cores, given a pattern of length m, we can locate all its occ occurrences in a text T of length n within O(m lg |S| + occ_C lg |S| lg n + occ) time, where S is the set of all characters and non-terminals, occ is the number of occurrences, and occ_C is the number of occurrences of a chosen core C of P in the right hand side of all production rules of the grammar of T. Our grammar index requires O(g) words of space and can be built in O(n) time using O(g) working space, where g is the sum of the lengths of the right hand sides of all production rules. We practically evaluate that our proposed index excels at locating long patterns in highly-repetitive texts. Our implementation is available at https://github.com/TooruAkagi/GCIS_Index.
AB - We propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC’18]. We show that this grammar exhibits a locality sensitive parsing property, which allows us to specify, given a pattern P, certain substrings of P, called cores, that are similarly parsed in the text grammar whenever these occurrences are extensible to occurrences of P. Supported by the cores, given a pattern of length m, we can locate all its occ occurrences in a text T of length n within O(m lg |S| + occ_C lg |S| lg n + occ) time, where S is the set of all characters and non-terminals, occ is the number of occurrences, and occ_C is the number of occurrences of a chosen core C of P in the right hand side of all production rules of the grammar of T. Our grammar index requires O(g) words of space and can be built in O(n) time using O(g) working space, where g is the sum of the lengths of the right hand sides of all production rules. We practically evaluate that our proposed index excels at locating long patterns in highly-repetitive texts. Our implementation is available at https://github.com/TooruAkagi/GCIS_Index.
UR - http://www.scopus.com/inward/record.url?scp=85116824926&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85116824926&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-86692-1_8
DO - 10.1007/978-3-030-86692-1_8
M3 - Conference contribution
AN - SCOPUS:85116824926
SN - 9783030866914
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 85
EP - 99
BT - String Processing and Information Retrieval - 28th International Symposium, SPIRE 2021, Proceedings
A2 - Lecroq, Thierry
A2 - Touzet, Hélène
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th International Symposium on String Processing and Information Retrieval, SPIRE 2021
Y2 - 4 October 2021 through 6 October 2021
ER -