Abstract
In this paper, we propose a novel approach that combines compact directed acyclic word graphs (CDAWGs) with grammar-based compression. This leads to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented in O(ẽ_T log n) bits of space, allowing for O(log n)-time random and O(1)-time sequential access to edge labels, and O(m log σ + occ)-time pattern matching. Here, ẽ_T is the number of all extensions of maximal repeats in T, n and m are the lengths of the text T and a given pattern, respectively, σ is the alphabet size, and occ is the number of occurrences of the pattern in T. The repetitiveness measure ẽ_T is known to be much smaller than the text length n for highly repetitive text. For constant alphabets, our L-CDAWGs achieve O(m + occ) pattern matching time with O(e_T^r log n) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of log log n, with the same space complexity. Here, e_T^r is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size O(ẽ_T) for a given text T in O(n + ẽ_T log σ) time.
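To make the notion of a straight-line program concrete, the following is a minimal sketch of a generic SLP with random access by descent over stored expansion lengths. It is not the paper's construction: the grammar `RULES` and the helper names `length`, `access`, and `expand` are hypothetical, and access time here is proportional to the grammar's depth rather than the O(log n) bound stated in the abstract.

```python
from functools import lru_cache

# Hypothetical toy SLP (an assumption for illustration, not from the paper).
# Each nonterminal maps either to a single character or to a pair of
# nonterminals, so the grammar derives exactly one string.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # ab
    "D": ("C", "C"),  # abab
    "E": ("D", "C"),  # ababab
    "F": ("E", "D"),  # ababababab
}


@lru_cache(maxsize=None)
def length(sym):
    """Length of the string derived from nonterminal sym."""
    rhs = RULES[sym]
    if isinstance(rhs, str):
        return 1
    left, right = rhs
    return length(left) + length(right)


def access(sym, i):
    """Return the i-th character (0-based) of sym's expansion without
    expanding the whole string: descend left or right by comparing i
    with the length of the left child."""
    if not 0 <= i < length(sym):
        raise IndexError(i)
    while True:
        rhs = RULES[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < length(left):
            sym = left
        else:
            i -= length(left)
            sym = right


def expand(sym):
    """Fully expand sym; used only to check access() on small inputs."""
    rhs = RULES[sym]
    if isinstance(rhs, str):
        return rhs
    return expand(rhs[0]) + expand(rhs[1])
```

In this toy grammar, six rules derive a text of length 10; on highly repetitive inputs the gap between grammar size and text length can be much larger, which is the regime the abstract targets.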
Original language | English |
---|---|
Title of host publication | String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings |
Editors | Rossano Venturini, Gabriele Fici, Marinella Sciortino |
Publisher | Springer Verlag |
Pages | 304-316 |
Number of pages | 13 |
ISBN (Print) | 9783319674278 |
DOI | 10.1007/978-3-319-67428-5_26 |
Publication status | Published - Jan 1, 2017 |
Event | 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017 - Palermo, Italy. Duration: Sep 26, 2017 → Sep 29, 2017 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 10508 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Other
Other | 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017 |
---|---|
Country | Italy |
City | Palermo |
Period | 9/26/17 → 9/29/17 |
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Computer Science (all)
Cite this
Linear-size CDAWG: New repetition-aware indexing and grammar compression. / Takagi, Takuya; Goto, Keisuke; Fujishige, Yuta; Inenaga, Shunsuke; Arimura, Hiroki.
String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings. ed. / Rossano Venturini; Gabriele Fici; Marinella Sciortino. Springer Verlag, 2017. p. 304-316 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10508 LNCS).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Linear-size CDAWG
T2 - New repetition-aware indexing and grammar compression
AU - Takagi, Takuya
AU - Goto, Keisuke
AU - Fujishige, Yuta
AU - Inenaga, Shunsuke
AU - Arimura, Hiroki
PY - 2017/1/1
Y1 - 2017/1/1
AB - In this paper, we propose a novel approach that combines compact directed acyclic word graphs (CDAWGs) with grammar-based compression. This leads to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented in O(ẽ_T log n) bits of space, allowing for O(log n)-time random and O(1)-time sequential access to edge labels, and O(m log σ + occ)-time pattern matching. Here, ẽ_T is the number of all extensions of maximal repeats in T, n and m are the lengths of the text T and a given pattern, respectively, σ is the alphabet size, and occ is the number of occurrences of the pattern in T. The repetitiveness measure ẽ_T is known to be much smaller than the text length n for highly repetitive text. For constant alphabets, our L-CDAWGs achieve O(m + occ) pattern matching time with O(e_T^r log n) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of log log n, with the same space complexity. Here, e_T^r is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size O(ẽ_T) for a given text T in O(n + ẽ_T log σ) time.
UR - http://www.scopus.com/inward/record.url?scp=85030173354&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85030173354&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-67428-5_26
DO - 10.1007/978-3-319-67428-5_26
M3 - Conference contribution
AN - SCOPUS:85030173354
SN - 9783319674278
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 304
EP - 316
BT - String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings
A2 - Venturini, Rossano
A2 - Fici, Gabriele
A2 - Sciortino, Marinella
PB - Springer Verlag
ER -