Linear-size CDAWG: New repetition-aware indexing and grammar compression

Takuya Takagi, Keisuke Goto, Yuta Fujishige, Shunsuke Inenaga, Hiroki Arimura

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with O(ẽ_T log n) bits of space allowing for O(log n)-time random and O(1)-time sequential accesses to edge labels, and O(m log σ + occ)-time pattern matching. Here, ẽ_T is the number of all extensions of maximal repeats in T, n and m are respectively the lengths of the text T and a given pattern, σ is the alphabet size, and occ is the number of occurrences of the pattern in T. The repetitiveness measure ẽ_T is known to be much smaller than the text length n for highly repetitive text. For constant alphabets, our L-CDAWGs achieve O(m + occ) pattern matching time with O(e_T^r log n) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of log log n, with the same space complexity. Here, e_T^r is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size O(ẽ_T) for a given text T in O(n + ẽ_T log σ) time.
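
The random-access claim can be illustrated on a toy grammar. The following Python sketch is not the L-CDAWG construction from the paper; it only shows, under simplifying assumptions, how a straight-line program (SLP) lets one read a single character without decompressing the text, in time proportional to the height of the derivation tree. The rule encoding and the names expansion_lengths and access are hypothetical, chosen for this illustration.

```python
from typing import List, Tuple, Union

# Assumed encoding: a rule is either a terminal character
# or a pair (left, right) of indices of earlier rules.
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string derived by each variable (rules listed bottom-up)."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules: List[Rule], lengths: List[int], var: int, i: int) -> str:
    """Return character i (0-based) of the expansion of variable var.

    Walks one root-to-leaf path of the derivation tree, so the cost is
    proportional to the tree height, not to the text length.
    """
    while True:
        rule = rules[var]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i < lengths[left]:
            var = left
        else:
            i -= lengths[left]
            var = right


# Toy grammar for "abababab":
# X0 -> a, X1 -> b, X2 -> X0 X1, X3 -> X2 X2, X4 -> X3 X3
rules: List[Rule] = ["a", "b", (0, 1), (2, 2), (3, 3)]
lengths = expansion_lengths(rules)
text = "".join(access(rules, lengths, 4, i) for i in range(lengths[4]))
```

Here five rules describe a text of length eight, and each access descends at most three levels. For a balanced grammar the height is logarithmic in n; the O(log n) bound stated in the abstract relies on the paper's own data structure, which this sketch does not reproduce.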

Original language: English
Title of host publication: String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings
Editors: Rossano Venturini, Gabriele Fici, Marinella Sciortino
Publisher: Springer Verlag
Pages: 304-316
Number of pages: 13
ISBN (Print): 9783319674278
DOIs: https://doi.org/10.1007/978-3-319-67428-5_26
Publication status: Published - Jan 1 2017
Event: 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017 - Palermo, Italy
Duration: Sep 26 2017 to Sep 29 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10508 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017
Country: Italy
City: Palermo
Period: 9/26/17 to 9/29/17

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Takagi, T., Goto, K., Fujishige, Y., Inenaga, S., & Arimura, H. (2017). Linear-size CDAWG: New repetition-aware indexing and grammar compression. In R. Venturini, G. Fici, & M. Sciortino (Eds.), String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings (pp. 304-316). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10508 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-67428-5_26

@inproceedings{898f2175d7a445ee9795b64fefad3bf5,
title = "Linear-size CDAWG: New repetition-aware indexing and grammar compression",
author = "Takuya Takagi and Keisuke Goto and Yuta Fujishige and Shunsuke Inenaga and Hiroki Arimura",
year = "2017",
month = "1",
day = "1",
doi = "10.1007/978-3-319-67428-5_26",
language = "English",
isbn = "9783319674278",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "304--316",
editor = "Rossano Venturini and Gabriele Fici and Marinella Sciortino",
booktitle = "String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings",
address = "Germany",

}