### Abstract

In this paper, we propose a novel approach that combines compact directed acyclic word graphs (CDAWGs) with grammar-based compression. This leads to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented in O(ẽ_{T} log n) bits of space and supports O(log n)-time random access and O(1)-time sequential access to edge labels, as well as O(m log σ + occ)-time pattern matching. Here, ẽ_{T} is the number of all extensions of maximal repeats in T, n and m are the lengths of the text T and of a given pattern, respectively, σ is the alphabet size, and occ is the number of occurrences of the pattern in T. The repetitiveness measure ẽ_{T} is known to be much smaller than the text length n for highly repetitive texts. For constant alphabets, our L-CDAWGs achieve O(m + occ) pattern matching time with O(e_{T}^{r} log n) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of log log n with the same space complexity. Here, e_{T}^{r} is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size O(ẽ_{T}) for a given text T in O(n + ẽ_{T} log σ) time.
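To make the notion of a straight-line program concrete, the following is a minimal sketch of an SLP with random access to the text it derives. It is not the construction from the paper: the class name `SLP`, the rule encoding, and the example grammar are assumptions chosen for illustration, and the access routine shown here runs in time proportional to the grammar height rather than the O(log n) bound stated in the abstract.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a rule is either a terminal character or a pair
# (left, right) of indices of earlier rules. The last rule is the start symbol.
Rule = Union[str, Tuple[int, int]]


class SLP:
    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        # length[v] = length of the string derived from rule v, computed bottom-up.
        self.length: List[int] = []
        for rule in rules:
            if isinstance(rule, str):
                self.length.append(1)
            else:
                left, right = rule
                self.length.append(self.length[left] + self.length[right])

    def access(self, i: int) -> str:
        """Return the i-th character (0-based) of the derived text
        by descending from the start symbol, without expanding the text."""
        v = len(self.rules) - 1
        if not 0 <= i < self.length[v]:
            raise IndexError(i)
        while not isinstance(self.rules[v], str):
            left, right = self.rules[v]
            if i < self.length[left]:
                v = left
            else:
                i -= self.length[left]
                v = right
        return self.rules[v]

    def expand(self) -> str:
        """Recover the whole text (for checking only; takes time linear in n times height)."""
        return "".join(self.access(i) for i in range(self.length[-1]))


# Illustrative Fibonacci-style grammar: 6 rules derive a text of length 8.
rules: List[Rule] = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
slp = SLP(rules)
```

In this sketch the grammar grows by one rule per step while the derived text grows geometrically, which is the kind of gap between grammar size and text length that grammar-based compression exploits on highly repetitive inputs.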

Original language | English |
---|---|
Title of host publication | String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings |
Editors | Rossano Venturini, Gabriele Fici, Marinella Sciortino |
Publisher | Springer Verlag |
Pages | 304-316 |
Number of pages | 13 |
ISBN (Print) | 9783319674278 |
DOIs | https://doi.org/10.1007/978-3-319-67428-5_26 |
Publication status | Published - Jan 1 2017 |
Event | 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017, Palermo, Italy (Sep 26 2017 → Sep 29 2017) |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 10508 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |

### Other

Other | 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017 |
---|---|
Country | Italy |
City | Palermo |
Period | 9/26/17 → 9/29/17 |

### All Science Journal Classification (ASJC) codes

- Theoretical Computer Science
- Computer Science (all)

### Cite this

*String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings* (pp. 304-316). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10508 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-67428-5_26

**Linear-size CDAWG: New repetition-aware indexing and grammar compression.** / Takagi, Takuya; Goto, Keisuke; Fujishige, Yuta; Inenaga, Shunsuke; Arimura, Hiroki.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings.* Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10508 LNCS, Springer Verlag, pp. 304-316, 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017, Palermo, Italy, 9/26/17. https://doi.org/10.1007/978-3-319-67428-5_26

TY - GEN

T1 - Linear-size CDAWG

T2 - New repetition-aware indexing and grammar compression

AU - Takagi, Takuya

AU - Goto, Keisuke

AU - Fujishige, Yuta

AU - Inenaga, Shunsuke

AU - Arimura, Hiroki

PY - 2017/1/1

Y1 - 2017/1/1

AB - In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with O(ẽ_{T} log n) bits of space allowing for O(log n)-time random and O(1)-time sequential accesses to edge labels, and O(m log σ + occ)-time pattern matching. Here, ẽ_{T} is the number of all extensions of maximal repeats in T, n and m are respectively the lengths of the text T and a given pattern, σ is the alphabet size, and occ is the number of occurrences of the pattern in T. The repetitiveness measure ẽ_{T} is known to be much smaller than the text length n for highly repetitive text. For constant alphabets, our L-CDAWGs achieve O(m + occ) pattern matching time with O(e_{T}^{r} log n) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of log log n, with the same space complexity. Here, e_{T}^{r} is the number of right extensions of maximal repeats in T. As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size O(ẽ_{T}) for a given text T in O(n + ẽ_{T} log σ) time.

UR - http://www.scopus.com/inward/record.url?scp=85030173354&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85030173354&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-67428-5_26

DO - 10.1007/978-3-319-67428-5_26

M3 - Conference contribution

SN - 9783319674278

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 304

EP - 316

BT - String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, Proceedings

A2 - Venturini, Rossano

A2 - Fici, Gabriele

A2 - Sciortino, Marinella

PB - Springer Verlag

ER -