Finding characteristic substrings from compressed texts

Shunsuke Inenaga, Hideo Bannai

Research output: Contribution to journal, Article, peer-reviewed

3 Citations (Scopus)

Abstract

Text mining from large-scale data is of great importance in computer science. In this paper, we consider fundamental text mining problems on compressed strings: computing a longest repeating substring, a longest non-overlapping repeating substring, a most frequent substring, and a most frequent non-overlapping substring of a given compressed string. We also tackle the following novel problem: given a compressed text and a compressed pattern, compute the representative of the equivalence class of the pattern with respect to the text. We present algorithms that solve these problems in time polynomial in the size of the input compressed strings. The compression scheme we consider is the straight line program (SLP), which can achieve exponential compression; our algorithms are therefore more efficient than any existing algorithm that requires decompressing the given SLPs.
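To make the notion of an SLP concrete, the following is a minimal illustrative sketch and not one of the paper's algorithms. It assumes the usual definition of an SLP as a sequence of rules in which each variable derives either a single character or the concatenation of two earlier variables. The names `slp_lengths`, `char_at`, and `doubling_slp` are hypothetical. The sketch shows that some queries (here, string length and random access) can be answered from the rules alone, even when the derived string is exponentially longer than the SLP.

```python
from typing import Dict, Tuple, Union

# A rule is a single character (terminal) or a pair of earlier variable names.
Rule = Union[str, Tuple[str, str]]


def slp_lengths(rules: Dict[str, Rule]) -> Dict[str, int]:
    """Length of the string derived by each variable, without expanding it.

    Assumes the rules are listed in dependency order (a variable only
    refers to variables defined before it).
    """
    lengths: Dict[str, int] = {}
    for name, rule in rules.items():
        if isinstance(rule, str):
            lengths[name] = 1
        else:
            left, right = rule
            lengths[name] = lengths[left] + lengths[right]
    return lengths


def char_at(rules: Dict[str, Rule], lengths: Dict[str, int], name: str, i: int) -> str:
    """Return the i-th character (0-based) of the string derived by `name`."""
    if not 0 <= i < lengths[name]:
        raise IndexError(i)
    while True:
        rule = rules[name]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i < lengths[left]:
            name = left            # position falls in the left part
        else:
            i -= lengths[left]     # position falls in the right part
            name = right


def doubling_slp(k: int) -> Dict[str, Rule]:
    """SLP with k + 1 rules deriving the string 'a' repeated 2**k times."""
    rules: Dict[str, Rule] = {"X0": "a"}
    for j in range(1, k + 1):
        rules[f"X{j}"] = (f"X{j - 1}", f"X{j - 1}")
    return rules


if __name__ == "__main__":
    rules = doubling_slp(40)
    lengths = slp_lengths(rules)
    # 41 rules describe a string of 2**40 characters.
    print(len(rules), lengths["X40"])
    print(char_at(rules, lengths, "X40", 2**39))
```

Both functions run in time proportional to the number of rules (or the depth of the derivation), not to the length of the derived string, which is the kind of gap that makes avoiding decompression worthwhile.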

Original language: English
Pages (from-to): 261-280
Number of pages: 20
Journal: International Journal of Foundations of Computer Science
Volume: 23
Issue number: 2
Publication status: Published - Feb 1 2012

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
