Finding characteristic substrings from compressed texts

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Text mining from large scaled data is of great importance in computer science. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeating substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel problem: given a compressed text and compressed pattern, compute the representative of the equivalence class of the pattern w.r.t. the text. We present algorithms that solve the above problems in time polynomial in the size of input compressed strings. The compression scheme we consider is straight line program (SLP) which has exponential compression, and therefore our algorithms are more efficient than any existing algorithms that require decompression of given SLPs.
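To make the "exponential compression" claim concrete, here is a minimal sketch of a straight line program, assuming a toy representation that is not taken from the paper: a list of rules in which each entry is either a single character or a pair of indices of earlier rules to concatenate. The helper names (`build_doubling_slp`, `lengths`, `char_at`, `expand`) are hypothetical. The sketch only illustrates that some queries can be answered on the compressed form in time proportional to the number of rules; it is not one of the paper's algorithms.

```python
def build_doubling_slp(n):
    """Rules X0 -> 'a' and Xi -> X(i-1) X(i-1) for i = 1..n (toy example)."""
    rules = ["a"]
    for i in range(1, n + 1):
        rules.append((i - 1, i - 1))
    return rules


def lengths(rules):
    """Length of the string derived by each variable, computed bottom-up."""
    out = []
    for rule in rules:
        if isinstance(rule, str):
            out.append(1)
        else:
            left, right = rule
            out.append(out[left] + out[right])
    return out


def char_at(rules, lens, var, i):
    """Character at 0-based position i of the string derived by var,
    found by walking down the rules without expanding the string."""
    while not isinstance(rules[var], str):
        left, right = rules[var]
        if i < lens[left]:
            var = left
        else:
            i -= lens[left]
            var = right
    return rules[var]


def expand(rules, var):
    """Full decompression; only feasible for small examples."""
    rule = rules[var]
    if isinstance(rule, str):
        return rule
    return expand(rules, rule[0]) + expand(rules, rule[1])


# A small SLP deriving "abaab": X2 = X0 X1, X3 = X2 X0, X4 = X3 X2.
small = ["a", "b", (0, 1), (2, 0), (3, 2)]
small_lens = lengths(small)

# 61 rules describe a string of 2**60 characters that could never be
# decompressed in practice, yet its length is available immediately.
big = build_doubling_slp(60)
big_lens = lengths(big)
```

In this sketch `expand(small, 4)` reconstructs the short string, while `big_lens[-1]` and `char_at(big, big_lens, 60, i)` work on the 61-rule description directly, which is the kind of saving that algorithms avoiding decompression aim for.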

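Original language: English
Pages (from-to): 261-280
Number of pages: 20
Journal: International Journal of Foundations of Computer Science
Volume: 23
Issue number: 2
DOI: 10.1142/S0129054112400126
Publication status: Published - Feb 1, 2012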

Fingerprint

Equivalence classes
Computer science
Polynomials

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)

Cite this

Finding characteristic substrings from compressed texts. / Inenaga, Shunsuke; Bannai, Hideo.

In: International Journal of Foundations of Computer Science, Vol. 23, No. 2, 01.02.2012, p. 261-280.

@article{b41439cce92849459098ad5348bef100,
title = "Finding characteristic substrings from compressed texts",
abstract = "Text mining from large scaled data is of great importance in computer science. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeating substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel problem: given a compressed text and compressed pattern, compute the representative of the equivalence class of the pattern w.r.t. the text. We present algorithms that solve the above problems in time polynomial in the size of input compressed strings. The compression scheme we consider is straight line program (SLP) which has exponential compression, and therefore our algorithms are more efficient than any existing algorithms that require decompression of given SLPs.",
author = "Shunsuke Inenaga and Hideo Bannai",
year = "2012",
month = "2",
day = "1",
doi = "10.1142/S0129054112400126",
language = "English",
volume = "23",
pages = "261--280",
journal = "International Journal of Foundations of Computer Science",
issn = "0129-0541",
publisher = "World Scientific Publishing Co. Pte Ltd",
number = "2",
}

TY - JOUR

T1 - Finding characteristic substrings from compressed texts

AU - Inenaga, Shunsuke

AU - Bannai, Hideo

PY - 2012/2/1

Y1 - 2012/2/1

AB - Text mining from large scaled data is of great importance in computer science. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeating substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel problem: given a compressed text and compressed pattern, compute the representative of the equivalence class of the pattern w.r.t. the text. We present algorithms that solve the above problems in time polynomial in the size of input compressed strings. The compression scheme we consider is straight line program (SLP) which has exponential compression, and therefore our algorithms are more efficient than any existing algorithms that require decompression of given SLPs.

UR - http://www.scopus.com/inward/record.url?scp=84863101615&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863101615&partnerID=8YFLogxK

U2 - 10.1142/S0129054112400126

DO - 10.1142/S0129054112400126

M3 - Article

AN - SCOPUS:84863101615

VL - 23

SP - 261

EP - 280

JO - International Journal of Foundations of Computer Science

JF - International Journal of Foundations of Computer Science

SN - 0129-0541

IS - 2

ER -