TY - GEN

T1 - Computing q-gram non-overlapping frequencies on SLP compressed texts

AU - Goto, Keisuke

AU - Bannai, Hideo

AU - Inenaga, Shunsuke

AU - Takeda, Masayuki

N1 - Copyright:
Copyright 2012 Elsevier B.V., All rights reserved.

PY - 2012

Y1 - 2012

N2 - Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in O(q 2 n) time and O(qn) space where n is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for q = 2 in O(n 4logn) time and O(n 3) space.

AB - Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in O(q 2 n) time and O(qn) space where n is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for q = 2 in O(n 4logn) time and O(n 3) space.

UR - http://www.scopus.com/inward/record.url?scp=84856047503&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84856047503&partnerID=8YFLogxK

U2 - 10.1007/978-3-642-27660-6_25

DO - 10.1007/978-3-642-27660-6_25

M3 - Conference contribution

AN - SCOPUS:84856047503

SN - 9783642276590

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 301

EP - 312

BT - SOFSEM 2012

T2 - 38th Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2012

Y2 - 21 January 2012 through 27 January 2012

ER -