Finding peculiar compositions of two frequent strings with background texts

Research output: Contribution to journal › Article

Abstract

We consider mining unusual patterns from a set T of target texts. A typical method outputs unusual patterns if their observed frequencies are far from their expectation estimated under an assumed probabilistic model. However, it is difficult for the method to deal with the zero frequency and thus it suffers from data sparseness. We employ another set B of background texts to define a composition xy to be peculiar if both x and y are more frequent in B than in T and conversely xy is more frequent in T. The composition xy is unusual because x and y are infrequent in T while xy is unexpectedly frequent compared to xy in B. To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm using the suffix tree and show that it scales almost linearly under practical settings of parameters. Experiments using DNA sequences show that found peculiar compositions basically appear in rRNA while patterns found by an existing method seem not to relate to specific biological functions. We also show that discovered patterns have similar lengths at which the distribution of frequencies of fixed-length substrings begins to skew. This fact explains why our method can find long peculiar compositions.
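The definition of a peculiar composition can be illustrated with a small brute-force sketch. This is not the suffix-tree algorithm the abstract refers to; it only enumerates substrings up to a fixed length and applies the three frequency comparisons directly. The function names, the `max_len` cutoff, and the choice to compare relative frequencies (occurrence count divided by total text length) are assumptions made for illustration, since the abstract does not state how frequencies in T and B are normalized or thresholded.

```python
from collections import Counter


def substring_counts(texts, max_len):
    """Count occurrences of every substring of length 1..max_len in texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    return counts


def peculiar_compositions(target, background, max_len=6):
    """Return sorted (x, y) pairs where x and y are relatively more frequent
    in the background than in the target, while xy is relatively more
    frequent in the target than in the background.

    Assumption: "more frequent" means a strictly larger relative frequency.
    """
    t_counts = substring_counts(target, max_len)
    b_counts = substring_counts(background, max_len)
    t_total = sum(len(t) for t in target)
    b_total = sum(len(b) for b in background)

    def rel(counts, total, s):
        # A Counter returns 0 for unseen strings, so zero frequency is fine.
        return counts[s] / total if total else 0.0

    found = []
    for xy in t_counts:
        # xy must be more frequent in the target than in the background.
        if rel(t_counts, t_total, xy) <= rel(b_counts, b_total, xy):
            continue
        # Try every split of xy into a non-empty prefix x and suffix y.
        for k in range(1, len(xy)):
            x, y = xy[:k], xy[k:]
            if (rel(b_counts, b_total, x) > rel(t_counts, t_total, x)
                    and rel(b_counts, b_total, y) > rel(t_counts, t_total, y)):
                found.append((x, y))
    return sorted(found)


if __name__ == "__main__":
    # "ab" never occurs in the background, yet "a" and "b" are common there.
    print(peculiar_compositions(["abcc"], ["bbaa"]))  # [('a', 'b')]
```

In this toy input the composition has zero frequency in the background, and the comparison still works without any probabilistic model. The enumeration is quadratic in the worst case, which is why a suffix-tree approach matters at the scale of DNA sequences.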

Original language: English
Pages (from-to): 499-530
Number of pages: 32
Journal: Knowledge and Information Systems
Volume: 41
Issue number: 2
DOI: 10.1007/s10115-013-0688-9
Publication status: Published - Oct 17 2014

Fingerprint

Chemical analysis
DNA sequences
Experiments
Statistical Models

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Hardware and Architecture
  • Artificial Intelligence

Cite this

Finding peculiar compositions of two frequent strings with background texts. / Ikeda, Daisuke; Suzuki, Einoshin.

In: Knowledge and Information Systems, Vol. 41, No. 2, 17.10.2014, p. 499-530.

@article{a331fc7b8c6c4bde8293691254cd8f74,
title = "Finding peculiar compositions of two frequent strings with background texts",
abstract = "We consider mining unusual patterns from a set T of target texts. A typical method outputs unusual patterns if their observed frequencies are far from their expectation estimated under an assumed probabilistic model. However, it is difficult for the method to deal with the zero frequency and thus it suffers from data sparseness. We employ another set B of background texts to define a composition xy to be peculiar if both x and y are more frequent in B than in T and conversely xy is more frequent in T. The composition xy is unusual because x and y are infrequent in T while xy is unexpectedly frequent compared to xy in B. To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm using the suffix tree and show that it scales almost linearly under practical settings of parameters. Experiments using DNA sequences show that found peculiar compositions basically appear in rRNA while patterns found by an existing method seem not to relate to specific biological functions. We also show that discovered patterns have similar lengths at which the distribution of frequencies of fixed-length substrings begins to skew. This fact explains why our method can find long peculiar compositions.",
author = "Daisuke Ikeda and Einoshin Suzuki",
year = "2014",
month = "10",
day = "17",
doi = "10.1007/s10115-013-0688-9",
language = "English",
volume = "41",
pages = "499--530",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "2",
}

TY - JOUR

T1 - Finding peculiar compositions of two frequent strings with background texts

AU - Ikeda, Daisuke

AU - Suzuki, Einoshin

PY - 2014/10/17

Y1 - 2014/10/17

N2 - We consider mining unusual patterns from a set T of target texts. A typical method outputs unusual patterns if their observed frequencies are far from their expectation estimated under an assumed probabilistic model. However, it is difficult for the method to deal with the zero frequency and thus it suffers from data sparseness. We employ another set B of background texts to define a composition xy to be peculiar if both x and y are more frequent in B than in T and conversely xy is more frequent in T. The composition xy is unusual because x and y are infrequent in T while xy is unexpectedly frequent compared to xy in B. To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm using the suffix tree and show that it scales almost linearly under practical settings of parameters. Experiments using DNA sequences show that found peculiar compositions basically appear in rRNA while patterns found by an existing method seem not to relate to specific biological functions. We also show that discovered patterns have similar lengths at which the distribution of frequencies of fixed-length substrings begins to skew. This fact explains why our method can find long peculiar compositions.

AB - We consider mining unusual patterns from a set T of target texts. A typical method outputs unusual patterns if their observed frequencies are far from their expectation estimated under an assumed probabilistic model. However, it is difficult for the method to deal with the zero frequency and thus it suffers from data sparseness. We employ another set B of background texts to define a composition xy to be peculiar if both x and y are more frequent in B than in T and conversely xy is more frequent in T. The composition xy is unusual because x and y are infrequent in T while xy is unexpectedly frequent compared to xy in B. To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm using the suffix tree and show that it scales almost linearly under practical settings of parameters. Experiments using DNA sequences show that found peculiar compositions basically appear in rRNA while patterns found by an existing method seem not to relate to specific biological functions. We also show that discovered patterns have similar lengths at which the distribution of frequencies of fixed-length substrings begins to skew. This fact explains why our method can find long peculiar compositions.

UR - http://www.scopus.com/inward/record.url?scp=84919475563&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84919475563&partnerID=8YFLogxK

U2 - 10.1007/s10115-013-0688-9

DO - 10.1007/s10115-013-0688-9

M3 - Article

AN - SCOPUS:84919475563

VL - 41

SP - 499

EP - 530

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 2

ER -