Finding peculiar compositions of two frequent strings with background texts

Research output: Contribution to journal › Article › peer-review

Abstract

We consider mining unusual patterns from a set $T$ of target texts. A typical method outputs a pattern as unusual if its observed frequency is far from the expectation estimated under an assumed probabilistic model. However, such a method has difficulty handling zero frequencies and thus suffers from data sparseness. We employ another set $B$ of background texts and define a composition $xy$ to be peculiar if both $x$ and $y$ are more frequent in $B$ than in $T$ and, conversely, $xy$ is more frequent in $T$. The composition $xy$ is unusual because $x$ and $y$ are infrequent in $T$ while $xy$ is unexpectedly frequent compared to $xy$ in $B$. To find frequent subpatterns and infrequent patterns simultaneously, we develop a fast algorithm using the suffix tree and show that it scales almost linearly under practical parameter settings. Experiments on DNA sequences show that the peculiar compositions found appear mainly in rRNA, while patterns found by an existing method do not seem to relate to specific biological functions. We also show that the discovered patterns have lengths close to the length at which the frequency distribution of fixed-length substrings begins to skew. This fact explains why our method can find long peculiar compositions.
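The definition in the abstract can be illustrated with a brute-force sketch. This is not the suffix-tree algorithm the paper develops, and it makes no claim about the paper's exact frequency measure or thresholds: the function names, the `max_len` bound on each part, and the normalization (occurrences per character of text) are all assumptions chosen for illustration only.

```python
from collections import Counter


def substring_counts(texts, max_len):
    """Count every substring of length 1..max_len across all texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    return counts


def relative_freq(counts, texts, s):
    """Occurrences of s per character of text (an assumed normalization)."""
    total = sum(len(t) for t in texts)
    return counts[s] / total if total else 0.0


def peculiar_compositions(target, background, max_len=3):
    """Return pairs (x, y) whose parts are more frequent in the background
    texts than in the target texts, while the composition x + y is more
    frequent in the target texts. Naive enumeration, for illustration."""
    t_counts = substring_counts(target, 2 * max_len)
    b_counts = substring_counts(background, 2 * max_len)

    def ft(s):
        return relative_freq(t_counts, target, s)

    def fb(s):
        return relative_freq(b_counts, background, s)

    found = set()
    for xy in t_counts:  # only compositions that occur in the target
        for k in range(1, len(xy)):  # every split point of xy
            x, y = xy[:k], xy[k:]
            if len(x) > max_len or len(y) > max_len:
                continue
            if fb(x) > ft(x) and fb(y) > ft(y) and ft(xy) > fb(xy):
                found.add((x, y))
    return sorted(found)


# Toy data (hypothetical): "a" and "b" are common in the background but
# never adjacent there, while the target contains "ab".
print(peculiar_compositions(["ccabcc"], ["acbacb"], max_len=2))
```

The enumeration visits every substring of the target up to length `2 * max_len`, so its cost grows quickly with text size; it is meant only to make the three frequency conditions concrete, not to reflect the near-linear scaling reported for the suffix-tree approach.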

Original language: English
Pages (from-to): 499-530
Number of pages: 32
Journal: Knowledge and Information Systems
Volume: 41
Issue number: 2
Publication status: Published - Oct 17 2014

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Hardware and Architecture
  • Artificial Intelligence
