Mining pure patterns in texts

Yasuhiro Yamada, Tetsuya Nakatoh, Kensuke Baba, Daisuke Ikeda

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

We herein investigate finding unusual patterns from a given string as a text. In the present paper, the pattern is expressed as a substring of the string. The natural assumption with respect to the frequency of a pattern is that the shorter the length of the pattern, the larger the frequency of the pattern. We define a pattern to be pure if the frequencies of all of the substrings of the pattern are the same as the frequency of the pattern. This means that the substrings appear only within the pattern in the string. This condition is in contrast to the natural assumption. The present paper proposes three statistics for quantifying the purity of a pattern, i.e., probability, entropy, and difference, which are calculated based on the frequency of the pattern and its substrings. Experiments using DNA sequences reveal that patterns with large probability correspond to the features of the sequences.
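
The purity definition above can be illustrated with a minimal Python sketch. It assumes that "frequency" means the number of occurrences of a substring in the text, counting overlapping occurrences, which the abstract does not spell out. The function names `count_occurrences` and `is_pure` are hypothetical and are not taken from the paper. The three proposed statistics (probability, entropy, and difference) are not defined in the abstract, so they are not implemented here.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (assumed definition)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so that overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def is_pure(text: str, pattern: str) -> bool:
    """Return True if every non-empty substring of pattern has the same
    frequency in text as pattern itself (the purity condition in the abstract)."""
    freq = count_occurrences(text, pattern)
    if freq == 0:
        # Assumption of this sketch: a pattern that never occurs is not pure.
        return False
    n = len(pattern)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if count_occurrences(text, pattern[i:j]) != freq:
                return False
    return True
```

With this sketch, `is_pure("xabcyabcz", "abc")` holds because "a", "b", "c", "ab", and "bc" each occur exactly twice, as does "abc". In contrast, `is_pure("abcab", "abc")` fails because "a" occurs twice while "abc" occurs once. The brute-force loop over all substrings is chosen for readability, not efficiency.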

Original language: English
Title of host publication: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
Pages: 285-290
Number of pages: 6
DOIs: https://doi.org/10.1109/IIAI-AAI.2012.75
Publication status: Published - Dec 14 2012
Event: 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012 - Fukuoka, Japan
Duration: Sep 20 2012 → Sep 22 2012

Publication series

Name: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012

Other

Other: 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
Country: Japan
City: Fukuoka
Period: 9/20/12 → 9/22/12

Fingerprint

DNA sequences
Entropy
Statistics
Experiments

All Science Journal Classification (ASJC) codes

  • Information Systems

Cite this

Yamada, Y., Nakatoh, T., Baba, K., & Ikeda, D. (2012). Mining pure patterns in texts. In Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012 (pp. 285-290). [6337205] (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012). https://doi.org/10.1109/IIAI-AAI.2012.75

Mining pure patterns in texts. / Yamada, Yasuhiro; Nakatoh, Tetsuya; Baba, Kensuke; Ikeda, Daisuke. Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. p. 285-290 6337205 (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012).

Yamada, Y, Nakatoh, T, Baba, K & Ikeda, D 2012, Mining pure patterns in texts. in Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012., 6337205, Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012, pp. 285-290, 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012, Fukuoka, Japan, 9/20/12. https://doi.org/10.1109/IIAI-AAI.2012.75
Yamada Y, Nakatoh T, Baba K, Ikeda D. Mining pure patterns in texts. In Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. p. 285-290. 6337205. (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012). https://doi.org/10.1109/IIAI-AAI.2012.75
Yamada, Yasuhiro ; Nakatoh, Tetsuya ; Baba, Kensuke ; Ikeda, Daisuke. / Mining pure patterns in texts. Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. pp. 285-290 (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012).
@inproceedings{e2707ee5d8c64b69b28d3395a90f6b22,
title = "Mining pure patterns in texts",
abstract = "We herein investigate finding unusual patterns from a given string as a text. In the present paper, the pattern is expressed as a substring of the string. The natural assumption with respect to the frequency of a pattern is that the shorter the length of the pattern, the larger the frequency of the pattern. We define a pattern to be pure if the frequencies of all of the substrings of the pattern are the same as the frequency of the pattern. This means that the substrings appear only within the pattern in the string. This condition is in contrast to the natural assumption. The present paper proposes three statistics for quantifying the purity of a pattern, i.e., probability, entropy, and difference, which are calculated based on the frequency of the pattern and its substrings. Experiments using DNA sequences reveal that patterns with large probability correspond to the features of the sequences.",
author = "Yasuhiro Yamada and Tetsuya Nakatoh and Kensuke Baba and Daisuke Ikeda",
year = "2012",
month = "12",
day = "14",
doi = "10.1109/IIAI-AAI.2012.75",
language = "English",
isbn = "9780769548265",
series = "Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012",
pages = "285--290",
booktitle = "Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012",

}

TY - GEN

T1 - Mining pure patterns in texts

AU - Yamada, Yasuhiro

AU - Nakatoh, Tetsuya

AU - Baba, Kensuke

AU - Ikeda, Daisuke

PY - 2012/12/14

Y1 - 2012/12/14

AB - We herein investigate finding unusual patterns from a given string as a text. In the present paper, the pattern is expressed as a substring of the string. The natural assumption with respect to the frequency of a pattern is that the shorter the length of the pattern, the larger the frequency of the pattern. We define a pattern to be pure if the frequencies of all of the substrings of the pattern are the same as the frequency of the pattern. This means that the substrings appear only within the pattern in the string. This condition is in contrast to the natural assumption. The present paper proposes three statistics for quantifying the purity of a pattern, i.e., probability, entropy, and difference, which are calculated based on the frequency of the pattern and its substrings. Experiments using DNA sequences reveal that patterns with large probability correspond to the features of the sequences.

UR - http://www.scopus.com/inward/record.url?scp=84870847112&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84870847112&partnerID=8YFLogxK

U2 - 10.1109/IIAI-AAI.2012.75

DO - 10.1109/IIAI-AAI.2012.75

M3 - Conference contribution

AN - SCOPUS:84870847112

SN - 9780769548265

T3 - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012

SP - 285

EP - 290

BT - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012

ER -