Abstract
We investigate the problem of finding unusual patterns in a text given as a string. In this paper, a pattern is a substring of that string. The natural assumption about frequency is that the shorter a pattern is, the higher its frequency. We define a pattern to be pure if every substring of the pattern has the same frequency as the pattern itself, which means that those substrings appear in the string only within occurrences of the pattern. This condition runs counter to the natural assumption. We propose three statistics for quantifying the purity of a pattern (probability, entropy, and difference), each calculated from the frequencies of the pattern and its substrings. Experiments on DNA sequences show that patterns with large probability correspond to features of the sequences.
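The purity condition stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `count_occurrences` and `is_pure` are hypothetical, frequency is assumed to mean the number of (possibly overlapping) occurrences in the string, and the empty substring is ignored. The three proposed statistics are not defined in the abstract, so they are not sketched here.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping occurrences are counted.
        start = text.find(pattern, start + 1)
    return count


def is_pure(text: str, pattern: str) -> bool:
    """Return True if every non-empty substring of pattern has the same
    frequency in text as pattern itself (the abstract's definition, read literally)."""
    freq = count_occurrences(text, pattern)
    if freq == 0:
        return False  # a pattern that never occurs is treated as not pure
    n = len(pattern)
    return all(
        count_occurrences(text, pattern[i:j]) == freq
        for i in range(n)
        for j in range(i + 1, n + 1)
    )


if __name__ == "__main__":
    print(is_pure("xabcyabcz", "abc"))   # "a", "b", "c", "ab", "bc" occur only inside "abc"
    print(is_pure("xabcyabczb", "abc"))  # the extra "b" occurs outside "abc"
```

This brute-force check compares every substring directly and is meant only to make the definition concrete; an index structure such as a suffix array would be the usual choice for long inputs like DNA sequences.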
Original language | English |
---|---|
Title of host publication | Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012 |
Pages | 285-290 |
Number of pages | 6 |
DOIs | 10.1109/IIAI-AAI.2012.75 |
Publication status | Published - Dec 14 2012 |
Event | 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012 - Fukuoka, Japan Duration: Sep 20 2012 → Sep 22 2012 |
All Science Journal Classification (ASJC) codes
- Information Systems
Cite this
Mining pure patterns in texts. / Yamada, Yasuhiro; Nakatoh, Tetsuya; Baba, Kensuke; Ikeda, Daisuke.
Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. p. 285-290 6337205 (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Mining pure patterns in texts
AU - Yamada, Yasuhiro
AU - Nakatoh, Tetsuya
AU - Baba, Kensuke
AU - Ikeda, Daisuke
PY - 2012/12/14
Y1 - 2012/12/14
UR - http://www.scopus.com/inward/record.url?scp=84870847112&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870847112&partnerID=8YFLogxK
U2 - 10.1109/IIAI-AAI.2012.75
DO - 10.1109/IIAI-AAI.2012.75
M3 - Conference contribution
AN - SCOPUS:84870847112
SN - 9780769548265
T3 - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
SP - 285
EP - 290
BT - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
ER -