Application and improvement of the purity measure to texts

Research output: Contribution to journal, Article, peer-reviewed


The purity measure is a measure of unusualness for substrings of a given string. Although our previous work showed its usefulness for characterizing specific regions of genome sequences, how well the measure applies to text data, where many more symbols are used than in genome sequences, has not been examined in depth. In this paper, we investigate its usefulness on texts and show that the purity measure cannot differentiate the unusualness of substrings when the input string uses many symbols. We therefore propose an improved measure, called the atomicity measure, and show that it differentiates the unusualness of substrings better. Our experiment on alphabetic character sequences in texts shows that both measures distinguish word-like sequences from non-word sequences. Another experiment on word sequences (phrases), a case in which there are many symbols, shows that the atomicity measure gives high values to phrases such as proper nouns and low values to idiomatic phrases that might reflect the genres of texts, whereas the purity measure is not as informative for phrases. We conclude that the atomicity measure in particular can characterize texts well and is potentially useful in text mining.

Journal: Research Reports on Information Science and Electrical Engineering of Kyushu University
Publication status: Published - 1 2014

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
  • Electrical and Electronic Engineering

