The purity measure is an unusualness measure for substrings of a given string. Although we have shown its usefulness on characterization of specific regions of genome sequences in the previous work, it has not been examined deeply how well the measure can be applied to text data, where much more symbols are used than in genome sequences. In this paper, we investigate its usefulness on texts and also show that the purity measure cannot differentiate the unusualness of substrings when many symbols are used in an input string. Therefore, we propose an improved measure called atomicity measure and show it can differentiate the unusualness of substrings better. Our experiment on alphabet sequences in texts shows both the measures distinguish word-like sequences and non-word sequences. Another experiment on word sequences (phrases), which is the case that there are a lot of symbols, shows the atomicity measure gives high values to phrases such as proper nouns and low values to idiomatic phrases that might reflect genres of texts while the purity measure is not so suggestive on phrases. We conclude that especially the atomicity measure can characterize texts well, and it will potentially be useful in text mining.
|Number of pages||6|
|Journal||Research Reports on Information Science and Electrical Engineering of Kyushu University|
|Publication status||Published - Jan 2014|
All Science Journal Classification (ASJC) codes
- Computer Science(all)
- Electrical and Electronic Engineering