TY - GEN
T1 - Eliminating useless parts in semi-structured documents using alternation counts
AU - Ikeda, Daisuke
AU - Yamada, Yasuhiro
AU - Hirokawa, Sachio
N1 - Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 2001.
PY - 2001
Y1 - 2001
N2 - We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that an n-gram is useless if it appears frequently. To decide an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count, which is the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese together with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of the articles with at least 95% accuracy. Thus, the algorithm is language-independent, robust against noise, and applicable to multiple formats.
AB - We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that an n-gram is useless if it appears frequently. To decide an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count, which is the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese together with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of the articles with at least 95% accuracy. Thus, the algorithm is language-independent, robust against noise, and applicable to multiple formats.
UR - http://www.scopus.com/inward/record.url?scp=33745304536&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33745304536&partnerID=8YFLogxK
U2 - 10.1007/3-540-45650-3_13
DO - 10.1007/3-540-45650-3_13
M3 - Conference contribution
AN - SCOPUS:33745304536
SN - 9783540429562
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 113
EP - 127
BT - Discovery Science - 4th International Conference, DS 2001, Proceedings
A2 - Jantke, Klaus P.
A2 - Shinohara, Ayumi
PB - Springer Verlag
T2 - 4th International Conference on Discovery Science, DS 2001
Y2 - 25 November 2001 through 28 November 2001
ER -