Eliminating useless parts in semi-structured documents using alternation counts

Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

We propose a preprocessing method for Web mining that, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide on an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
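The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes character n-grams, total occurrence counts over all input documents, and a simple "at least a occurrences" threshold. All function names and the sample documents are hypothetical. The abstract does not say how the alternation count is used to select the pair (n, a), so the sketch only computes the count for one fixed pair.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram across all documents (assumed counting scheme)."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def useless_mask(doc, counts, n, a):
    """Mark each character covered by an n-gram that occurs at least `a` times."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= a:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless runs."""
    return sum(1 for prev, cur in zip(mask, mask[1:]) if prev != cur)


def kept_segments(doc, mask):
    """Return the maximal runs of characters that were not marked useless."""
    segments, current = [], []
    for ch, useless in zip(doc, mask):
        if useless:
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments


# Hypothetical pages sharing one template; only headline and body differ.
docs = [
    "<html><h1>Rain hits Tokyo</h1><p>Trains were delayed</p></html>",
    "<html><h1>Markets rally</h1><p>Stocks rose sharply</p></html>",
    "<html><h1>Quake drill held</h1><p>Schools took part</p></html>",
]
n, a = 4, 3  # one fixed (length, frequency) pair, chosen by hand here
counts = ngram_counts(docs, n)
masks = [useless_mask(d, counts, n, a) for d in docs]

print(kept_segments(docs[0], masks[0]))  # ['Rain hits Tokyo', 'Trains were delayed']
print(alternation_count(masks[0]))       # 4
```

In this toy input the template markup recurs in every page, so its 4-grams reach the threshold and are masked, while the headline and body text survive as separate segments.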

Original language: English
Title of host publication: Discovery Science - 4th International Conference, DS 2001, Proceedings
Editors: Klaus P. Jantke, Ayumi Shinohara
Publisher: Springer Verlag
Pages: 113-127
Number of pages: 15
ISBN (Print): 9783540429562
DOI
Publication status: Published - 2001
Event: 4th International Conference on Discovery Science, DS 2001 - Washington, United States
Duration: Nov 25, 2001 to Nov 28, 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2226
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th International Conference on Discovery Science, DS 2001
Country/Territory: United States
City: Washington
Period: 11/25/01 to 11/28/01

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
