Eliminating useless parts in semi-structured documents using alternation counts

Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

We propose a preprocessing method for Web mining that, given semi-structured documents sharing the same structure and style, separates the useless parts from the non-useless parts of each document without any prior knowledge of the documents. It is based on the simple idea that an n-gram is useless if it appears frequently. To choose an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams that make up the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
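The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the use of character n-grams, counting total occurrences across all input documents, marking every character covered by a frequent n-gram as useless, and the function names `ngram_counts`, `mark_useless`, `alternation_count`, and `extract` are all assumptions made for illustration. The abstract does not say how the alternation count is used to choose the pair (n, a), so the sketch only computes the count for one given pair.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram across all documents (assumed counting scheme)."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def mark_useless(doc, counts, n, a):
    """Return one flag per character: True if it lies inside an n-gram seen at least `a` times."""
    useless = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= a:
            for j in range(i, i + n):
                useless[j] = True
    return useless


def alternation_count(marks):
    """Number of switches between useless and non-useless runs in one document."""
    return sum(1 for prev, cur in zip(marks, marks[1:]) if prev != cur)


def extract(doc, marks):
    """Keep only the characters that were not marked useless."""
    return "".join(ch for ch, flag in zip(doc, marks) if not flag)


# Toy input: three documents that share the same markup around different content.
docs = ["<p>cat</p>", "<p>dog</p>", "<p>emu</p>"]
n, a = 3, 3  # hypothetical choice of n-gram length and frequency threshold

counts = ngram_counts(docs, n)
marks = mark_useless(docs[0], counts, n, a)

print(extract(docs[0], marks))   # cat
print(alternation_count(marks))  # 2
```

In this toy input, only the n-grams belonging to the shared `<p>` and `</p>` markup reach the threshold, so the first document splits into a useless run, a content run, and another useless run, giving two alternations.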

Original language: English
Title of host publication: Discovery Science - 4th International Conference, DS 2001, Proceedings
Editors: Klaus P. Jantke, Ayumi Shinohara
Publisher: Springer Verlag
Pages: 113-127
Number of pages: 15
ISBN (Print): 9783540429562
DOIs
Publication status: Published - 2001
Event: 4th International Conference on Discovery Science, DS 2001 - Washington, United States
Duration: Nov 25 2001 to Nov 28 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2226
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th International Conference on Discovery Science, DS 2001
Country: United States
City: Washington
Period: 11/25/01 to 11/28/01

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
