Eliminating useless parts in semi-structured documents using alternation counts

Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To choose an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
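The idea in the abstract can be illustrated with a minimal Python sketch. It is an interpretation of the abstract only, not the authors' implementation: the function names, the use of character n-grams, and the choice to count total occurrences across all documents are assumptions. The abstract does not say how the alternation count is used to choose n and a, so the sketch only computes the count for one given pair.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram across all documents (assumed counting scheme)."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def useless_mask(doc, counts, n, a):
    """Mark each character covered by an n-gram that occurs at least `a` times."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= a:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless positions."""
    return sum(1 for prev, cur in zip(mask, mask[1:]) if prev != cur)


def extract_content(doc, mask):
    """Keep only the characters that were not marked useless."""
    return "".join(ch for ch, useless in zip(doc, mask) if not useless)


# Toy input: two documents sharing the same markup (hypothetical data).
docs = ["<h1>Rain</h1>", "<h1>Snow</h1>"]
counts = ngram_counts(docs, n=4)
mask = useless_mask(docs[0], counts, n=4, a=2)
print(alternation_count(mask))          # 2
print(extract_content(docs[0], mask))   # Rain
```

In this toy input the shared markup is marked useless and the headline text survives, with one useless run on each side of it, hence two alternations.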

Original language: English
Title of host publication: Discovery Science - 4th International Conference, DS 2001, Proceedings
Publisher: Springer Verlag
Pages: 113-127
Number of pages: 15
Volume: 2226
ISBN (Print): 9783540429562
Publication status: Published - 2001
Event: 4th International Conference on Discovery Science, DS 2001 - Washington, United States
Duration: Nov 25 2001 to Nov 28 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2226
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Fingerprint

Alternation
Count
N-gram
Web Mining
Statistic
Preprocessing
Eliminate
Statistics
Style

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
  • Theoretical Computer Science

Cite this

Ikeda, D., Yamada, Y., & Hirokawa, S. (2001). Eliminating useless parts in semi-structured documents using alternation counts. In Discovery Science - 4th International Conference, DS 2001, Proceedings (Vol. 2226, pp. 113-127). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2226). Springer Verlag.

@inproceedings{7f685f2a0d23463a8d03cecbd10ec781,
title = "Eliminating useless parts in semi-structured documents using alternation counts",
abstract = "We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To choose an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97{\%} accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95{\%} accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.",
author = "Daisuke Ikeda and Yasuhiro Yamada and Sachio Hirokawa",
year = "2001",
language = "English",
isbn = "9783540429562",
volume = "2226",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "113--127",
booktitle = "Discovery Science - 4th International Conference, DS 2001, Proceedings",
address = "Germany",

}

TY - GEN
T1 - Eliminating useless parts in semi-structured documents using alternation counts
AU - Ikeda, Daisuke
AU - Yamada, Yasuhiro
AU - Hirokawa, Sachio
PY - 2001
Y1 - 2001
N2 - We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To choose an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
UR - http://www.scopus.com/inward/record.url?scp=33745304536&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33745304536&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9783540429562
VL - 2226
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 113
EP - 127
BT - Discovery Science - 4th International Conference, DS 2001, Proceedings
PB - Springer Verlag
ER -