SCOOP: A record extractor without knowledge on input

Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format and that each contains only one record consisting of several distinct fields. SCOOP treats a document as a plain string and uses no knowledge about the input except that each field is surrounded by delimiters, a left delimiter ends with “>”, and the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides the documents into two parts: the contents of the fields and everything else. It then counts the substrings near the boundaries between the two parts and extracts the most frequent ones as delimiters. We report experimental results on news articles written in English or Japanese, where a record consists of the headline and the body text. SCOOP extracts records at a high rate.
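The idea in the abstract can be illustrated with a small Python sketch. This is not SCOOP's actual algorithm, only a simplified approximation under stated assumptions: candidate delimiters have a fixed width (the hypothetical `width` parameter), a candidate counts as frequent when it occurs in every document, and a field is any span after a shared left delimiter whose content is non-empty and differs between documents. All function names and the sample documents are invented for illustration.

```python
from collections import Counter


def candidate_delimiters(doc, width):
    """Collect fixed-width substrings that end with '>' (left candidates)
    or begin with '<' (right candidates). The fixed width is an assumption."""
    lefts, rights = set(), set()
    for i, ch in enumerate(doc):
        if ch == ">" and i + 1 >= width:
            lefts.add(doc[i + 1 - width : i + 1])
        if ch == "<" and i + width <= len(doc):
            rights.add(doc[i : i + width])
    return lefts, rights


def frequent_delimiters(docs, width=4):
    """Keep the candidates that occur in every document (document frequency)."""
    left_counts, right_counts = Counter(), Counter()
    for doc in docs:
        lefts, rights = candidate_delimiters(doc, width)
        left_counts.update(lefts)
        right_counts.update(rights)
    n = len(docs)
    shared_lefts = sorted(s for s, c in left_counts.items() if c == n)
    shared_rights = sorted(s for s, c in right_counts.items() if c == n)
    return shared_lefts, shared_rights


def extract_fields(docs, width=4):
    """Treat text after a shared left delimiter, up to the next '<', as a field
    when it is non-empty in every document and differs between documents."""
    lefts, _ = frequent_delimiters(docs, width)
    fields = {}
    for left in lefts:
        values = []
        for doc in docs:
            start = doc.find(left) + len(left)  # left occurs in every doc
            end = doc.find("<", start)
            values.append(doc[start:end] if end >= 0 else doc[start:])
        if all(values) and len(set(values)) > 1:
            fields[left] = values
    return fields


# Two hypothetical news pages that share one template.
docs = [
    "<html><h1>Rain expected</h1><p>Heavy rain is forecast.</p></html>",
    "<html><h1>Markets rise</h1><p>Stocks closed higher.</p></html>",
]
fields = extract_fields(docs)
for left, values in fields.items():
    print(repr(left), values)
```

Spans that are identical in every document are treated as template text rather than field content, which loosely mirrors the split between "contents of the fields and others" described above.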

Original language: English
Title of host publication: Discovery Science - 4th International Conference, DS 2001, Proceedings
Editors: Klaus P. Jantke, Ayumi Shinohara
Publisher: Springer Verlag
Pages: 482-487
Number of pages: 6
ISBN (Print): 9783540429562
DOI
Publication status: Published - 2001
Event: 4th International Conference on Discovery Science, DS 2001 - Washington, United States
Duration: Nov 25, 2001 to Nov 28, 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2226
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th International Conference on Discovery Science, DS 2001
Country/Territory: United States
City: Washington
Period: 11/25/01 to 11/28/01

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
