SCOOP: A record extractor without knowledge on input

Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

1 Citation (Scopus)

Abstract

We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format and that each contains only one record consisting of several distinct fields. SCOOP treats a document as a plain string and uses no knowledge about the input except that each field is surrounded by delimiters, a left delimiter ends with “>”, and the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides each document into two parts: the contents of the fields and everything else. SCOOP then counts the substrings near the boundaries between the two parts and extracts the most frequent ones as delimiters. We report experimental results on news articles written in English or Japanese. In this experiment, a record consists of the headline and the body text. SCOOP extracts records at a high rate.
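The delimiter idea in the abstract can be illustrated with a minimal Python sketch. This is not SCOOP's actual algorithm, which the abstract does not specify in detail. The function names `left_candidates` and `extract_fields`, the fixed candidate width, and the rule "keep a candidate only if the text it opens is non-empty and differs in every document" are all assumptions made for illustration. The sketch relies only on the two facts stated above: a left delimiter ends with ">" and a right delimiter begins with "<".

```python
from collections import Counter


def left_candidates(doc, width):
    """Return the set of width-character substrings of doc that end with '>'."""
    return {doc[i + 1 - width:i + 1]
            for i, ch in enumerate(doc)
            if ch == ">" and i + 1 >= width}


def extract_fields(docs, width=4):
    """Guess left delimiters shared by all documents and the fields they open.

    Hypothetical simplification: a candidate counts as a delimiter when it
    occurs in every document and the text between it and the next '<' is
    non-empty and different in each document.
    """
    # Count in how many documents each candidate left delimiter occurs.
    freq = Counter()
    for doc in docs:
        freq.update(left_candidates(doc, width))
    shared = [cand for cand, n in freq.items() if n == len(docs)]

    fields = {}
    for cand in shared:
        values = []
        for doc in docs:
            # Use the first occurrence of the candidate in this document.
            start = doc.index(cand) + len(cand)
            # The right delimiter is assumed to begin at the next '<'.
            end = doc.find("<", start)
            if end == -1:
                end = len(doc)
            values.append(doc[start:end].strip())
        # Template text repeats across documents; field contents vary.
        if all(values) and len(set(values)) == len(values):
            fields[cand] = values
    return fields


# Toy documents, invented for this example.
docs = [
    "<html><h1>Rain expected</h1><p>Heavy rain is forecast.</p></html>",
    "<html><h1>Markets rise</h1><p>Stocks closed higher.</p></html>",
    "<html><h1>Team wins</h1><p>The final score was 3-1.</p></html>",
]
fields = extract_fields(docs)
for delimiter in sorted(fields):
    print(delimiter, fields[delimiter])
```

On these toy documents the sketch keeps the candidates "<h1>" and "><p>", which open the headline and the body text, and discards candidates such as "tml>" that are followed directly by another tag. A fixed width is a simplification; a frequency-based method would compare candidates of varying lengths.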

Original language: English
Title of host publication: Discovery Science - 4th International Conference, DS 2001, Proceedings
Editors: Klaus P. Jantke, Ayumi Shinohara
Publisher: Springer Verlag
Pages: 482-487
Number of pages: 6
ISBN (Print): 9783540429562
Publication status: Published - 2001
Event: 4th International Conference on Discovery Science, DS 2001 - Washington, United States
Duration: Nov 25 2001 to Nov 28 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2226
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th International Conference on Discovery Science, DS 2001
Country/Territory: United States
City: Washington
Period: 11/25/01 to 11/28/01

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
