SCOOP: A record extractor without knowledge on input

Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We present SCOOP, a record extraction system. We assume that the semi-structured documents given to SCOOP share a similar format and that each contains exactly one record consisting of several distinct fields. SCOOP treats a document as a plain string and uses no knowledge about the input except that each field is surrounded by delimiters, that a left delimiter ends with “>”, and that the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides each document into two parts: the contents of the fields and everything else. It then counts the substrings near the boundaries between the two parts and extracts the most frequent ones as delimiters. We report experimental results on news articles written in English or Japanese, where a record consists of the headline and the body text. SCOOP extracts records at a high rate.
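
The abstract gives only an outline of the method, so the following Python sketch is an illustration under stated assumptions rather than the authors' algorithm. It assumes that "counting substrings" can be approximated by marking fixed-length n-grams that occur in every document as non-field text, and that a delimiter is the n-character window at a boundary where that text meets field content. The function names, the window length `n=4`, the "at least half of the documents" threshold, and the sample documents are all hypothetical.

```python
from collections import Counter


def common_ngrams(docs, n):
    """n-grams present in every document (assumed stand-in for non-field text)."""
    counts = Counter()
    for doc in docs:
        counts.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return {gram for gram, c in counts.items() if c == len(docs)}


def template_mask(doc, common, n):
    """mask[i] is True when character i lies inside some common n-gram."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if doc[i:i + n] in common:
            mask[i:i + n] = [True] * n
    return mask


def guess_delimiters(docs, n=4):
    """Count n-character windows at part boundaries; keep the frequent ones."""
    common = common_ngrams(docs, n)
    left, right = Counter(), Counter()
    for doc in docs:
        mask = template_mask(doc, common, n)
        for i in range(1, len(doc)):
            # Non-field text ends with ">" and field content starts here.
            if mask[i - 1] and not mask[i] and doc[i - 1] == ">":
                left[doc[max(0, i - n):i]] += 1
            # Field content ends and non-field text starts with "<".
            elif not mask[i - 1] and mask[i] and doc[i] == "<":
                right[doc[i:i + n]] += 1

    def frequent(counter):
        # Hypothetical threshold: seen at least once per two documents.
        return [s for s, c in counter.items() if 2 * c >= len(docs)]

    return frequent(left), frequent(right)


def extract_record(doc, lefts, rights):
    """Return the text between each left delimiter and the next right delimiter."""
    spans = []
    for left in lefts:
        at = doc.find(left)
        if at < 0:
            continue
        start = at + len(left)
        ends = [e for e in (doc.find(r, start) for r in rights) if e >= 0]
        if ends:
            spans.append((start, min(ends)))
    return [doc[s:e] for s, e in sorted(spans)]


# Made-up sample input: three documents sharing one format, one record each.
DOCS = [
    "<html><h1>Rain expected</h1><p>Storms arrive Monday</p></html>",
    "<html><h1>Markets rally</h1><p>Prices rose sharply</p></html>",
    "<html><h1>Team wins cup</h1><p>Fans celebrate downtown</p></html>",
]
lefts, rights = guess_delimiters(DOCS)
record = extract_record(DOCS[0], lefts, rights)
print(record)  # ['Rain expected', 'Storms arrive Monday']
```

The fixed window length is the main simplification here: it yields delimiters such as `><p>` that are cut at an arbitrary point, and it can misclassify field text that happens to repeat across all documents (for example, a shared trailing period).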

Original language: English
Title of host publication: Discovery Science - 4th International Conference, DS 2001, Proceedings
Editors: Klaus P. Jantke, Ayumi Shinohara
Publisher: Springer Verlag
Pages: 482-487
Number of pages: 6
ISBN (Print): 9783540429562
Publication status: Published - Jan 1 2001
Event: 4th International Conference on Discovery Science, DS 2001 - Washington, United States
Duration: Nov 25 2001 to Nov 28 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2226
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th International Conference on Discovery Science, DS 2001
Country: United States
City: Washington
Period: 11/25/01 to 11/28/01

Fingerprint

Extractor
Experiments
Divides
Counting
Count
Strings
Experimental Results
Experiment
Knowledge

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Yamada, Y., Ikeda, D., & Hirokawa, S. (2001). SCOOP: A record extractor without knowledge on input. In K. P. Jantke, & A. Shinohara (Eds.), Discovery Science - 4th International Conference, DS 2001, Proceedings (pp. 482-487). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2226). Springer Verlag.

@inproceedings{1b0551bc2233435b953a525515a151fb,
title = "SCOOP: A record extractor without knowledge on input",
abstract = "We present a record extractor system SCOOP. We assume that semi-structured documents given to SCOOP contain similar formats and each of them has only a record consisting of some different fields. SCOOP treats a document as just a string and does not use knowledge on input except that a field is surrounded with delimiters, a left delimiter ends with “>”, and the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides into two parts: contents of the fields and others. SCOOP counts substrings near boundaries of two parts and extracts the most frequent substrings as delimiters. We show experimental results with news articles written in English or Japanese. A record consists of the headline and the body text on this experiment. SCOOP extracts records at a high rate.",
author = "Yasuhiro Yamada and Daisuke Ikeda and Sachio Hirokawa",
year = "2001",
month = "1",
day = "1",
language = "English",
isbn = "9783540429562",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "482--487",
editor = "Jantke, {Klaus P.} and Ayumi Shinohara",
booktitle = "Discovery Science - 4th International Conference, DS 2001, Proceedings",
address = "Germany",

}

TY - GEN

T1 - SCOOP

T2 - A record extractor without knowledge on input

AU - Yamada, Yasuhiro

AU - Ikeda, Daisuke

AU - Hirokawa, Sachio

PY - 2001/1/1

Y1 - 2001/1/1

N2 - We present a record extractor system SCOOP. We assume that semi-structured documents given to SCOOP contain similar formats and each of them has only a record consisting of some different fields. SCOOP treats a document as just a string and does not use knowledge on input except that a field is surrounded with delimiters, a left delimiter ends with “>”, and the corresponding right delimiter begins with “<”. By counting substrings, SCOOP roughly divides into two parts: contents of the fields and others. SCOOP counts substrings near boundaries of two parts and extracts the most frequent substrings as delimiters. We show experimental results with news articles written in English or Japanese. A record consists of the headline and the body text on this experiment. SCOOP extracts records at a high rate.

UR - http://www.scopus.com/inward/record.url?scp=84943266730&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84943266730&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84943266730

SN - 9783540429562

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 482

EP - 487

BT - Discovery Science - 4th International Conference, DS 2001, Proceedings

A2 - Jantke, Klaus P.

A2 - Shinohara, Ayumi

PB - Springer Verlag

ER -