Testbed for information extraction from deep web

Yasuhiro Yamada, Nick Craswell, Tetsuya Nakatoh, Sachio Hirokawa

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    30 Citations (Scopus)

    Abstract

    Search results generated by searchable databases are served dynamically and are far larger than the static documents on the Web. These results pages have been referred to as the Deep Web [1]. To integrate results from different searchable databases, we need to extract the target data from their results pages. We propose a testbed for information extraction from search results. We chose 100 databases randomly from 114,540 pages with search forms, so these databases have a good variety. We selected 51 databases whose results pages include URLs and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods and methods for extending the target data.

    Original language: English
    Title of host publication: Proceedings of the 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004
    Publisher: Association for Computing Machinery, Inc
    Pages: 346-347
    Number of pages: 2
    ISBN (Electronic): 1581139128, 9781581139129
    Publication status: Published - May 19 2004
    Event: 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004 - New York, United States
    Duration: May 19 2004 to May 21 2004

    Publication series

    Name: Proceedings of the 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004

    Other

    Other: 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004
    Country: United States
    City: New York
    Period: 5/19/04 to 5/21/04

    All Science Journal Classification (ASJC) codes

    • Computer Networks and Communications
    • Software
