Information extraction from web pages using semi-structured data alignment

Tetsuji Kuboyama, Tetsuhiro Miyahara, Sachio Hirokawa, Eisuke Itou

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Information extraction from semistructured data such as HTML documents is gaining importance with the continuing growth of data stored on the Web. This paper proposes a structure-based method for extracting Web contents and their metadata from a set of HTML documents generated from a common template, such as syllabus and staff pages at universities. These HTML documents contain many HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted to extract attribute names for metadata construction and to align data records from university syllabus Web pages.
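
    The two steps named in the abstract (choosing a canonical document, then aligning the other documents against it) can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how the canonical document is chosen or which approximate matching algorithm is used. The sketch assumes that the canonical page is the one most similar to all the others, and it swaps in Python's `difflib.SequenceMatcher` over a flattened token sequence as the approximate matcher. The sample pages, the names `tokenize`, `pick_canonical` and `align`, and the one-value-per-label rule are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Flatten a page into ('tag', name) and ('text', content) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Only start tags are recorded, so a missing end tag changes nothing.
        self.tokens.append(("tag", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


def similarity(a, b):
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_canonical(docs):
    """Assumption: the canonical page is the one most similar to all others."""
    def total(i):
        return sum(similarity(docs[i], d) for j, d in enumerate(docs) if j != i)
    return max(range(len(docs)), key=total)


def align(canonical, other):
    """Map each shared label to the (canonical, other) values that follow it.

    Text identical in both pages is treated as an attribute name; text that
    differs is treated as a data value. Spans with more or fewer than one
    text token on either side are skipped as ambiguous, and two pages that
    happen to share a value would be misread as sharing a label.
    """
    matcher = SequenceMatcher(None, canonical, other, autojunk=False)
    record, label = {}, None
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            shared = [tok[1] for tok in canonical[i1:i2] if tok[0] == "text"]
            if shared:
                label = shared[-1]
        elif op == "replace" and label is not None:
            left = [tok[1] for tok in canonical[i1:i2] if tok[0] == "text"]
            right = [tok[1] for tok in other[j1:j2] if tok[0] == "text"]
            if len(left) == 1 and len(right) == 1:
                record[label] = (left[0], right[0])
    return record


# Made-up syllabus pages: one clean, one with a stray <br>, and one with a
# missing </tr> plus a redundant leading fragment.
pages = [
    "<table><tr><th>Course</th><td>Algebra</td></tr>"
    "<tr><th>Instructor</th><td>Sato</td></tr></table>",
    "<table><tr><th>Course</th><td>Logic<br></td></tr>"
    "<tr><th>Instructor</th><td>Tanaka</td></tr></table>",
    "<p>Updated</p><table><tr><th>Course</th><td>Physics</td>"
    "<tr><th>Instructor</th><td>Mori</td></tr></table>",
]
docs = [tokenize(page) for page in pages]
canonical_index = pick_canonical(docs)
records = [align(docs[canonical_index], doc)
           for i, doc in enumerate(docs) if i != canonical_index]
print(canonical_index, records)
```

    Matching on a flat token sequence discards the tree structure that a structure-based method would presumably exploit, so this sketch only conveys the overall flow: shared text anchors the alignment, differing text between anchors becomes the extracted data, and unmatched fragments are dropped.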

    Original language: English
    Title of host publication: WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
    Pages: 42-47
    Number of pages: 6
    Publication status: Published - Dec 1 2005
    Event: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI 2005 - Orlando, FL, United States
    Duration: Jul 10 2005 to Jul 13 2005

    Publication series

    Name: WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
    Volume: 1

    Other

    Other: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI 2005
    Country: United States
    City: Orlando, FL
    Period: 7/10/05 to 7/13/05

    All Science Journal Classification (ASJC) codes

    • Artificial Intelligence
    • Computer Networks and Communications
    • Information Systems
