Information extraction from web pages using semi-structured data alignment

Tetsuji Kuboyama, Tetsuhiro Miyahara, Sachio Hirokawa, Eisuke Itou

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Information extraction from semistructured data such as HTML documents is gaining importance with the steady growth of data stored on the Web. This paper proposes a structure-based method for extracting Web content and its metadata from a set of HTML documents generated from a common template, such as university syllabus and staff pages. These documents contain a number of HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted on university syllabus Web pages to extract attribute names for metadata construction and to align data records.
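
The two steps named in the abstract (pick a canonical document, then align the other documents against it by approximate matching) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the canonical document is chosen or which matching algorithm is used, so the sketch picks the page most similar to all others and uses Python's `difflib.SequenceMatcher` as a stand-in matcher. All function names, the rule that text repeated on every page is an attribute name, and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Flatten HTML into tokens; tolerates unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # ("tag", "<td>") or ("text", "Algebra")

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(html):
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def shape(tokens):
    """Structural key: tags are kept, every text node becomes '#text'."""
    return [value if kind == "tag" else "#text" for kind, value in tokens]


def pick_canonical(docs):
    """Assumption: the canonical page is the one most similar to all others."""
    shapes = [shape(d) for d in docs]

    def total_similarity(i):
        return sum(
            SequenceMatcher(None, shapes[i], shapes[j], autojunk=False).ratio()
            for j in range(len(shapes)) if j != i
        )

    return max(range(len(docs)), key=total_similarity)


def align(canonical, other):
    """Map each text slot of the canonical page to the matched text in `other`."""
    matcher = SequenceMatcher(None, shape(canonical), shape(other), autojunk=False)
    row = {i: None for i, (kind, _) in enumerate(canonical) if kind == "text"}
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            if canonical[a + k][0] == "text":
                row[a + k] = other[b + k][1]
    return row


def extract_records(pages):
    docs = [tokenize(p) for p in pages]
    canonical = docs[pick_canonical(docs)]
    rows = [align(canonical, d) for d in docs]
    slots = sorted(rows[0])
    # Heuristic (an assumption): text identical on every page is an attribute name.
    labels = {s for s in slots
              if rows[0][s] is not None and len({r[s] for r in rows}) == 1}
    records = []
    for row in rows:
        record, name = {}, None
        for s in slots:
            if s in labels:
                name = row[s]
            elif name is not None:
                record[name] = row[s]  # value filed under the preceding label
        records.append(record)
    return records


# Hypothetical syllabus pages from one template; two carry manual-editing damage.
PAGES = [
    "<table><tr><th>Course</th><td>Algebra</td></tr>"
    "<tr><th>Credits</th><td>2</td></tr></table>",
    "<table><tr><th>Course</th><td>Logic</td></tr>"
    "<tr><th>Credits</th><td>4</td></table>",            # missing </tr>
    "<table><tr><th>Course</th><td>Physics</td></tr><br>"
    "<tr><th>Credits</th><td>3</td></tr></table>",       # stray <br>
]

if __name__ == "__main__":
    for record in extract_records(PAGES):
        print(record)
```

Matching on tag structure alone, with text nodes collapsed to a placeholder, keeps differing field values from disturbing the alignment, so a missing close tag or a stray element shifts only the tokens around it.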

Original language: English
Title of host publication: WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
Pages: 42-47
Number of pages: 6
Publication status: Published - Dec 1 2005
Event: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI 2005 - Orlando, FL, United States
Duration: Jul 10 2005 to Jul 13 2005

Publication series

Name: WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
Volume: 1

Other

Other: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI 2005
Country: United States
City: Orlando, FL
Period: 7/10/05 to 7/13/05

Fingerprint

HTML
Websites
Metadata
Data storage equipment
Experiments

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Networks and Communications
  • Information Systems

Cite this

Kuboyama, T., Miyahara, T., Hirokawa, S., & Itou, E. (2005). Information extraction from web pages using semi-structured data alignment. In WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings (pp. 42-47). (WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings; Vol. 1).

@inproceedings{12d19d23220a41ccad4be0889e4669dc,
title = "Information extraction from web pages using semi-structured data alignment",
abstract = "Information extraction from semistructured data such as HTML documents is gaining importance with the steady growth of data stored on the Web. This paper proposes a structure-based method for extracting Web content and its metadata from a set of HTML documents generated from a common template, such as university syllabus and staff pages. These documents contain a number of HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted on university syllabus Web pages to extract attribute names for metadata construction and to align data records.",
author = "Tetsuji Kuboyama and Tetsuhiro Miyahara and Sachio Hirokawa and Eisuke Itou",
year = "2005",
month = "12",
day = "1",
language = "English",
isbn = "9806560531",
series = "WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings",
pages = "42--47",
booktitle = "WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings",
}

TY  - GEN
T1  - Information extraction from web pages using semi-structured data alignment
AU  - Kuboyama, Tetsuji
AU  - Miyahara, Tetsuhiro
AU  - Hirokawa, Sachio
AU  - Itou, Eisuke
PY  - 2005/12/1
Y1  - 2005/12/1
AB  - Information extraction from semistructured data such as HTML documents is gaining importance with the steady growth of data stored on the Web. This paper proposes a structure-based method for extracting Web content and its metadata from a set of HTML documents generated from a common template, such as university syllabus and staff pages. These documents contain a number of HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted on university syllabus Web pages to extract attribute names for metadata construction and to align data records.
UR  - http://www.scopus.com/inward/record.url?scp=84867368995&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84867368995&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:84867368995
SN  - 9806560531
SN  - 9789806560536
T3  - WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
SP  - 42
EP  - 47
BT  - WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
ER  -