Extraction of relevant components using shallow structure of HTML documents

Jun Zeng, Brendan Flanagan, Toshihiko Sakai, Sachio Hirokawa

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    1 Citation (Scopus)

    Abstract

    As the number of web pages increases, searching semi-structured documents is gaining greater attention. The traditional approach to extracting data from web page documents is to write specialized programs, called wrappers, that identify data of interest and map them to some suitable format. However, developing wrappers manually has many well-known shortcomings, mainly due to the difficulty of writing and maintaining them for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using the shallow features of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness by conducting an SVM learning experiment.

    Original language: English
    Title of host publication: Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
    Pages: 1186-1190
    Number of pages: 5
    Publication status: Published - 2012
    Event: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012 - Chongqing, China
    Duration: May 29 2012 → May 31 2012

    Other

    Other: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
    Country: China
    City: Chongqing
    Period: 5/29/12 → 5/31/12

    All Science Journal Classification (ASJC) codes

    • Control and Optimization
    • Logic
