LTDE: A layout tree based approach for deep page data extraction

Jun Zeng, Feng Li, Brendan Flanagan, Sachio Hirokawa

    Research output: Contribution to journal, Article, peer-reviewed

    1 citation (Scopus)

    Abstract

    Content extraction from deep Web pages has received great attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it difficult to recognize data records by analyzing the HTML source code alone. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE utilizes the visual features of data records in deep Web pages. A Web page is considered as a finite set of visual blocks, and the data records are the visual blocks that have a similar layout. We also propose a pattern recognition method named layout tree to cluster visual blocks with similar layouts. A weight is calculated for each cluster, and the visual blocks in the cluster with the highest weight are chosen as the data records to be extracted. The experimental results show that LTDE is more effective and more robust for Web data extraction than previous work.
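The pipeline the abstract describes (visual blocks, clustering by similar layout, cluster weights, selecting the heaviest cluster) can be illustrated with a minimal sketch. The abstract does not define the layout tree or the weight function, so everything below is an assumption made for illustration: the `Block` class, the `layout_signature` descriptor (exact-match on size and relative child offsets, standing in for the paper's layout tree), and the area-based weight are all hypothetical and are not the authors' method.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Block:
    """A rendered visual block: position, size, and nested child blocks."""
    x: int
    y: int
    width: int
    height: int
    children: list[Block] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def layout_signature(block: Block) -> tuple:
    """Hypothetical layout descriptor (not the paper's layout tree):
    the block's size plus, recursively, each child's offset relative
    to the block. Equal signatures are treated as the same layout."""
    kids = tuple(
        (c.x - block.x, c.y - block.y, layout_signature(c))
        for c in sorted(block.children, key=lambda c: (c.y, c.x))
    )
    return (block.width, block.height, kids)


def extract_records(blocks: list[Block]) -> list[Block]:
    """Cluster blocks by layout signature and return the heaviest cluster."""
    clusters: dict[tuple, list[Block]] = defaultdict(list)
    for b in blocks:
        clusters[layout_signature(b)].append(b)

    # Assumed weight: total area of the cluster, counted only if the
    # layout repeats (a single block is unlikely to be a data record).
    def weight(cluster: list[Block]) -> int:
        return sum(b.area for b in cluster) if len(cluster) > 1 else 0

    return max(clusters.values(), key=weight, default=[])


# Toy page: one header block followed by three identically laid-out records.
def make_record(top: int) -> Block:
    return Block(0, top, 800, 120, children=[
        Block(10, top + 10, 100, 100),   # e.g. a thumbnail
        Block(120, top + 10, 600, 40),   # e.g. a title line
    ])


page = [Block(0, 0, 800, 100)] + [make_record(100 + i * 120) for i in range(3)]
records = extract_records(page)
```

On this toy page the three record blocks share one signature and form the heaviest cluster, while the header falls into a singleton cluster with weight zero. A realistic implementation would need a tolerance-based similarity measure instead of exact signature equality, since rendered records rarely match pixel for pixel.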

    Original language: English
    Pages (from-to): 1067-1078
    Number of pages: 12
    Journal: IEICE Transactions on Information and Systems
    Volume: E100D
    Issue number: 5
    Publication status: Published - May 2017

    All Science Journal Classification (ASJC) codes

    • Software
    • Hardware and Architecture
    • Computer Vision and Pattern Recognition
    • Electrical and Electronic Engineering
    • Artificial Intelligence
