LTDE: A layout tree based approach for deep page data extraction

Jun Zeng, Feng Li, Brendan Flanagan, Sachio Hirokawa

Research output: Contribution to journal, Article

Abstract

Content extraction from deep Web pages has received considerable attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it difficult to recognize data records by analyzing the HTML source code alone. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE uses the visual features of data records in deep Web pages. A Web page is treated as a finite set of visual blocks, and the data records are the visual blocks that share a similar layout. We also propose a pattern recognition method, named layout tree, to cluster visual blocks with similar layouts. A weight is calculated for every cluster, and the visual blocks in the cluster with the highest weight are chosen as the data records to be extracted. Experimental results show that LTDE is more effective and more robust for Web data extraction than previous work.
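The pipeline described in the abstract (visual blocks, layout-based clustering, cluster weighting, selection of the heaviest cluster) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the abstract does not define the layout tree or the weight formula, so the `VisualBlock` class, the exact-match `layout_signature`, and the area-based `cluster_weight` are hypothetical stand-ins, not the authors' method.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VisualBlock:
    """A rendered rectangle on the page (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int
    children: list[VisualBlock] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def layout_signature(block: VisualBlock) -> tuple:
    """Assumed stand-in for layout similarity: the block's size plus the
    relative position and size of each child, in reading order."""
    kids = sorted(block.children, key=lambda c: (c.y, c.x))
    return (block.width, block.height) + tuple(
        (c.x - block.x, c.y - block.y, c.width, c.height) for c in kids
    )


def cluster_by_layout(blocks: list[VisualBlock]) -> list[list[VisualBlock]]:
    """Group blocks whose signatures match exactly (a strict simplification)."""
    clusters = defaultdict(list)
    for block in blocks:
        clusters[layout_signature(block)].append(block)
    return list(clusters.values())


def cluster_weight(cluster: list[VisualBlock]) -> int:
    """Assumed weight: total area covered, counted only for repeated layouts."""
    if len(cluster) < 2:
        return 0
    return sum(block.area for block in cluster)


def extract_records(blocks: list[VisualBlock]) -> list[VisualBlock]:
    """Return the blocks of the highest-weight cluster, or [] for an empty page."""
    return max(cluster_by_layout(blocks), key=cluster_weight, default=[])


def row(y: int) -> VisualBlock:
    """A made-up result row: a thumbnail next to a text area."""
    return VisualBlock(0, y, 600, 100, [
        VisualBlock(10, y + 10, 80, 80),
        VisualBlock(100, y + 10, 480, 80),
    ])


# A made-up page: header, three result rows, footer.
page = [VisualBlock(0, 0, 600, 50), row(50), row(150), row(250),
        VisualBlock(0, 350, 600, 40)]
records = extract_records(page)
```

In this toy page the three rows share one signature and form the only repeated cluster, so they are selected while the header and footer are ignored. A real system would need a tolerance-based similarity measure in place of exact signature matching.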

Original language: English
Pages (from-to): 1067-1078
Number of pages: 12
Journal: IEICE Transactions on Information and Systems
Volume: E100D
Issue number: 5
DOI: 10.1587/transinf.2016EDP7375
Publication status: Published - May 2017

Fingerprint

HTML
Websites
World Wide Web
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence

Cite this

LTDE: A layout tree based approach for deep page data extraction. / Zeng, Jun; Li, Feng; Flanagan, Brendan; Hirokawa, Sachio.

In: IEICE Transactions on Information and Systems, Vol. E100D, No. 5, 05.2017, p. 1067-1078.

@article{d06051d959a04841bfefb0d8b0fb3708,
title = "LTDE: A layout tree based approach for deep page data extraction",
abstract = "Content extraction from deep Web pages has received great attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it more difficult to recognize the data records by only analyzing the HTML source code. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE utilizes the visual features of data records in deep Web pages. A Web page is considered as a finite set of visual blocks. The data records are the visual blocks that have similar layout. We also propose a pattern recognizing method named layout tree to cluster the similar layout visual blocks. The weight of all clusters is calculated, and the visual blocks in the cluster that has the highest weight are chosen as the data records to be extracted. The experiment results show that LTDE has higher effectiveness and better robustness for Web data extraction compared to previous works.",
author = "Jun Zeng and Feng Li and Brendan Flanagan and Sachio Hirokawa",
year = "2017",
month = "5",
doi = "10.1587/transinf.2016EDP7375",
language = "English",
volume = "E100D",
pages = "1067--1078",
journal = "IEICE Transactions on Information and Systems",
issn = "0916-8532",
publisher = "一般社団法人電子情報通信学会",
number = "5",
}
