Layout-tree-based approach for identifying visually similar blocks in a web page

Jun Zeng, Brendan Flanagan, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

When extracting information from a web page, information extraction (IE) systems usually need to perform pattern recognition to identify elements that have similar patterns. However, most of them are based mainly on analyzing the HTML source code, DOM tree, tag tree, or XPath of web pages. These methods are language-dependent or, more precisely, HTML-dependent, and they have some insuperable limitations. To overcome these limitations, we propose the notion of a layout tree and a pattern recognition method that uses layout trees to identify visual blocks with similar visual patterns. In this paper, we call a visible rectangular region in a web page a visual block, or block for short. If the elements of two blocks are displayed in a similar layout, we define the two blocks to be visually similar. We first transform the layout of each block into a layout tree. By calculating the similarity of the layout trees of two blocks, we can determine whether the two blocks are visually similar. The experimental results show that the layout tree is an effective means of identifying visually similar blocks.
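
The abstract does not say how the layout tree is constructed or how tree similarity is scored, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes each element of a block is given as a rendered rectangle (x, y, width, height), groups elements into rows by vertical overlap, and scores two trees by position-wise child matching. The names LayoutNode, build_layout_tree, and similarity are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed input format: rendered bounding box (x, y, width, height).
Rect = Tuple[float, float, float, float]


@dataclass
class LayoutNode:
    label: str  # "block", "row", or "leaf" (hypothetical labels)
    children: List["LayoutNode"] = field(default_factory=list)


def build_layout_tree(elements: List[Rect]) -> LayoutNode:
    """Build block -> rows -> leaves, grouping rectangles that overlap vertically."""
    rows: List[Tuple[float, List[Rect]]] = []  # (bottom edge of row, members)
    for rect in sorted(elements, key=lambda r: (r[1], r[0])):
        _, y, _, h = rect
        if rows and y < rows[-1][0]:
            # Starts above the current row's bottom edge: same row.
            bottom, members = rows[-1]
            members.append(rect)
            rows[-1] = (max(bottom, y + h), members)
        else:
            rows.append((y + h, [rect]))

    root = LayoutNode("block")
    for _, members in rows:
        row = LayoutNode("row")
        row.children = [LayoutNode("leaf") for _ in members]
        root.children.append(row)
    return root


def similarity(a: LayoutNode, b: LayoutNode) -> float:
    """Position-wise structural similarity in [0, 1]; 1.0 means identical shape."""
    if a.label != b.label:
        return 0.0
    if not a.children and not b.children:
        return 1.0
    longest = max(len(a.children), len(b.children))
    paired = sum(similarity(x, y) for x, y in zip(a.children, b.children))
    return paired / longest


# Two blocks with the same arrangement at different positions and sizes,
# and a third block whose elements are simply stacked vertically.
block_a = [(0, 0, 50, 50), (60, 0, 100, 20), (0, 60, 160, 30)]
block_b = [(200, 300, 40, 40), (250, 300, 120, 18), (200, 350, 170, 25)]
block_c = [(0, 0, 100, 20), (0, 30, 100, 20), (0, 60, 100, 20)]

print(similarity(build_layout_tree(block_a), build_layout_tree(block_b)))
print(similarity(build_layout_tree(block_a), build_layout_tree(block_c)))
```

Because the sketch works only on rendered rectangles, it never inspects tags or source markup, which is the property the abstract contrasts with HTML-dependent methods. Deciding that two blocks are visually similar would still require a similarity threshold, which the abstract does not specify.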

Original language: English
Title of host publication: 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings
Pages: 65-70
Number of pages: 6
DOIs: https://doi.org/10.1109/ICIS.2013.6607818
Publication status: Published - Oct 31 2013
Event: 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Niigata, Japan
Duration: Jun 16 2013 → Jun 20 2013

Publication series

Name: 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings

Other

Other: 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013
Country: Japan
City: Niigata
Period: 6/16/13 → 6/20/13

Fingerprint

Websites
Pattern recognition
HTML
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Information Systems

Cite this

Zeng, J., Flanagan, B., & Hirokawa, S. (2013). Layout-tree-based approach for identifying visually similar blocks in a web page. In 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings (pp. 65-70). [6607818] (2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings). https://doi.org/10.1109/ICIS.2013.6607818

Layout-tree-based approach for identifying visually similar blocks in a web page. / Zeng, Jun; Flanagan, Brendan; Hirokawa, Sachio.

2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings. 2013. p. 65-70 6607818 (2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings).

Zeng, J, Flanagan, B & Hirokawa, S 2013, Layout-tree-based approach for identifying visually similar blocks in a web page. in 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings., 6607818, 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings, pp. 65-70, 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013, Niigata, Japan, 6/16/13. https://doi.org/10.1109/ICIS.2013.6607818
Zeng J, Flanagan B, Hirokawa S. Layout-tree-based approach for identifying visually similar blocks in a web page. In 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings. 2013. p. 65-70. 6607818. (2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings). https://doi.org/10.1109/ICIS.2013.6607818
Zeng, Jun ; Flanagan, Brendan ; Hirokawa, Sachio. / Layout-tree-based approach for identifying visually similar blocks in a web page. 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings. 2013. pp. 65-70 (2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings).
@inproceedings{6a37ce302dbd4c3aa9f12f618631f171,
title = "Layout-tree-based approach for identifying visually similar blocks in a web page",
abstract = "When extracting information from a web page, information extraction (IE) systems usually need to perform pattern recognition to identify elements that have similar patterns. However, most of them are based mainly on analyzing the HTML source code, DOM tree, tag tree, or XPath of web pages. These methods are language-dependent or, more precisely, HTML-dependent, and they have some insuperable limitations. To overcome these limitations, we propose the notion of a layout tree and a pattern recognition method that uses layout trees to identify visual blocks with similar visual patterns. In this paper, we call a visible rectangular region in a web page a visual block, or block for short. If the elements of two blocks are displayed in a similar layout, we define the two blocks to be visually similar. We first transform the layout of each block into a layout tree. By calculating the similarity of the layout trees of two blocks, we can determine whether the two blocks are visually similar. The experimental results show that the layout tree is an effective means of identifying visually similar blocks.",
author = "Jun Zeng and Brendan Flanagan and Sachio Hirokawa",
year = "2013",
month = "10",
day = "31",
doi = "10.1109/ICIS.2013.6607818",
language = "English",
isbn = "9781479901746",
series = "2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings",
pages = "65--70",
booktitle = "2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings",

}

TY - GEN

T1 - Layout-tree-based approach for identifying visually similar blocks in a web page

AU - Zeng, Jun

AU - Flanagan, Brendan

AU - Hirokawa, Sachio

PY - 2013/10/31

Y1 - 2013/10/31

AB - When extracting information from a web page, information extraction (IE) systems usually need to perform pattern recognition to identify elements that have similar patterns. However, most of them are based mainly on analyzing the HTML source code, DOM tree, tag tree, or XPath of web pages. These methods are language-dependent or, more precisely, HTML-dependent, and they have some insuperable limitations. To overcome these limitations, we propose the notion of a layout tree and a pattern recognition method that uses layout trees to identify visual blocks with similar visual patterns. In this paper, we call a visible rectangular region in a web page a visual block, or block for short. If the elements of two blocks are displayed in a similar layout, we define the two blocks to be visually similar. We first transform the layout of each block into a layout tree. By calculating the similarity of the layout trees of two blocks, we can determine whether the two blocks are visually similar. The experimental results show that the layout tree is an effective means of identifying visually similar blocks.

UR - http://www.scopus.com/inward/record.url?scp=84886519649&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84886519649&partnerID=8YFLogxK

U2 - 10.1109/ICIS.2013.6607818

DO - 10.1109/ICIS.2013.6607818

M3 - Conference contribution

AN - SCOPUS:84886519649

SN - 9781479901746

T3 - 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings

SP - 65

EP - 70

BT - 2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings

ER -