Extraction of relevant snippets from web pages using hybrid features

Jun Zeng, Junhao Wen, Qingyu Xiong, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no single pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query submitted to the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that HTML structure is the most effective feature for determining whether a snippet is relevant.
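
The three feature groups named in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not list the nine parameters, so the five values computed below (character count, word count, query-term overlap, tag count, link-text ratio) and the names `snippet_features` and `_SnippetParser` are assumptions chosen only for illustration. In a setup like the one described, vectors of this kind would be passed to an SVM classifier.

```python
from html.parser import HTMLParser
import re


class _SnippetParser(HTMLParser):
    """Collects visible text, anchor text, and a tag count from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # all text nodes
        self.link_parts = []   # text nodes found inside <a> elements
        self.tag_count = 0     # number of opening tags seen
        self._in_link = 0      # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link > 0:
            self._in_link -= 1

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._in_link:
            self.link_parts.append(data)


def _tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def snippet_features(html_fragment, query):
    """Return illustrative (hypothetical) features for one candidate snippet."""
    parser = _SnippetParser()
    parser.feed(html_fragment)
    parser.close()

    # Join text nodes and collapse runs of whitespace.
    text = " ".join(" ".join(parser.text_parts).split())
    link_text = " ".join(" ".join(parser.link_parts).split())

    words = _tokens(text)
    query_terms = set(_tokens(query))

    # Share of distinct query terms that occur in the snippet text.
    overlap = len(query_terms & set(words)) / len(query_terms) if query_terms else 0.0
    # Share of the visible text that sits inside links.
    link_ratio = len(link_text) / len(text) if text else 0.0

    return {
        "char_count": len(text),          # quantity of text
        "word_count": len(words),         # quantity of text
        "query_overlap": overlap,         # correlation with the query
        "tag_count": parser.tag_count,    # HTML structure
        "link_text_ratio": link_ratio,    # HTML structure
    }
```

Calling `snippet_features('<p>Some text</p>', 'text')` returns a dictionary with one entry per feature; the values can be placed in a fixed order to form the input vector for a classifier.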

Original language: English
Title of host publication: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
Pages: 209-213
Number of pages: 5
DOIs: https://doi.org/10.1109/IIAI-AAI.2012.50
Publication status: Published - Dec 14 2012
Event: 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012 - Fukuoka, Japan
Duration: Sep 20 2012 to Sep 22 2012

Publication series

Name: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012

Other

Other: 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
Country: Japan
City: Fukuoka
Period: 9/20/12 to 9/22/12

Fingerprint

HTML
Websites
Search engines
Model structures
Experiments

All Science Journal Classification (ASJC) codes

  • Information Systems

Cite this

Zeng, J., Wen, J., Xiong, Q., & Hirokawa, S. (2012). Extraction of relevant snippets from web pages using hybrid features. In Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012 (pp. 209-213). [6337190] (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012). https://doi.org/10.1109/IIAI-AAI.2012.50

Extraction of relevant snippets from web pages using hybrid features. / Zeng, Jun; Wen, Junhao; Xiong, Qingyu; Hirokawa, Sachio.

Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. p. 209-213 6337190 (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012).

Zeng, J, Wen, J, Xiong, Q & Hirokawa, S 2012, Extraction of relevant snippets from web pages using hybrid features. in Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012., 6337190, Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012, pp. 209-213, 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012, Fukuoka, Japan, 9/20/12. https://doi.org/10.1109/IIAI-AAI.2012.50
Zeng J, Wen J, Xiong Q, Hirokawa S. Extraction of relevant snippets from web pages using hybrid features. In Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. p. 209-213. 6337190. (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012). https://doi.org/10.1109/IIAI-AAI.2012.50
Zeng, Jun ; Wen, Junhao ; Xiong, Qingyu ; Hirokawa, Sachio. / Extraction of relevant snippets from web pages using hybrid features. Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012. 2012. pp. 209-213 (Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012).
@inproceedings{a6beb7c5053442fbb309cfb153ddd6d1,
title = "Extraction of relevant snippets from web pages using hybrid features",
abstract = "As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no single pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query submitted to the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that HTML structure is the most effective feature for determining whether a snippet is relevant.",
author = "Jun Zeng and Junhao Wen and Qingyu Xiong and Sachio Hirokawa",
year = "2012",
month = "12",
day = "14",
doi = "10.1109/IIAI-AAI.2012.50",
language = "English",
isbn = "9780769548265",
series = "Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012",
pages = "209--213",
booktitle = "Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012",

}

TY - GEN

T1 - Extraction of relevant snippets from web pages using hybrid features

AU - Zeng, Jun

AU - Wen, Junhao

AU - Xiong, Qingyu

AU - Hirokawa, Sachio

PY - 2012/12/14

Y1 - 2012/12/14

N2 - As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no single pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query submitted to the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that HTML structure is the most effective feature for determining whether a snippet is relevant.

AB - As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no single pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query submitted to the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that HTML structure is the most effective feature for determining whether a snippet is relevant.

UR - http://www.scopus.com/inward/record.url?scp=84870794135&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84870794135&partnerID=8YFLogxK

U2 - 10.1109/IIAI-AAI.2012.50

DO - 10.1109/IIAI-AAI.2012.50

M3 - Conference contribution

AN - SCOPUS:84870794135

SN - 9780769548265

T3 - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012

SP - 209

EP - 213

BT - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012

ER -