Development of a large-scale web crawler and search engine infrastructure

Susumu Akamine, Yoshikiyo Kato, Daisuke Kawahara, Keiji Shinzato, Kentaro Inui, Sadao Kurohashi, Yutaka Kidawara

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Citations (Scopus)

Abstract

This paper reports the ongoing development of a large-scale Web crawler and search engine infrastructure at National Institute of Information and Communications Technology. This infrastructure has the following characteristics: (1) It collects one billion Japanese Web pages while keeping them up-to-date. (2) It selects 100 million pages from among the collected pages and converts them into a standard data format to store the results of morphological analysis, dependency parsing, and synonym augmentation. (3) The selected set of pages is searchable and accessible to the users. (4) The scalability of the system is achieved by using a large-scale cluster machine for distributed data processing.

Original language: English
Title of host publication: Proceedings of the 3rd International Universal Communication Symposium, IUCS 2009
Pages: 126-131
Number of pages: 6
DOI: https://doi.org/10.1145/1667780.1667805
Publication status: Published - Dec 1 2009
Event: 3rd International Universal Communication Symposium, IUCS 2009 - Tokyo, Japan
Duration: Dec 3 2009 - Dec 4 2009

Other

Other: 3rd International Universal Communication Symposium, IUCS 2009
Country: Japan
City: Tokyo
Period: 12/3/09 - 12/4/09

Fingerprint

Search engines
Scalability
Websites
Communication
Web crawler

All Science Journal Classification (ASJC) codes

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

Cite this

Akamine, S., Kato, Y., Kawahara, D., Shinzato, K., Inui, K., Kurohashi, S., & Kidawara, Y. (2009). Development of a large-scale web crawler and search engine infrastructure. In Proceedings of the 3rd International Universal Communication Symposium, IUCS 2009 (pp. 126-131) https://doi.org/10.1145/1667780.1667805

@inproceedings{cd7e2467b99a427d86f3bf851e2fc36d,
title = "Development of a large-scale web crawler and search engine infrastructure",
author = "Susumu Akamine and Yoshikiyo Kato and Daisuke Kawahara and Keiji Shinzato and Kentaro Inui and Sadao Kurohashi and Yutaka Kidawara",
year = "2009",
month = "12",
day = "1",
doi = "10.1145/1667780.1667805",
language = "English",
isbn = "9781605586410",
pages = "126--131",
booktitle = "Proceedings of the 3rd International Universal Communication Symposium, IUCS 2009",

}

TY - GEN

T1 - Development of a large-scale web crawler and search engine infrastructure

AU - Akamine, Susumu

AU - Kato, Yoshikiyo

AU - Kawahara, Daisuke

AU - Shinzato, Keiji

AU - Inui, Kentaro

AU - Kurohashi, Sadao

AU - Kidawara, Yutaka

PY - 2009/12/1

Y1 - 2009/12/1

UR - http://www.scopus.com/inward/record.url?scp=76749136736&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=76749136736&partnerID=8YFLogxK

U2 - 10.1145/1667780.1667805

DO - 10.1145/1667780.1667805

M3 - Conference contribution

AN - SCOPUS:76749136736

SN - 9781605586410

SP - 126

EP - 131

BT - Proceedings of the 3rd International Universal Communication Symposium, IUCS 2009

ER -