An experiment of automatic classification of books using Nippon Decimal Classification

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

In information retrieval, texts are usually retrieved by matching them with queries. This study proposes an approach in which texts are automatically classified into categories and retrieved by matching them with queries classified in the same way. For efficient information retrieval using automatic classification, methods for extracting words from texts and methods for matching are essential. Several methods for extracting words from Japanese texts have been proposed in natural language processing. However, it is difficult to extract significant words from Japanese texts because they are written without blank spaces separating words. As for matching methods, many weighting methods have been proposed, as well as vector space models and probabilistic models. This article reports the results of an experiment in classifying Japanese texts into Nippon Decimal Classification (NDC) categories based on the title information in Japanese MARC records. In this experiment, three extraction methods (juman, MHSA, and n-gram) are tested on a set of 1,000 books. Four weighting methods, including relative term frequency between categories, tf·idf, and tf(max)·idf, are tested. The results indicate that extraction with juman performed best and that the best weighting method was relative term frequency between categories, which selected the correct classification category (the upper three digits of NDC) for about 55.9% of the 1,000 books.
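
As a rough illustration of the kind of pipeline the abstract describes, the following Python sketch extracts character bigrams from unsegmented Japanese titles and scores categories with a generic tf·idf weighting. It is not the article's implementation: the abstract does not give the exact formulas, so treating each category as one document for idf is an assumption, and the function names (`char_ngrams`, `train`, `classify`) and the sample titles with their NDC codes are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (title, three-digit NDC code). Illustrative only.
SAMPLE = [
    ("日本の歴史", "210"),
    ("日本史入門", "210"),
    ("情報検索の基礎", "007"),
    ("情報処理入門", "007"),
]


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def train(labeled_titles, n=2):
    """Build tf*idf weights per category (assumption: one category = one document)."""
    tf = defaultdict(Counter)
    for title, category in labeled_titles:
        tf[category].update(char_ngrams(title, n))

    # Document frequency: in how many categories does each term occur?
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    num_categories = len(tf)
    return {
        category: {
            term: freq * math.log(num_categories / df[term])
            for term, freq in counts.items()
        }
        for category, counts in tf.items()
    }


def classify(title, weights, n=2):
    """Return the category whose weighted terms best match the title's n-grams."""
    terms = char_ngrams(title, n)
    scores = {
        category: sum(w.get(term, 0.0) for term in terms)
        for category, w in weights.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    weights = train(SAMPLE)
    print(classify("日本の歴史地図", weights))
```

In this sketch, a bigram that occurs in every category receives an idf of zero and so does not influence the choice; only bigrams that distinguish categories contribute to the score.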

Original language: English
Pages (from-to): 31-45
Number of pages: 15
Journal: Library and Information Science
Volume: 39
Publication status: Published - Dec 1 1998

Fingerprint

experiment
weighting
information retrieval
language

All Science Journal Classification (ASJC) codes

  • Library and Information Sciences

Cite this

An experiment of automatic classification of books using Nippon Decimal Classification. / Ishita, Emi.

In: Library and Information Science, Vol. 39, 01.12.1998, p. 31-45.

@article{04fd742f6d964080bf1e3637064756ba,
title = "An experiment of automatic classification of books using Nippon Decimal Classification",
abstract = "In information retrieval, texts are usually retrieved by matching them with queries. This study proposes an approach in which texts are automatically classified into categories and retrieved by matching them with queries classified in the same way. For efficient information retrieval using automatic classification, methods for extracting words from texts and methods for matching are essential. Several methods for extracting words from Japanese texts have been proposed in natural language processing. However, it is difficult to extract significant words from Japanese texts because they are written without blank spaces separating words. As for matching methods, many weighting methods have been proposed, as well as vector space models and probabilistic models. This article reports the results of an experiment in classifying Japanese texts into Nippon Decimal Classification (NDC) categories based on the title information in Japanese MARC records. In this experiment, three extraction methods (juman, MHSA, and n-gram) are tested on a set of 1,000 books. Four weighting methods, including relative term frequency between categories, tf·idf, and tf(max)·idf, are tested. The results indicate that extraction with juman performed best and that the best weighting method was relative term frequency between categories, which selected the correct classification category (the upper three digits of NDC) for about 55.9{\%} of the 1,000 books.",
author = "Emi Ishita",
year = "1998",
month = "12",
day = "1",
language = "English",
volume = "39",
pages = "31--45",
journal = "Library and Information Science",
issn = "0373-4447",
publisher = "Mita Society for Library and Information Science",

}

TY - JOUR

T1 - An experiment of automatic classification of books using Nippon Decimal Classification

AU - Ishita, Emi

PY - 1998/12/1

Y1 - 1998/12/1

N2 - In information retrieval, texts are usually retrieved by matching them with queries. This study proposes an approach in which texts are automatically classified into categories and retrieved by matching them with queries classified in the same way. For efficient information retrieval using automatic classification, methods for extracting words from texts and methods for matching are essential. Several methods for extracting words from Japanese texts have been proposed in natural language processing. However, it is difficult to extract significant words from Japanese texts because they are written without blank spaces separating words. As for matching methods, many weighting methods have been proposed, as well as vector space models and probabilistic models. This article reports the results of an experiment in classifying Japanese texts into Nippon Decimal Classification (NDC) categories based on the title information in Japanese MARC records. In this experiment, three extraction methods (juman, MHSA, and n-gram) are tested on a set of 1,000 books. Four weighting methods, including relative term frequency between categories, tf·idf, and tf(max)·idf, are tested. The results indicate that extraction with juman performed best and that the best weighting method was relative term frequency between categories, which selected the correct classification category (the upper three digits of NDC) for about 55.9% of the 1,000 books.

UR - http://www.scopus.com/inward/record.url?scp=25144523665&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=25144523665&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:25144523665

VL - 39

SP - 31

EP - 45

JO - Library and Information Science

JF - Library and Information Science

SN - 0373-4447

ER -