Abstract
In information retrieval, texts are usually retrieved by matching them with queries. This study proposes an approach in which texts are automatically classified into categories and retrieved by matching them with queries classified in the same way. For efficient information retrieval using automatic classification, methods for extracting words from texts and methods for matching are essential. Several methods for extracting words from Japanese texts have been proposed in natural language processing. However, it is difficult to extract significant words from Japanese texts because they are written without blank spaces separating words. As for matching methods, many weighting methods have been proposed, as well as vector space models and probabilistic models. This article reports the results of an experiment in classifying Japanese texts into Nippon Decimal Classification (NDC) categories based on the title information in Japanese MARC records. In this experiment, three extraction methods (juman, MHSA, and n-gram) were tested on a set of 1,000 books. Four weighting methods, including relative term frequency between categories, tf·idf, and tf(max)·idf, were tested. The results indicate that extraction using juman performed best and that the best weighting method was relative term frequency between categories, which selected the correct classification category (the upper three digits of the NDC) for about 55.9% of the 1,000 books.
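As a rough illustration of the kind of pipeline the abstract describes, the sketch below extracts character n-grams from titles (which needs no word segmentation) and scores each category with a tf·idf-style weight. It is a minimal sketch under stated assumptions, not the article's implementation: the function names, the toy titles, the category labels, and the exact idf formula are all hypothetical, and the juman, MHSA, relative term frequency, and tf(max)·idf variants are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    text = "".join(text.split())  # drop any whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_titles, n=2):
    """Build per-category term counts and an idf computed over categories."""
    tf = defaultdict(Counter)  # category -> n-gram -> count
    for title, category in labeled_titles:
        tf[category].update(char_ngrams(title, n))
    df = Counter()  # n-gram -> number of categories that contain it
    for counts in tf.values():
        df.update(counts.keys())
    # Assumed idf form (log ratio plus one); the article's formula is not given.
    idf = {g: math.log(len(tf) / d) + 1.0 for g, d in df.items()}
    return tf, idf


def classify(title, tf, idf, n=2):
    """Return the category with the highest summed tf*idf over the title's n-grams."""
    grams = char_ngrams(title, n)
    scores = {
        category: sum(counts[g] * idf.get(g, 0.0) for g in grams)
        for category, counts in tf.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data: (title, three-digit category label).
TOY_TITLES = [
    ("図書館情報学入門", "010"),
    ("図書館の歴史", "010"),
    ("数学入門", "410"),
    ("線形代数学", "410"),
]

tf, idf = train(TOY_TITLES)
print(classify("図書館学概論", tf, idf))
print(classify("代数入門", tf, idf))
```

Computing idf over categories rather than over individual titles is one possible design choice here: it down-weights n-grams such as 入門 that occur in many categories and therefore say little about which category a title belongs to.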
Original language | English |
---|---|
Pages (from-to) | 31-45 |
Number of pages | 15 |
Journal | Library and Information Science |
Volume | 39 |
Publication status | Published - Dec 1 1998 |
All Science Journal Classification (ASJC) codes
- Library and Information Sciences
Cite this
An experiment of automatic classification of books using Nippon Decimal Classification. / Ishita, Emi.
In: Library and Information Science, Vol. 39, 01.12.1998, p. 31-45.
Research output: Contribution to journal › Article
TY - JOUR
T1 - An experiment of automatic classification of books using Nippon Decimal Classification
AU - Ishita, Emi
PY - 1998/12/1
Y1 - 1998/12/1
UR - http://www.scopus.com/inward/record.url?scp=25144523665&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=25144523665&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:25144523665
VL - 39
SP - 31
EP - 45
JO - Library and Information Science
JF - Library and Information Science
SN - 0373-4447
ER -