Significance of Low-Frequent Words in Concept Describing Document

Yuki Okumura, Sachio Hirokawa, Kazuhiro Takeuchi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In applications of information retrieval, text mining, and natural language processing, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how significant a word is to a document in a collection. The value of tf-idf increases in proportion to the number of times a word occurs in the document and is offset by the number of documents in the corpus that contain the word, which compensates for the fact that some words appear more frequently in general. Therefore, tf-idf is designed to assign a higher weight to a word that occurs frequently in a given document. In other words, document classification using tf-idf disregards the role of infrequent words. In this paper, we focus on words that appear infrequently in a document. Specifically, we examine the features that characterize document sets describing specific knowledge, using an SVM (Support Vector Machine) based feature extraction method. As a result, we confirmed that words appearing only once in a document are found in documents describing specific knowledge and contribute to distinguishing them from documents that describe general knowledge.
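The weighting described in the abstract can be illustrated with a minimal sketch. It assumes the common tf × log(N/df) variant with length-normalized term frequency; the abstract does not state which variant the authors use, and the function names `tf_idf` and `singleton_words` as well as the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


def singleton_words(doc):
    """Return the words that occur exactly once in a tokenized document."""
    counts = Counter(doc)
    return {word for word, count in counts.items() if count == 1}


# Toy corpus (hypothetical data, not from the paper).
docs = [
    "the kernel maps the data".split(),
    "the data is the input".split(),
    "the margin separates the classes".split(),
]
weights = tf_idf(docs)
once_only = singleton_words(docs[0])
```

In this toy corpus, "the" occurs in every document and so receives a weight of zero, while "kernel", which occurs once in a single document, receives a positive weight. Words that occur once per document, as collected by `singleton_words`, are the kind of feature the paper examines.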

Original language: English
Title of host publication: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1035-1036
Number of pages: 2
ISBN (Electronic): 9781728126272
DOIs: https://doi.org/10.1109/IIAI-AAI.2019.00214
Publication status: Published - Jul 2019
Event: 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019 - Toyama, Japan
Duration: Jul 7 2019 to Jul 11 2019

Publication series

Name: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019

Conference

Conference: 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019
Country: Japan
City: Toyama
Period: 7/7/19 to 7/11/19

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management
  • Social Sciences (miscellaneous)

Cite this

Okumura, Y., Hirokawa, S., & Takeuchi, K. (2019). Significance of Low-Frequent Words in Concept Describing Document. In Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019 (pp. 1035-1036). [8992811] (Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IIAI-AAI.2019.00214