Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC

Masayuki Takeda, Tetsuya Matsumoto, Tomoko Fukuda, Ichirō Nanri

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We attempt to extract characteristic expressions from literary works. That is, given literary works by a particular writer as positive examples and works by another writer as negative examples, the problem is to find expressions that appear frequently in the positive examples but not in the negative examples. This can be regarded as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to create a list of substrings arranged in descending order of goodness and to have a human expert examine the first part of the list. Since Japanese text has no word boundaries, a substring is often a fragment of a word or a phrase, so how to assist the human expert is key to successful discovery. In this paper, we propose (1) restricting the list to prime substrings in order to remove redundancy, and (2) a way of browsing the neighbors of a focused string as well as its context. Using this method, we report successful results on two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
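The ranking idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the goodness score (difference in document frequency between positive and negative examples), and the redundancy filter, which keeps only the longest substrings among those occurring in exactly the same documents. That filter is a crude stand-in for the prime-substring restriction, whose precise definition is not given in this abstract.

```python
from collections import defaultdict


def substrings(text, max_len):
    """All distinct substrings of `text` with 1..max_len characters."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, min(i + max_len, len(text)) + 1)
    }


def characteristic_substrings(positive, negative, max_len=4):
    """Rank substrings by how much more often they occur in `positive`.

    Hypothetical score: fraction of positive documents containing the
    substring minus the fraction of negative documents containing it.
    """
    # Map each substring to the set of documents that contain it.
    occurs = defaultdict(set)
    for label, docs in (("pos", positive), ("neg", negative)):
        for idx, doc in enumerate(docs):
            for s in substrings(doc, max_len):
                occurs[s].add((label, idx))

    # Group substrings that occur in exactly the same documents.
    groups = defaultdict(list)
    for s, where in occurs.items():
        groups[frozenset(where)].append(s)

    ranked = []
    for where, members in groups.items():
        pos_hits = sum(1 for label, _ in where if label == "pos")
        neg_hits = len(where) - pos_hits
        score = pos_hits / len(positive) - neg_hits / len(negative)
        # Assumed redundancy filter: drop a substring when a longer one
        # in the same group contains it (same documents, less specific).
        for s in members:
            if not any(s != t and s in t for t in members):
                ranked.append((score, s))

    # Best score first; ties broken by length, then alphabetically.
    ranked.sort(key=lambda item: (-item[0], -len(item[1]), item[1]))
    return ranked


# Toy data, made up for illustration (not the anthologies from the paper).
top = characteristic_substrings(["abcx", "abcy"], ["xyz", "bzz"], max_len=3)[:3]
print(top)  # [(1.0, 'abc'), (0.5, 'bcx'), (0.5, 'bcy')]
```

In the toy run, "a", "ab", "bc", and "c" occur in exactly the same documents as "abc", so only "abc" is kept in the list an expert would read. This sketch enumerates substrings by brute force, which is reasonable only for short texts such as individual poems.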

Original language: English
Title of host publication: Discovery Science - 3rd International Conference, DS 2000, Proceedings
Editors: Setsuo Arikawa, Shinichi Morishita
Publisher: Springer Verlag
Pages: 112-126
Number of pages: 15
ISBN (Print): 9783540413523
Publication status: Published - Jan 1 2000
Event: 3rd International Conference on Discovery Science, DS 2000 - Kyoto, Japan
Duration: Dec 4 2000 to Dec 6 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1967
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 3rd International Conference on Discovery Science, DS 2000
Country: Japan
City: Kyoto
Period: 12/4/00 to 12/6/00

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)


Cite this

    Takeda, M., Matsumoto, T., Fukuda, T., & Nanri, I. (2000). Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC. In S. Arikawa, & S. Morishita (Eds.), Discovery Science - 3rd International Conference, DS 2000, Proceedings (pp. 112-126). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1967). Springer Verlag.