Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC

Masayuki Takeda, Tetsuya Matsumoto, Tomoko Fukuda, Ichirō Nanri

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We attempt to extract characteristic expressions from literary works. Specifically, given works by a particular writer as positive examples and works by another writer as negative examples, the problem is to find expressions that appear frequently in the positive examples but not in the negative examples. This can be regarded as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to create a list of substrings in descending order of goodness and have a human expert examine the top of the list. Since Japanese text has no word boundaries, a substring is often a fragment of a word or phrase, so assisting the human expert is key to successful discovery. In this paper, we propose (1) restricting the list to prime substrings in order to remove redundancy, and (2) a way of browsing the neighbors of a focused string as well as its context. We report successful results of applying this method to two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
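
The ranking-and-browsing workflow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the goodness score (fraction of positive texts containing a substring minus the fraction of negative texts containing it), the `max_len` cutoff, the function names, and the redundancy filter. The abstract does not define "prime substring", so the filter below is a stand-in that drops a substring whenever a longer listed substring contains it and occurs in exactly the same texts. The `kwic` helper shows a plain keyword-in-context view of a chosen string.

```python
from collections import defaultdict


def substrings(text, max_len):
    """Return the distinct substrings of text with at most max_len characters."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, min(len(text), i + max_len) + 1)
    }


def rank_substrings(positive, negative, max_len=4):
    """Rank substrings of the positive texts by an assumed goodness score.

    Score (an assumption, not taken from the paper): fraction of positive
    texts containing the substring minus the fraction of negative texts
    containing it.
    """
    pos_docs = defaultdict(set)
    neg_docs = defaultdict(set)
    for idx, text in enumerate(positive):
        for s in substrings(text, max_len):
            pos_docs[s].add(idx)
    for idx, text in enumerate(negative):
        for s in substrings(text, max_len):
            neg_docs[s].add(idx)

    # Group substrings that occur in exactly the same texts.
    groups = defaultdict(list)
    for s, docs in pos_docs.items():
        signature = (frozenset(docs), frozenset(neg_docs.get(s, ())))
        groups[signature].append(s)

    ranked = []
    for (pos_set, neg_set), members in groups.items():
        score = len(pos_set) / len(positive) - len(neg_set) / len(negative)
        for s in members:
            # Stand-in redundancy filter: skip s if a longer member of the
            # same group contains it, since it adds nothing to the list.
            if any(s != t and s in t for t in members):
                continue
            ranked.append((s, score))

    # Best score first; ties broken by longer string, then alphabetically.
    ranked.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return ranked


def kwic(texts, key, width=5):
    """Return keyword-in-context lines for every occurrence of key."""
    lines = []
    for text in texts:
        start = text.find(key)
        while start != -1:
            left = text[max(0, start - width):start]
            right = text[start + len(key):start + len(key) + width]
            lines.append(f"{left:>{width}} [{key}] {right}")
            start = text.find(key, start + 1)
    return lines


if __name__ == "__main__":
    # Toy strings standing in for the two sets of texts.
    positive = ["abcx", "abcy"]
    negative = ["bz", "cz"]
    for expression, score in rank_substrings(positive, negative, max_len=3):
        print(expression, score)
    for line in kwic(positive, "bc", width=1):
        print(line)
```

Enumerating every substring this way is quadratic in text length and only suits toy inputs; it is meant to make the ranking idea concrete, not to suggest how the original system was built.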

Original language: English
Title of host publication: Discovery Science - 3rd International Conference, DS 2000, Proceedings
Editors: Setsuo Arikawa, Shinichi Morishita
Publisher: Springer Verlag
Pages: 112-126
Number of pages: 15
ISBN (Print): 9783540413523
Publication status: Published - Jan 1 2000
Event: 3rd International Conference on Discovery Science, DS 2000 - Kyoto, Japan
Duration: Dec 4 2000 to Dec 6 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1967
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 3rd International Conference on Discovery Science, DS 2000
Country: Japan
City: Kyoto
Period: 12/4/00 to 12/6/00

Fingerprint

Text Analysis
N-gram
Redundancy
Statistics
Pattern Discovery
Browsing
Fragment
Strings
Human

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Takeda, M., Matsumoto, T., Fukuda, T., & Nanri, I. (2000). Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC. In S. Arikawa, & S. Morishita (Eds.), Discovery Science - 3rd International Conference, DS 2000, Proceedings (pp. 112-126). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1967). Springer Verlag.

Discovering characteristic expressions from literary works : A new text analysis method beyond N-gram statistics and KWIC. / Takeda, Masayuki; Matsumoto, Tetsuya; Fukuda, Tomoko; Nanri, Ichirō.

Discovery Science - 3rd International Conference, DS 2000, Proceedings. ed. / Setsuo Arikawa; Shinichi Morishita. Springer Verlag, 2000. p. 112-126 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1967).

Takeda, M, Matsumoto, T, Fukuda, T & Nanri, I 2000, Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC. in S Arikawa & S Morishita (eds), Discovery Science - 3rd International Conference, DS 2000, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 1967, Springer Verlag, pp. 112-126, 3rd International Conference on Discovery Science, DS 2000, Kyoto, Japan, 12/4/00.
Takeda M, Matsumoto T, Fukuda T, Nanri I. Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC. In Arikawa S, Morishita S, editors, Discovery Science - 3rd International Conference, DS 2000, Proceedings. Springer Verlag. 2000. p. 112-126. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
Takeda, Masayuki ; Matsumoto, Tetsuya ; Fukuda, Tomoko ; Nanri, Ichirō. / Discovering characteristic expressions from literary works : A new text analysis method beyond N-gram statistics and KWIC. Discovery Science - 3rd International Conference, DS 2000, Proceedings. editor / Setsuo Arikawa ; Shinichi Morishita. Springer Verlag, 2000. pp. 112-126 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{306e96697c6044eb921975e1a33a9c5a,
title = "Discovering characteristic expressions from literary works: A new text analysis method beyond N-gram statistics and KWIC",
author = "Masayuki Takeda and Tetsuya Matsumoto and Tomoko Fukuda and Ichirō Nanri",
year = "2000",
month = "1",
day = "1",
language = "English",
isbn = "9783540413523",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "112--126",
editor = "Setsuo Arikawa and Shinichi Morishita",
booktitle = "Discovery Science - 3rd International Conference, DS 2000, Proceedings",
address = "Germany",

}

TY - GEN

T1 - Discovering characteristic expressions from literary works

T2 - A new text analysis method beyond N-gram statistics and KWIC

AU - Takeda, Masayuki

AU - Matsumoto, Tetsuya

AU - Fukuda, Tomoko

AU - Nanri, Ichirō

PY - 2000/1/1

Y1 - 2000/1/1

UR - http://www.scopus.com/inward/record.url?scp=84974725411&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84974725411&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84974725411

SN - 9783540413523

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 112

EP - 126

BT - Discovery Science - 3rd International Conference, DS 2000, Proceedings

A2 - Arikawa, Setsuo

A2 - Morishita, Shinichi

PB - Springer Verlag

ER -