Compilation of an idiom example database for supervised idiom identification

Chikara Hashimoto, Daisuke Kawahara

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential in order to achieve full-fledged natural language processing. Because of this, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targeted 146 ambiguous idioms, and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 out of the 146 idioms were targeted and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. It was discovered that a standard supervised WSD method works well for idiom identification and it achieved accuracy levels of 89.25% and 88.86%, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.

Original language: English
Pages (from-to): 355-384
Number of pages: 30
Journal: Language Resources and Evaluation
Volume: 43
Issue number: 4
DOI: 10.1007/s10579-009-9104-1
Publication status: Published - Dec 1 2009

Fingerprint

experiment
Idioms
Compilation
Data Base
language
Disambiguation
Word Sense

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences

Cite this

Compilation of an idiom example database for supervised idiom identification. / Hashimoto, Chikara; Kawahara, Daisuke.

In: Language Resources and Evaluation, Vol. 43, No. 4, 01.12.2009, p. 355-384.

@article{128f15ad6b6c4ed292c0713d652c592b,
title = "Compilation of an idiom example database for supervised idiom identification",
abstract = "Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential in order to achieve full-fledged natural language processing. Because of this, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targeted 146 ambiguous idioms, and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 out of the 146 idioms were targeted and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. It was discovered that a standard supervised WSD method works well for idiom identification and it achieved accuracy levels of 89.25{\%} and 88.86{\%}, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.",
author = "Chikara Hashimoto and Daisuke Kawahara",
year = "2009",
month = "12",
day = "1",
doi = "10.1007/s10579-009-9104-1",
language = "English",
volume = "43",
pages = "355--384",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "4",

}

TY - JOUR

T1 - Compilation of an idiom example database for supervised idiom identification

AU - Hashimoto, Chikara

AU - Kawahara, Daisuke

PY - 2009/12/1

Y1 - 2009/12/1

AB - Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential in order to achieve full-fledged natural language processing. Because of this, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targeted 146 ambiguous idioms, and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 out of the 146 idioms were targeted and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. It was discovered that a standard supervised WSD method works well for idiom identification and it achieved accuracy levels of 89.25% and 88.86%, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.

UR - http://www.scopus.com/inward/record.url?scp=77950756086&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77950756086&partnerID=8YFLogxK

U2 - 10.1007/s10579-009-9104-1

DO - 10.1007/s10579-009-9104-1

M3 - Article

AN - SCOPUS:77950756086

VL - 43

SP - 355

EP - 384

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 4

ER -