Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features

Chikara Hashimoto, Daisuke Kawahara

Research output: Contribution to conference › Paper

12 Citations (Scopus)

Abstract

Some phrases can be interpreted either idiomatically (figuratively) or literally in context, and the precise identification of idioms is indispensable for full-fledged natural language processing (NLP). To this end, we have constructed an idiom corpus for Japanese. This paper reports on the corpus and the results of an idiom identification experiment using the corpus. The corpus targets 146 ambiguous idioms, and consists of 102,846 sentences, each of which is annotated with a literal/idiom label. For idiom identification, we targeted 90 out of the 146 idioms and adopted a word sense disambiguation (WSD) method using both common WSD features and idiom-specific features. The corpus and the experiment are the largest of their kind, as far as we know. As a result, we found that a standard supervised WSD method works well for the idiom identification and achieved an accuracy of 89.25% and 88.86% with/without idiom-specific features and that the most effective idiom-specific feature is the one involving the adjacency of idiom constituents.

Original language: English
Pages: 992-1001
Number of pages: 10
Publication status: Published - Dec 1 2008
Event: 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation - Honolulu, HI, United States
Duration: Oct 25 2008 - Oct 27 2008

Other

Event: 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation
Country: United States
City: Honolulu, HI
Period: 10/25/08 - 10/27/08

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Hashimoto, C., & Kawahara, D. (2008). Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. 992-1001. Paper presented at 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation, Honolulu, HI, United States.

@conference{2acc608c3d83474d9617ac1624be7806,
title = "Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features",
abstract = "Some phrases can be interpreted either idiomatically (figuratively) or literally in context, and the precise identification of idioms is indispensable for full-fledged natural language processing (NLP). To this end, we have constructed an idiom corpus for Japanese. This paper reports on the corpus and the results of an idiom identification experiment using the corpus. The corpus targets 146 ambiguous idioms, and consists of 102,846 sentences, each of which is annotated with a literal/idiom label. For idiom identification, we targeted 90 out of the 146 idioms and adopted a word sense disambiguation (WSD) method using both common WSD features and idiom-specific features. The corpus and the experiment are the largest of their kind, as far as we know. As a result, we found that a standard supervised WSD method works well for the idiom identification and achieved an accuracy of 89.25{\%} and 88.86{\%} with/without idiom-specific features and that the most effective idiom-specific feature is the one involving the adjacency of idiom constituents.",
author = "Chikara Hashimoto and Daisuke Kawahara",
year = "2008",
month = "12",
day = "1",
language = "English",
pages = "992--1001",
note = "2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Co-located with AMTA 2008 and the International Workshop on Spoken Language Translation ; Conference date: 25-10-2008 Through 27-10-2008",

}

TY - CONF

T1 - Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features

AU - Hashimoto, Chikara

AU - Kawahara, Daisuke

PY - 2008/12/1

Y1 - 2008/12/1

N2 - Some phrases can be interpreted either idiomatically (figuratively) or literally in context, and the precise identification of idioms is indispensable for full-fledged natural language processing (NLP). To this end, we have constructed an idiom corpus for Japanese. This paper reports on the corpus and the results of an idiom identification experiment using the corpus. The corpus targets 146 ambiguous idioms, and consists of 102,846 sentences, each of which is annotated with a literal/idiom label. For idiom identification, we targeted 90 out of the 146 idioms and adopted a word sense disambiguation (WSD) method using both common WSD features and idiom-specific features. The corpus and the experiment are the largest of their kind, as far as we know. As a result, we found that a standard supervised WSD method works well for the idiom identification and achieved an accuracy of 89.25% and 88.86% with/without idiom-specific features and that the most effective idiom-specific feature is the one involving the adjacency of idiom constituents.

UR - http://www.scopus.com/inward/record.url?scp=80053372197&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053372197&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:80053372197

SP - 992

EP - 1001

ER -