Integrating multiple dependency corpora for inducing wide-coverage Japanese CCG resources

Sumire Uematsu, Takuya Matsuzaki, Hiroki Hanaoka, Yusuke Miyao, Hideki Mima

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

A novel method to induce wide-coverage Combinatory Categorial Grammar (CCG) resources for Japanese is proposed in this article. For some languages including English, the availability of large annotated corpora and the development of data-based induction of lexicalized grammar have enabled deep parsing, i.e., parsing based on lexicalized grammars. However, deep parsing for Japanese has not been widely studied. This is mainly because most Japanese syntactic resources are represented in chunk-based dependency structures, while previous methods for inducing grammars are dependent on tree corpora. To translate syntactic information presented in chunk-based dependencies to phrase structures as accurately as possible, integration of annotation from multiple dependency-based corpora is proposed. Our method first integrates dependency structures and predicate-argument information and converts them into phrase structure trees. The trees are then transformed into CCG derivations in a similar way to previously proposed methods. The quality of the conversion is empirically evaluated in terms of the coverage of the obtained CCG lexicon and the accuracy of the parsing with the grammar. While the transforming process used in this study is specialized for Japanese, the framework of our method would be applicable to other languages for which dependency-based analysis has been regarded as more appropriate than phrase structure-based analysis due to morphosyntactic features.

Original language: English
Article number: 1
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 14
Issue number: 1
Publication status: Published - Jan 2015
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Computer Science (all)
