Document separation between native English and nonnative English using long POS strings

Kensei Yukino, Sayaka Aoki, Ryuji Tanigawa, Yoichi Tomiura

Research output: Contribution to journal › Article › peer-review

Abstract

We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too low to estimate their probabilities reliably. Meanwhile, a study on language identification showed that long, low-frequency byte strings are useful for distinguishing between similar languages. Language identification and native/non-native document separation share some similarities: for example, long POS strings are more peculiar to one class than short ones, although POS tags and bytes are different units. We can therefore expect higher accuracy from using long, low-frequency POS strings. The experiments described in this paper show that the proposed method achieves higher accuracy than previous methods.
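The abstract does not specify the classifier, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes documents are already POS-tagged (Penn Treebank-style tags are used purely for illustration), collects every POS string up to a maximum length, keeps those seen in only one class, and lets them vote with a weight equal to their length. The function names, the length-weighted voting rule, and the toy training data are all hypothetical.

```python
from collections import Counter


def pos_ngrams(tags, min_n, max_n):
    """Yield every contiguous POS string of length min_n..max_n as a tuple."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i:i + n])


def class_specific_ngrams(native_docs, nonnative_docs, min_n=1, max_n=5):
    """Return the POS strings that occur in only one class of training data."""
    native = Counter(
        g for doc in native_docs for g in pos_ngrams(doc, min_n, max_n)
    )
    nonnative = Counter(
        g for doc in nonnative_docs for g in pos_ngrams(doc, min_n, max_n)
    )
    only_native = set(native) - set(nonnative)
    only_nonnative = set(nonnative) - set(native)
    return only_native, only_nonnative


def classify(tags, only_native, only_nonnative, min_n=1, max_n=5):
    """Vote with class-specific POS strings; longer strings weigh more.

    The length weighting is an illustrative assumption, not taken from the paper.
    """
    score = 0
    for g in pos_ngrams(tags, min_n, max_n):
        if g in only_native:
            score += len(g)
        elif g in only_nonnative:
            score -= len(g)
    if score > 0:
        return "native"
    if score < 0:
        return "nonnative"
    return "unknown"


# Invented toy training data: each document is a list of POS tags.
native_docs = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["PRP", "VBD", "DT", "NN", "IN", "DT", "NN"],
]
nonnative_docs = [
    ["NN", "VBZ", "JJ", "NN"],
    ["PRP", "VBD", "NN", "IN", "NN"],
]

only_native, only_nonnative = class_specific_ngrams(native_docs, nonnative_docs)

label_a = classify(["DT", "NN", "VBZ", "DT", "NN"], only_native, only_nonnative)
label_b = classify(["PRP", "VBD", "NN", "IN", "NN"], only_native, only_nonnative)
print(label_a, label_b)
```

In this toy setup a string such as `("VBD", "NN")` appears only in the non-native class, so it counts as evidence for that class even though it occurs just once. That is the kind of low-frequency signal the abstract argues should not be discarded.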

Original language: English
Pages (from-to): 115-119
Number of pages: 5
Journal: Research Reports on Information Science and Electrical Engineering of Kyushu University
Volume: 11
Issue number: 2
Publication status: Published - Sep 2006

All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
  • Hardware and Architecture
  • Engineering (miscellaneous)
