Document separation between native English and nonnative English using long POS strings

Kensei Yukino, Sayaka Aoki, Ryuji Tanigawa, Yoichi Tomiura

Research output: Contribution to journal › Article

Abstract

We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too small to estimate their probabilities reliably. However, research on language identification has shown that long, low-frequency byte strings are useful for distinguishing similar languages. Language identification and native/non-native document separation share some similarities: for example, long POS strings are more peculiar to one class than short ones, although POS tags and bytes are different units. We can therefore expect higher accuracy from using long, low-frequency POS strings. The experiments described in this paper show that the proposed method achieves higher accuracy than previous ones.
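The idea that long POS strings tend to be peculiar to one class can be illustrated with a minimal sketch. It is not the authors' method: the function names, the toy tag sequences, the length bounds, and the simple vote-counting rule are all assumptions made for illustration, and the documents are assumed to be POS-tagged already.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

PosString = Tuple[str, ...]


def pos_strings(tags: List[str], min_len: int, max_len: int) -> Iterable[PosString]:
    """Yield every contiguous POS string whose length is in [min_len, max_len]."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tags) - n + 1):
            yield tuple(tags[i:i + n])


def peculiar_strings(
    train: Dict[str, List[List[str]]], min_len: int, max_len: int
) -> Dict[str, Set[PosString]]:
    """For each class, keep the POS strings seen in that class and in no other."""
    seen = {
        label: {s for doc in docs for s in pos_strings(doc, min_len, max_len)}
        for label, docs in train.items()
    }
    return {
        label: strings
        - set().union(*(other for l, other in seen.items() if l != label))
        for label, strings in seen.items()
    }


def classify(
    tags: List[str], peculiar: Dict[str, Set[PosString]], min_len: int, max_len: int
) -> str:
    """Assign the class whose peculiar strings occur most often in the document.

    Ties go to the class listed first in `peculiar` (an arbitrary choice).
    """
    votes = Counter({label: 0 for label in peculiar})
    for s in pos_strings(tags, min_len, max_len):
        for label, strings in peculiar.items():
            if s in strings:
                votes[label] += 1
    return max(votes, key=votes.get)


# Hypothetical toy training data: each document is a list of POS tags.
TRAIN = {
    "native": [
        ["DT", "JJ", "NN", "VBZ", "DT", "NN"],
        ["PRP", "VBD", "DT", "JJ", "NN"],
    ],
    "non-native": [
        ["PRP", "VBP", "NN", "IN", "NN"],
        ["DT", "NN", "VBP", "JJ", "NN"],
    ],
}

PECULIAR = peculiar_strings(TRAIN, min_len=3, max_len=4)
LABEL = classify(["PRP", "VBD", "DT", "JJ", "NN", "IN", "NN"], PECULIAR, 3, 4)
```

In this toy setup even a single occurrence of a long string in the training data counts as evidence, which is the low-frequency regime the abstract refers to; a probability-based model would be unable to estimate reliable values from such counts.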

Original language: English
Pages (from-to): 115-119
Number of pages: 5
Journal: Research Reports on Information Science and Electrical Engineering of Kyushu University
Volume: 11
Issue number: 2
Publication status: Published - Sep 2006

Fingerprint

Experiments

All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
  • Hardware and Architecture
  • Engineering (miscellaneous)

Cite this

Document separation between native English and nonnative English using long POS strings. / Yukino, Kensei; Aoki, Sayaka; Tanigawa, Ryuji; Tomiura, Yoichi.

In: Research Reports on Information Science and Electrical Engineering of Kyushu University, Vol. 11, No. 2, 09.2006, p. 115-119.

@article{20b2b91b632748068ee087413d3fbddd,
title = "Document separation between native English and nonnative English using long POS strings",
abstract = "We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too small to estimate their probabilities reliably. However, research on language identification has shown that long, low-frequency byte strings are useful for distinguishing similar languages. Language identification and native/non-native document separation share some similarities: for example, long POS strings are more peculiar to one class than short ones, although POS tags and bytes are different units. We can therefore expect higher accuracy from using long, low-frequency POS strings. The experiments described in this paper show that the proposed method achieves higher accuracy than previous ones.",
author = "Kensei Yukino and Sayaka Aoki and Ryuji Tanigawa and Yoichi Tomiura",
year = "2006",
month = "9",
language = "English",
volume = "11",
pages = "115--119",
journal = "Research Reports on Information Science and Electrical Engineering of Kyushu University",
issn = "1342-3819",
publisher = "Kyushu University, Faculty of Science",
number = "2",

}

TY - JOUR

T1 - Document separation between native English and nonnative English using long POS strings

AU - Yukino, Kensei

AU - Aoki, Sayaka

AU - Tanigawa, Ryuji

AU - Tomiura, Yoichi

PY - 2006/9

Y1 - 2006/9

N2 - We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too small to estimate their probabilities reliably. However, research on language identification has shown that long, low-frequency byte strings are useful for distinguishing similar languages. Language identification and native/non-native document separation share some similarities: for example, long POS strings are more peculiar to one class than short ones, although POS tags and bytes are different units. We can therefore expect higher accuracy from using long, low-frequency POS strings. The experiments described in this paper show that the proposed method achieves higher accuracy than previous ones.

AB - We propose using long, low-frequency part-of-speech (POS) strings to separate native English documents from non-native English documents. Previous work ignored long POS strings because their frequencies in training data are too small to estimate their probabilities reliably. However, research on language identification has shown that long, low-frequency byte strings are useful for distinguishing similar languages. Language identification and native/non-native document separation share some similarities: for example, long POS strings are more peculiar to one class than short ones, although POS tags and bytes are different units. We can therefore expect higher accuracy from using long, low-frequency POS strings. The experiments described in this paper show that the proposed method achieves higher accuracy than previous ones.

UR - http://www.scopus.com/inward/record.url?scp=33751581326&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33751581326&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:33751581326

VL - 11

SP - 115

EP - 119

JO - Research Reports on Information Science and Electrical Engineering of Kyushu University

JF - Research Reports on Information Science and Electrical Engineering of Kyushu University

SN - 1342-3819

IS - 2

ER -