CEFR-based lexical simplification dataset

Satoru Uchida, Shohei Takada, Yuki Arase

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

This study creates CEFR-LS, a dataset for lexical simplification based on the levels of the Common European Framework of Reference for Languages (CEFR). Lexical simplification remains an important task for language learning and education. Several language resources for lexical simplification are available for generating rules and for training simplifiers with machine learning. However, these resources are not tailored to language education: their word levels and candidate lists tend to be subjective. In contrast, the target and candidate words in CEFR-LS are assigned CEFR levels using the CEFR-J wordlists and the English Vocabulary Profile, and candidates are selected using an online thesaurus. Because CEFR is widely used around the world, a simplification method built on this dataset can be applied directly to language education. CEFR-LS currently includes 406 targets and 4912 candidates. To evaluate the validity of CEFR-LS for machine learning, two basic models are employed for selecting candidates, and the results are presented as a reference for future users of the dataset.
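
The idea of pairing each target word with CEFR-levelled candidates can be illustrated with a minimal sketch. It is not the authors' implementation or either of the two models mentioned in the abstract: the function names, the `TOY_LEVELS` table, and the levels in it are hypothetical placeholders invented for illustration, not values taken from the CEFR-J wordlists or the English Vocabulary Profile. The sketch only assumes that a simpler candidate is one whose CEFR level is lower than the target's.

```python
# CEFR levels from easiest to hardest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical word-to-level table; the levels are made up for this example.
TOY_LEVELS = {"purchase": "B2", "acquire": "B2", "obtain": "B1", "buy": "A1"}


def level_rank(word, levels):
    """Return the position of the word's CEFR level, or None if it is unlisted."""
    level = levels.get(word.lower())
    return CEFR_ORDER.index(level) if level in CEFR_ORDER else None


def simpler_candidates(target, candidates, levels):
    """Keep candidates whose CEFR level is strictly lower than the target's."""
    target_rank = level_rank(target, levels)
    if target_rank is None:
        return []  # without a level for the target, nothing can be compared
    ranked = []
    for cand in candidates:
        rank = level_rank(cand, levels)
        if rank is not None and rank < target_rank:
            ranked.append((rank, cand))
    ranked.sort()  # easiest level first, then alphabetical
    return [(cand, CEFR_ORDER[rank]) for rank, cand in ranked]


print(simpler_candidates("purchase", ["buy", "acquire", "obtain", "procure"], TOY_LEVELS))
# prints [('buy', 'A1'), ('obtain', 'B1')]
```

In this sketch, candidates without a level ("procure") or at the same level as the target ("acquire") are dropped. A real system would also need to check that a candidate fits the sentence context, which this example does not attempt.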

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 3254-3258
Number of pages: 5
ISBN (electronic): 9791095546009
Publication status: Published - 1 Jan 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018

Publication series

Name: LREC 2018 - 11th International Conference on Language Resources and Evaluation

Fingerprint

language
candidacy
language education
simplification
Common European Framework of Reference for Languages
learning
thesaurus
resources
vocabulary
education
machine learning

All Science Journal Classification (ASJC) codes

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Uchida, S., Takada, S., & Arase, Y. (2019). CEFR-based lexical simplification dataset. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 3254-3258). (LREC 2018 - 11th International Conference on Language Resources and Evaluation). European Language Resources Association (ELRA).

CEFR-based lexical simplification dataset. / Uchida, Satoru; Takada, Shohei; Arase, Yuki.

LREC 2018 - 11th International Conference on Language Resources and Evaluation. ed. / Hitoshi Isahara; Bente Maegaard; Stelios Piperidis; Christopher Cieri; Thierry Declerck; Koiti Hasida; Helene Mazo; Khalid Choukri; Sara Goggi; Joseph Mariani; Asuncion Moreno; Nicoletta Calzolari; Jan Odijk; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. p. 3254-3258 (LREC 2018 - 11th International Conference on Language Resources and Evaluation).

Uchida, S, Takada, S & Arase, Y 2019, CEFR-based lexical simplification dataset. in H Isahara, B Maegaard, S Piperidis, C Cieri, T Declerck, K Hasida, H Mazo, K Choukri, S Goggi, J Mariani, A Moreno, N Calzolari, J Odijk & T Tokunaga (eds), LREC 2018 - 11th International Conference on Language Resources and Evaluation. LREC 2018 - 11th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), pp. 3254-3258, 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, 7 May 2018.
Uchida S, Takada S, Arase Y. CEFR-based lexical simplification dataset. In Isahara H, Maegaard B, Piperidis S, Cieri C, Declerck T, Hasida K, Mazo H, Choukri K, Goggi S, Mariani J, Moreno A, Calzolari N, Odijk J, Tokunaga T, editors, LREC 2018 - 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). 2019. p. 3254-3258. (LREC 2018 - 11th International Conference on Language Resources and Evaluation).
Uchida, Satoru ; Takada, Shohei ; Arase, Yuki. / CEFR-based lexical simplification dataset. LREC 2018 - 11th International Conference on Language Resources and Evaluation. editor / Hitoshi Isahara ; Bente Maegaard ; Stelios Piperidis ; Christopher Cieri ; Thierry Declerck ; Koiti Hasida ; Helene Mazo ; Khalid Choukri ; Sara Goggi ; Joseph Mariani ; Asuncion Moreno ; Nicoletta Calzolari ; Jan Odijk ; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. pp. 3254-3258 (LREC 2018 - 11th International Conference on Language Resources and Evaluation).
@inproceedings{20777c42323c4836ad0f080cbe955660,
title = "CEFR-based lexical simplification dataset",
abstract = "This study creates CEFR-LS, a dataset for lexical simplification based on the levels of the Common European Framework of Reference for Languages (CEFR). Lexical simplification remains an important task for language learning and education. Several language resources for lexical simplification are available for generating rules and for training simplifiers with machine learning. However, these resources are not tailored to language education: their word levels and candidate lists tend to be subjective. In contrast, the target and candidate words in CEFR-LS are assigned CEFR levels using the CEFR-J wordlists and the English Vocabulary Profile, and candidates are selected using an online thesaurus. Because CEFR is widely used around the world, a simplification method built on this dataset can be applied directly to language education. CEFR-LS currently includes 406 targets and 4912 candidates. To evaluate the validity of CEFR-LS for machine learning, two basic models are employed for selecting candidates, and the results are presented as a reference for future users of the dataset.",
author = "Satoru Uchida and Shohei Takada and Yuki Arase",
year = "2019",
month = "1",
day = "1",
language = "English",
series = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",
pages = "3254--3258",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",

}

TY - GEN

T1 - CEFR-based lexical simplification dataset

AU - Uchida, Satoru

AU - Takada, Shohei

AU - Arase, Yuki

PY - 2019/1/1

Y1 - 2019/1/1

AB - This study creates CEFR-LS, a dataset for lexical simplification based on the levels of the Common European Framework of Reference for Languages (CEFR). Lexical simplification remains an important task for language learning and education. Several language resources for lexical simplification are available for generating rules and for training simplifiers with machine learning. However, these resources are not tailored to language education: their word levels and candidate lists tend to be subjective. In contrast, the target and candidate words in CEFR-LS are assigned CEFR levels using the CEFR-J wordlists and the English Vocabulary Profile, and candidates are selected using an online thesaurus. Because CEFR is widely used around the world, a simplification method built on this dataset can be applied directly to language education. CEFR-LS currently includes 406 targets and 4912 candidates. To evaluate the validity of CEFR-LS for machine learning, two basic models are employed for selecting candidates, and the results are presented as a reference for future users of the dataset.

UR - http://www.scopus.com/inward/record.url?scp=85059901092&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059901092&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85059901092

T3 - LREC 2018 - 11th International Conference on Language Resources and Evaluation

SP - 3254

EP - 3258

BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Piperidis, Stelios

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Hasida, Koiti

A2 - Mazo, Helene

A2 - Choukri, Khalid

A2 - Goggi, Sara

A2 - Mariani, Joseph

A2 - Moreno, Asuncion

A2 - Calzolari, Nicoletta

A2 - Odijk, Jan

A2 - Tokunaga, Takenobu

PB - European Language Resources Association (ELRA)

ER -