DDBJ data analysis challenge: A machine learning competition to predict Arabidopsis chromatin feature annotations from DNA sequences

Eli Kaminuma, Yukino Baba, Masahiro Mochizuki, Hirotaka Matsumoto, Haruka Ozaki, Toshitsugu Okayama, Takuya Kato, Shinya Oki, Takatomo Fujisawa, Yasukazu Nakamura, Masanori Arita, Osamu Ogasawara, Hisashi Kashima, Toshihisa Takagi

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Recently, the prospect of applying machine learning tools to automate the annotation analysis of large-scale sequences from next-generation sequencers has raised the interest of researchers. However, finding research collaborators with knowledge of machine learning techniques is difficult for many experimental life scientists. One solution to this problem is to utilise the power of crowdsourcing. In this report, we describe how we investigated the potential of crowdsourced modelling for a life science task by conducting a machine learning competition, the DNA Data Bank of Japan (DDBJ) Data Analysis Challenge. In the challenge, participants predicted chromatin feature annotations from DNA sequences with competing models. The challenge engaged 38 participants, with a cumulative total of 360 model submissions. The top model achieved an area under the curve (AUC) score of 0.95. Over the course of the competition, the overall performance of the submitted models improved by 0.30 in AUC score relative to the first submitted model. Furthermore, the 1st- and 2nd-ranking models utilised external data such as genomic location and gene annotation information with specific domain knowledge. Incorporating this domain knowledge led to improvements of approximately 5%–9%, as measured by the AUC scores. This report suggests that machine learning competitions will lead to the development of highly accurate machine learning models for use by experimental scientists unfamiliar with the complexities of data science.

Original language: English
Pages (from-to): 43-50
Number of pages: 8
Journal: Genes and Genetic Systems
Volume: 95
Issue number: 1
DOI
Publication status: Published - 2020

All Science Journal Classification (ASJC) codes

  • Molecular Biology
  • Genetics
