TY - JOUR
T1 - DDBJ data analysis challenge
T2 - A machine learning competition to predict Arabidopsis chromatin feature annotations from DNA sequences
AU - Kaminuma, Eli
AU - Baba, Yukino
AU - Mochizuki, Masahiro
AU - Matsumoto, Hirotaka
AU - Ozaki, Haruka
AU - Okayama, Toshitsugu
AU - Kato, Takuya
AU - Oki, Shinya
AU - Fujisawa, Takatomo
AU - Nakamura, Yasukazu
AU - Arita, Masanori
AU - Ogasawara, Osamu
AU - Kashima, Hisashi
AU - Takagi, Toshihisa
N1 - Funding Information:
We thank all of the participants in the competition: orion, Ryota, tsukasa, emn, extraterrestrial Fuun species, ηzw, mkoido, ksh, AoYu@Tohoku, hiro, emihat, MorikawaH, hmt-yamamoto, tonets, suzudora, take2, bicycle1885, morizo, forester, doiyasan, yudai, tag, nwatarai, soki, himkt, saoki, tsunechan, Ken, A.K, singular0316, IK, yk_tani, yota0000. We are also grateful to Ayako Oka, Yasuhiro Tanizawa, Takako Mochizuki, Fumi Hayashi, Naoko Sakamoto and Tarzo Ohta for their support in preparing the datasets for the task, and to Fumitaka Otobe, Takuya Ohtani, Hikari Amano, Takafumi Ohbiraki, Yuji Ashizawa, Tomohiko Yasuda, Naofumi Ishikawa, Tomohiro Hirai, Tomoka Watanabe, Chiharu Kawagoe, Emi Yokoyama, Kimiko Suzuki and Junko Kohira for their computational infrastructure and management support. Data analysis was partially performed using the Research Organization of Information and Systems (ROIS) NIG Supercomputer System. This research was partially supported by management expenses grants from the DNA Data Bank of Japan, the ROIS, and JST CREST Grant Number JPMJCR1501, Japan.
Publisher Copyright:
© 2020, Genetics Society of Japan. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Recently, the prospect of applying machine learning tools for automating the process of annotation analysis of large-scale sequences from next-generation sequencers has raised the interest of researchers. However, finding research collaborators with knowledge of machine learning techniques is difficult for many experimental life scientists. One solution to this problem is to utilise the power of crowdsourcing. In this report, we describe how we investigated the potential of crowdsourced modelling for a life science task by conducting a machine learning competition, the DNA Data Bank of Japan (DDBJ) Data Analysis Challenge. In the challenge, participants predicted chromatin feature annotations from DNA sequences with competing models. The challenge engaged 38 participants, with a cumulative total of 360 model submissions. The performance of the top model resulted in an area under the curve (AUC) score of 0.95. Over the course of the competition, the overall performance of the submitted models improved by an AUC score of 0.30 from the first submitted model. Furthermore, the 1st- and 2nd-ranking models utilised external data such as genomic location and gene annotation information with specific domain knowledge. The effect of incorporating this domain knowledge led to improvements of approximately 5%–9%, as measured by the AUC scores. This report suggests that machine learning competitions will lead to the development of highly accurate machine learning models for use by experimental scientists unfamiliar with the complexities of data science.
AB - Recently, the prospect of applying machine learning tools for automating the process of annotation analysis of large-scale sequences from next-generation sequencers has raised the interest of researchers. However, finding research collaborators with knowledge of machine learning techniques is difficult for many experimental life scientists. One solution to this problem is to utilise the power of crowdsourcing. In this report, we describe how we investigated the potential of crowdsourced modelling for a life science task by conducting a machine learning competition, the DNA Data Bank of Japan (DDBJ) Data Analysis Challenge. In the challenge, participants predicted chromatin feature annotations from DNA sequences with competing models. The challenge engaged 38 participants, with a cumulative total of 360 model submissions. The performance of the top model resulted in an area under the curve (AUC) score of 0.95. Over the course of the competition, the overall performance of the submitted models improved by an AUC score of 0.30 from the first submitted model. Furthermore, the 1st- and 2nd-ranking models utilised external data such as genomic location and gene annotation information with specific domain knowledge. The effect of incorporating this domain knowledge led to improvements of approximately 5%–9%, as measured by the AUC scores. This report suggests that machine learning competitions will lead to the development of highly accurate machine learning models for use by experimental scientists unfamiliar with the complexities of data science.
UR - http://www.scopus.com/inward/record.url?scp=85083731513&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083731513&partnerID=8YFLogxK
U2 - 10.1266/ggs.19-00034
DO - 10.1266/ggs.19-00034
M3 - Article
C2 - 32213716
AN - SCOPUS:85083731513
VL - 95
SP - 43
EP - 50
JO - Genes and Genetic Systems
JF - Genes and Genetic Systems
SN - 1341-7568
IS - 1
ER -