RF-GlutarySite

A random forest based predictor for glutarylation sites

Hussam J. Al-Barakati, Hiroto Saigo, Robert H. Newman, Dukka B. Kc

Research output: Contribution to journal › Article

Abstract

Glutarylation, a newly identified posttranslational modification that occurs on lysine residues, has recently emerged as an important regulator of several metabolic and mitochondrial processes. However, the specific sites of modification on individual proteins, as well as the extent of glutarylation throughout the proteome, remain largely uncharacterized. Though informative, proteomic approaches based on mass spectrometry can be expensive, technically challenging and time-consuming. Therefore, the ability to predict glutarylation sites from protein primary sequences can complement proteomics analyses and help researchers study the characteristics and functional consequences of glutarylation. To this end, we used Random Forest (RF) machine learning strategies to identify the physicochemical and sequence-based features that correlated most strongly with glutarylation. We then used these features to develop a novel RF-based method to predict glutarylation sites from primary amino acid sequences. Based on 10-fold cross-validation, the resulting algorithm, termed 'RF-GlutarySite', achieved scores of 75%, 81%, 68% and 0.50 for accuracy (ACC), sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Likewise, on an independent test set, RF-GlutarySite exhibited ACC, SN, SP and MCC scores of 72%, 73%, 70% and 0.43, respectively. Results from both 10-fold cross-validation and the independent test set were on par with or better than those achieved by existing glutarylation site predictors. Notably, RF-GlutarySite achieved the highest SN score among available glutarylation site prediction tools. Consequently, our method has the potential to uncover new glutarylation sites and to facilitate the discovery of relationships between glutarylation and well-known lysine modifications, such as acetylation, methylation and SUMOylation, as well as a number of recently identified lysine modifications, such as malonylation and succinylation.
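
The four scores reported in the abstract follow the standard confusion-matrix definitions, which the short Python sketch below spells out. It is only an illustration: the function name `site_prediction_metrics` and the example counts are assumptions made here, not taken from the paper, and the counts were invented to roughly mirror the reported cross-validation percentages on a hypothetical balanced set of 100 glutarylated and 100 non-glutarylated lysines.

```python
import math


def site_prediction_metrics(tp, tn, fp, fn):
    """Compute ACC, SN, SP and MCC from confusion-matrix counts.

    tp/fn: glutarylated lysines predicted as sites / missed.
    tn/fp: non-glutarylated lysines predicted as non-sites / as sites.
    Assumes at least one positive and one negative example.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total   # accuracy: fraction of all lysines classified correctly
    sn = tp / (tp + fn)       # sensitivity: fraction of true sites recovered
    sp = tn / (tn + fp)       # specificity: fraction of non-sites rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; report 0.0 then.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Hypothetical counts (not data from the paper).
metrics = site_prediction_metrics(tp=81, tn=68, fp=32, fn=19)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike ACC, MCC stays informative when the positive and negative classes differ in size, which is presumably why it is reported alongside the three percentage scores.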

Original language: English
Pages (from-to): 189-204
Number of pages: 16
Journal: Molecular Omics
Volume: 15
Issue number: 3
DOI: 10.1039/c9mo00028c
Publication status: Published - Jan 1 2019

Fingerprint

Lysine
Proteomics
Sumoylation
Sensitivity and Specificity
Acetylation
Methylation
Proteome
Post Translational Protein Processing
Mass spectrometry
Learning systems
Amino Acid Sequence
Proteins
Research Personnel
Amino Acids
Machine Learning

All Science Journal Classification (ASJC) codes

  • Biochemistry
  • Molecular Biology
  • Genetics

Cite this

RF-GlutarySite: A random forest based predictor for glutarylation sites. / Al-Barakati, Hussam J.; Saigo, Hiroto; Newman, Robert H.; Kc, Dukka B.

In: Molecular Omics, Vol. 15, No. 3, 01.01.2019, p. 189-204.

Al-Barakati, Hussam J.; Saigo, Hiroto; Newman, Robert H.; Kc, Dukka B. / RF-GlutarySite: A random forest based predictor for glutarylation sites. In: Molecular Omics. 2019; Vol. 15, No. 3. pp. 189-204.
@article{3becc6b645fc4ca39f8348c04c0948be,
title = "RF-GlutarySite: A random forest based predictor for glutarylation sites",
author = "Al-Barakati, {Hussam J.} and Hiroto Saigo and Newman, {Robert H.} and Kc, {Dukka B.}",
year = "2019",
month = "1",
day = "1",
doi = "10.1039/c9mo00028c",
language = "English",
volume = "15",
pages = "189--204",
journal = "Molecular Omics",
issn = "2515-4184",
publisher = "Royal Society of Chemistry",
number = "3",

}

TY - JOUR

T1 - RF-GlutarySite

T2 - A random forest based predictor for glutarylation sites

AU - Al-Barakati, Hussam J.

AU - Saigo, Hiroto

AU - Newman, Robert H.

AU - Kc, Dukka B.

PY - 2019/1/1

Y1 - 2019/1/1

UR - http://www.scopus.com/inward/record.url?scp=85067119273&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85067119273&partnerID=8YFLogxK

U2 - 10.1039/c9mo00028c

DO - 10.1039/c9mo00028c

M3 - Article

VL - 15

SP - 189

EP - 204

JO - Molecular Omics

JF - Molecular Omics

SN - 2515-4184

IS - 3

ER -