TY - JOUR
T1 - A novel representation of protein sequences for prediction of subcellular location using support vector machines
AU - Matsuda, Setsuro
AU - Vert, Jean-Philippe
AU - Saigo, Hiroto
AU - Ueda, Nobuhisa
AU - Toh, Hiroyuki
AU - Akutsu, Tatsuya
PY - 2005/11
Y1 - 2005/11
N2 - As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.
UR - http://www.scopus.com/inward/record.url?scp=27644504110&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=27644504110&partnerID=8YFLogxK
U2 - 10.1110/ps.051597405
DO - 10.1110/ps.051597405
M3 - Article
C2 - 16251364
AN - SCOPUS:27644504110
SN - 0961-8368
VL - 14
SP - 2804
EP - 2813
JO - Protein Science
JF - Protein Science
IS - 11
ER -