TY - GEN
T1 - Cross-project Defect Prediction via ASTToken2Vec and BLSTM-based Neural Network
AU - Li, Hao
AU - Li, Xiaohong
AU - Chen, Xiang
AU - Xie, Xiaofei
AU - Mu, Yanzhou
AU - Feng, Zhiyong
N1 - Funding Information:
Xiaohong Li and Xiang Chen are the corresponding authors. This work is supported in part by National Natural Science Foundation of China (Nos. 61572349, 61872262), National Key R&D Program of China (No. 2017YFB1401200) and Jiangsu Government Scholarship for Overseas Studies.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - Cross-project defect prediction (CPDP) as a means to focus quality assurance of software projects was under heavy investigation in recent years. In this paper, we propose a novel CPDP approach via deep learning. In particular, we model each program module via simplified abstract syntax tree (S-AST). For each node in S-AST, only the project-independent node type is remained and other project-specific information (such as name of variable and method) is ignored, so that the modeling method is project-independent and suitable for CPDP issue. Then we extract token sequences from program modules modeled as S-AST. In addition, to construct meaningful vector representations for token sequences, we propose a novel unsupervised embedding method ASTToken2Vec, which learns semantic information from S-AST's natural structure. Finally, we use BLSTM (bi-directional long short-term memory) based neural network to automatically learn semantic features from vectorized token sequences and construct CPDP models. In our empirical studies, 10 real large-scale open source Java projects are chosen as our empirical subjects. Final results show that our proposed CPDP approach can perform significantly better than 5 state-of-the-art CPDP baselines in terms of AUC.
AB - Cross-project defect prediction (CPDP) as a means to focus quality assurance of software projects was under heavy investigation in recent years. In this paper, we propose a novel CPDP approach via deep learning. In particular, we model each program module via simplified abstract syntax tree (S-AST). For each node in S-AST, only the project-independent node type is remained and other project-specific information (such as name of variable and method) is ignored, so that the modeling method is project-independent and suitable for CPDP issue. Then we extract token sequences from program modules modeled as S-AST. In addition, to construct meaningful vector representations for token sequences, we propose a novel unsupervised embedding method ASTToken2Vec, which learns semantic information from S-AST's natural structure. Finally, we use BLSTM (bi-directional long short-term memory) based neural network to automatically learn semantic features from vectorized token sequences and construct CPDP models. In our empirical studies, 10 real large-scale open source Java projects are chosen as our empirical subjects. Final results show that our proposed CPDP approach can perform significantly better than 5 state-of-the-art CPDP baselines in terms of AUC.
UR - http://www.scopus.com/inward/record.url?scp=85073193218&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073193218&partnerID=8YFLogxK
U2 - 10.1109/IJCNN.2019.8852135
DO - 10.1109/IJCNN.2019.8852135
M3 - Conference contribution
AN - SCOPUS:85073193218
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2019 International Joint Conference on Neural Networks, IJCNN 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 International Joint Conference on Neural Networks, IJCNN 2019
Y2 - 14 July 2019 through 19 July 2019
ER -