TY - GEN
T1 - The effects of over and under sampling on fault-prone module detection
AU - Kamei, Yasutaka
AU - Monden, Akito
AU - Matsumoto, Shinsuke
AU - Kakimoto, Takeshi
AU - Matsumoto, Ken Ichi
PY - 2007/12/1
Y1 - 2007/12/1
N2 - The goal of this paper is to improve the prediction performance of fault-prone module prediction models (fault-proneness models) by employing over- and under-sampling methods, which are preprocessing procedures applied to a fit dataset. Sampling methods are expected to improve prediction performance when the fit dataset is imbalanced, i.e., when there is a large difference between the number of fault-prone modules and the number of not-fault-prone modules. So far, no research has reported the effects of applying sampling methods to fault-proneness models. In this paper, we experimentally evaluated the effects of four sampling methods (random over sampling, synthetic minority over sampling, random under sampling, and one-sided selection) applied to four fault-proneness models (linear discriminant analysis, logistic regression analysis, neural network, and classification tree), using two module sets of industry legacy software. All four sampling methods improved the prediction performance of the linear discriminant and logistic regression models, whereas the neural network and classification tree models did not benefit from them. The improvements in F1-values for the linear discriminant and logistic regression models ranged from 0.078 to 0.224, with a mean of 0.121.
AB - The goal of this paper is to improve the prediction performance of fault-prone module prediction models (fault-proneness models) by employing over- and under-sampling methods, which are preprocessing procedures applied to a fit dataset. Sampling methods are expected to improve prediction performance when the fit dataset is imbalanced, i.e., when there is a large difference between the number of fault-prone modules and the number of not-fault-prone modules. So far, no research has reported the effects of applying sampling methods to fault-proneness models. In this paper, we experimentally evaluated the effects of four sampling methods (random over sampling, synthetic minority over sampling, random under sampling, and one-sided selection) applied to four fault-proneness models (linear discriminant analysis, logistic regression analysis, neural network, and classification tree), using two module sets of industry legacy software. All four sampling methods improved the prediction performance of the linear discriminant and logistic regression models, whereas the neural network and classification tree models did not benefit from them. The improvements in F1-values for the linear discriminant and logistic regression models ranged from 0.078 to 0.224, with a mean of 0.121.
UR - http://www.scopus.com/inward/record.url?scp=47949103719&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=47949103719&partnerID=8YFLogxK
U2 - 10.1109/ESEM.2007.80
DO - 10.1109/ESEM.2007.80
M3 - Conference contribution
AN - SCOPUS:47949103719
SN - 0769528864
SN - 9780769528861
T3 - Proceedings - 1st International Symposium on Empirical Software Engineering and Measurement, ESEM 2007
SP - 196
EP - 204
BT - Proceedings - 1st International Symposium on Empirical Software Engineering and Measurement, ESEM 2007
T2 - 1st International Symposium on Empirical Software Engineering and Measurement, ESEM 2007
Y2 - 20 September 2007 through 21 September 2007
ER -