Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering

Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Research output: Contribution to journal › Article

Abstract

Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) due to the ambiguous class membership of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) though the interpretation of the classifiers is impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their classifiers and to estimate the exact amount of discretization noise to discard from the dataset to avoid the negative impact of such noise.
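The median-split discretization described in the abstract, together with a buffer that flags data points near the threshold as potential discretization noise, can be sketched as follows. This is an illustrative sketch only, not the authors' framework; the function name and the `buffer_pct` parameter are hypothetical.

```python
import numpy as np

def discretize_with_noise_buffer(y, buffer_pct=10):
    """Discretize a continuous dependent variable at the median and
    flag points within +/- buffer_pct percentiles of the threshold
    as potential discretization noise (ambiguous class membership)."""
    y = np.asarray(y, dtype=float)
    threshold = np.median(y)
    labels = (y > threshold).astype(int)  # 1 = above median, 0 = at/below
    lo = np.percentile(y, 50 - buffer_pct)
    hi = np.percentile(y, 50 + buffer_pct)
    noisy = (y >= lo) & (y <= hi)  # points close to the artificial threshold
    return labels, noisy

# Example: for values 1..10 the median is 5.5, so 5 and 6 fall inside
# the 40th-60th percentile buffer and are flagged as potential noise.
labels, noisy = discretize_with_noise_buffer([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Points flagged as noisy could then be discarded before training to estimate how much the noise affects a classifier's performance measures.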

Original language: English
Journal: IEEE Transactions on Software Engineering
DOI: 10.1109/TSE.2019.2924371
Publication status: Published - Jan 1 2019

Fingerprint

  • Learning systems
  • Software engineering
  • Classifiers

All Science Journal Classification (ASJC) codes

  • Software

Cite this

Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering. / Rajbahadur, Gopi Krishnan; Wang, Shaowei; Kamei, Yasutaka; Hassan, Ahmed E.

In: IEEE Transactions on Software Engineering, 01.01.2019.

Research output: Contribution to journal › Article

@article{e63f452caf6049f58c5630b15b819284,
title = "Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering",
abstract = "Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) due to the ambiguous class membership of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) though the interpretation of the classifiers is impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their classifiers and to estimate the exact amount of discretization noise to discard from the dataset to avoid the negative impact of such noise.",
author = "Rajbahadur, {Gopi Krishnan} and Shaowei Wang and Yasutaka Kamei and Hassan, {Ahmed E.}",
year = "2019",
month = "1",
day = "1",
doi = "10.1109/TSE.2019.2924371",
language = "English",
journal = "IEEE Transactions on Software Engineering",
issn = "0098-5589",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering

AU - Rajbahadur, Gopi Krishnan

AU - Wang, Shaowei

AU - Kamei, Yasutaka

AU - Hassan, Ahmed E.

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) due to the ambiguous class membership of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) though the interpretation of the classifiers is impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their classifiers and to estimate the exact amount of discretization noise to discard from the dataset to avoid the negative impact of such noise.

AB - Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., the median). However, such discretization may introduce noise (i.e., discretization noise) due to the ambiguous class membership of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) though the interpretation of the classifiers is impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their classifiers and to estimate the exact amount of discretization noise to discard from the dataset to avoid the negative impact of such noise.

UR - http://www.scopus.com/inward/record.url?scp=85068161118&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068161118&partnerID=8YFLogxK

U2 - 10.1109/TSE.2019.2924371

DO - 10.1109/TSE.2019.2924371

M3 - Article

AN - SCOPUS:85068161118

JO - IEEE Transactions on Software Engineering

JF - IEEE Transactions on Software Engineering

SN - 0098-5589

ER -