TY - GEN
T1 - A comparative study on outlier removal from a large-scale dataset using unsupervised anomaly detection
AU - Goldstein, Markus
AU - Uchida, Seiichi
N1 - Publisher Copyright: © Copyright 2016 by SCITEPRESS-Science and Technology Publications, Lda. All rights reserved.
PY - 2016
Y1 - 2016
AB - Outlier removal from training data is a classical problem in pattern recognition. This problem has become more important for large-scale datasets for two reasons. First, there is a higher risk of "unexpected" outliers, such as mislabeled training data. Second, a large-scale dataset makes it more difficult to grasp the distribution of outliers. At the same time, many unsupervised anomaly detection methods have been proposed that can also be used for outlier removal. In this paper, we present a comparative study of nine different anomaly detection methods in the scenario of outlier removal from a large-scale dataset. To observe performance accurately, we need a simple and describable recognition procedure and therefore use a nearest-neighbor-based classifier. As an adequate large-scale dataset, we prepared a handwritten digit dataset comprising more than 800,000 manually labeled samples. With a data dimensionality of 16×16=256, each digit class is ensured to have at least 100 times more instances than the data dimensionality. The experimental results show that the common understanding that outlier removal improves classification performance on small datasets does not hold for high-dimensional large-scale datasets. Additionally, local anomaly detection algorithms were found to perform better on this data than their global equivalents.
UR - http://www.scopus.com/inward/record.url?scp=84970005866&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84970005866&partnerID=8YFLogxK
U2 - 10.5220/0005701302630269
DO - 10.5220/0005701302630269
M3 - Conference contribution
AN - SCOPUS:84970005866
T3 - ICPRAM 2016 - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods
SP - 263
EP - 269
BT - ICPRAM 2016 - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods
A2 - De Marsico, Maria
A2 - di Baja, Gabriella Sanniti
A2 - Fred, Ana
PB - SciTePress
T2 - 5th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2016
Y2 - 24 February 2016 through 26 February 2016
ER -