TY - GEN

T1 - On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms

AU - Yamanishi, Kenji

AU - Takeuchi, Jun Ichi

AU - Williams, Graham

PY - 2000

Y1 - 2000

N2 - Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter, which we abbreviate as SS, is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SS and empirically demonstrates its effectiveness. SS detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input, SS employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model, with a high score indicating a high possibility of being a statistical outlier. The novel features of SS are: 1) it is adaptive to non-stationary sources of data; 2) a score has a clear statistical/information-theoretic meaning; 3) it is computationally inexpensive; and 4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SS was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.

AB - Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter, which we abbreviate as SS, is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SS and empirically demonstrates its effectiveness. SS detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input, SS employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model, with a high score indicating a high possibility of being a statistical outlier. The novel features of SS are: 1) it is adaptive to non-stationary sources of data; 2) a score has a clear statistical/information-theoretic meaning; 3) it is computationally inexpensive; and 4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SS was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.

UR - http://www.scopus.com/inward/record.url?scp=0034592923&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0034592923&partnerID=8YFLogxK

U2 - 10.1145/347090.347160

DO - 10.1145/347090.347160

M3 - Conference contribution

AN - SCOPUS:0034592923

SN - 1581132336

SN - 9781581132335

T3 - Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 320

EP - 324

BT - Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

A2 - Ramakrishnan, R.

A2 - Stolfo, S.

A2 - Bayardo, R.

A2 - Parsa, I.

PB - Association for Computing Machinery (ACM)

T2 - Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000)

Y2 - 20 August 2000 through 23 August 2000

ER -