An information theoretic approach to detection of minority subsets

Shin Ando, Jun Sakuma, Einoshin Suzuki, Shigenobu Kobayashi

Research output: Contribution to journalArticle

Abstract

Unsupervised learning techniques, e.g. clustering, is useful for obtaining a summary of a dataset. However, its application to large databases can be computationally expensive. Alternatively, useful information can also be retrieved from its subsets in a more efficient yet effective manner. This paper addresses the problem of finding a small subset of minority instances whose distribution significantly differs from that of the majority. Generally, such a subset can substantially overlap with the majority, which is problematic for conventional estimation of distribution. This paper proposes a new approach for estimating a minority distribution based on Information Theoretic Framework, an extension of the Rate Distortion Theory for unsupervised learning tasks. Specifically, the proposed method (a) estimates parameters which maximize the divergence between the minority and majority distributions, (b) penalizes the redundancy of data expression based on the mutual information between the observed and hidden variables, and (c) employs a hard assignment approximation to avoid computation of trivial conditional probabilities. The algorithm of the proposed method has no problem-dependent parameter and its time and space complexities are linear to the size of the minority subset. Experiments using artificial datasets show the proposed method yields significantly high precision and sensitivity in detecting minority subsets which substantially overlaps with the majority. The proposed method also substantially outperforms one-class classification and mixture estimation methods in real-world benchmark datasets for text and satellite imagery classification.

Original languageEnglish
Pages (from-to)311-321
Number of pages11
JournalTransactions of the Japanese Society for Artificial Intelligence
Volume22
Issue number3
DOIs
Publication statusPublished - Jan 1 2007

Fingerprint

Unsupervised learning
Satellite imagery
Redundancy
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

An information theoretic approach to detection of minority subsets. / Ando, Shin; Sakuma, Jun; Suzuki, Einoshin; Kobayashi, Shigenobu.

In: Transactions of the Japanese Society for Artificial Intelligence, Vol. 22, No. 3, 01.01.2007, p. 311-321.

Research output: Contribution to journalArticle

@article{595295132cc54a778309beccf709e7af,
title = "An information theoretic approach to detection of minority subsets",
abstract = "Unsupervised learning techniques, e.g. clustering, is useful for obtaining a summary of a dataset. However, its application to large databases can be computationally expensive. Alternatively, useful information can also be retrieved from its subsets in a more efficient yet effective manner. This paper addresses the problem of finding a small subset of minority instances whose distribution significantly differs from that of the majority. Generally, such a subset can substantially overlap with the majority, which is problematic for conventional estimation of distribution. This paper proposes a new approach for estimating a minority distribution based on Information Theoretic Framework, an extension of the Rate Distortion Theory for unsupervised learning tasks. Specifically, the proposed method (a) estimates parameters which maximize the divergence between the minority and majority distributions, (b) penalizes the redundancy of data expression based on the mutual information between the observed and hidden variables, and (c) employs a hard assignment approximation to avoid computation of trivial conditional probabilities. The algorithm of the proposed method has no problem-dependent parameter and its time and space complexities are linear to the size of the minority subset. Experiments using artificial datasets show the proposed method yields significantly high precision and sensitivity in detecting minority subsets which substantially overlaps with the majority. The proposed method also substantially outperforms one-class classification and mixture estimation methods in real-world benchmark datasets for text and satellite imagery classification.",
author = "Shin Ando and Jun Sakuma and Einoshin Suzuki and Shigenobu Kobayashi",
year = "2007",
month = "1",
day = "1",
doi = "10.1527/tjsai.22.311",
language = "English",
volume = "22",
pages = "311--321",
journal = "Transactions of the Japanese Society for Artificial Intelligence",
issn = "1346-0714",
publisher = "Japanese Society for Artificial Intelligence",
number = "3",

}

TY - JOUR

T1 - An information theoretic approach to detection of minority subsets

AU - Ando, Shin

AU - Sakuma, Jun

AU - Suzuki, Einoshin

AU - Kobayashi, Shigenobu

PY - 2007/1/1

Y1 - 2007/1/1

N2 - Unsupervised learning techniques, e.g. clustering, is useful for obtaining a summary of a dataset. However, its application to large databases can be computationally expensive. Alternatively, useful information can also be retrieved from its subsets in a more efficient yet effective manner. This paper addresses the problem of finding a small subset of minority instances whose distribution significantly differs from that of the majority. Generally, such a subset can substantially overlap with the majority, which is problematic for conventional estimation of distribution. This paper proposes a new approach for estimating a minority distribution based on Information Theoretic Framework, an extension of the Rate Distortion Theory for unsupervised learning tasks. Specifically, the proposed method (a) estimates parameters which maximize the divergence between the minority and majority distributions, (b) penalizes the redundancy of data expression based on the mutual information between the observed and hidden variables, and (c) employs a hard assignment approximation to avoid computation of trivial conditional probabilities. The algorithm of the proposed method has no problem-dependent parameter and its time and space complexities are linear to the size of the minority subset. Experiments using artificial datasets show the proposed method yields significantly high precision and sensitivity in detecting minority subsets which substantially overlaps with the majority. The proposed method also substantially outperforms one-class classification and mixture estimation methods in real-world benchmark datasets for text and satellite imagery classification.

AB - Unsupervised learning techniques, e.g. clustering, is useful for obtaining a summary of a dataset. However, its application to large databases can be computationally expensive. Alternatively, useful information can also be retrieved from its subsets in a more efficient yet effective manner. This paper addresses the problem of finding a small subset of minority instances whose distribution significantly differs from that of the majority. Generally, such a subset can substantially overlap with the majority, which is problematic for conventional estimation of distribution. This paper proposes a new approach for estimating a minority distribution based on Information Theoretic Framework, an extension of the Rate Distortion Theory for unsupervised learning tasks. Specifically, the proposed method (a) estimates parameters which maximize the divergence between the minority and majority distributions, (b) penalizes the redundancy of data expression based on the mutual information between the observed and hidden variables, and (c) employs a hard assignment approximation to avoid computation of trivial conditional probabilities. The algorithm of the proposed method has no problem-dependent parameter and its time and space complexities are linear to the size of the minority subset. Experiments using artificial datasets show the proposed method yields significantly high precision and sensitivity in detecting minority subsets which substantially overlaps with the majority. The proposed method also substantially outperforms one-class classification and mixture estimation methods in real-world benchmark datasets for text and satellite imagery classification.

UR - http://www.scopus.com/inward/record.url?scp=34247496853&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34247496853&partnerID=8YFLogxK

U2 - 10.1527/tjsai.22.311

DO - 10.1527/tjsai.22.311

M3 - Article

AN - SCOPUS:34247496853

VL - 22

SP - 311

EP - 321

JO - Transactions of the Japanese Society for Artificial Intelligence

JF - Transactions of the Japanese Society for Artificial Intelligence

SN - 1346-0714

IS - 3

ER -