An information theoretic approach to detection of minority subsets in database

Shin Ando, Einoshin Suzuki

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

Detection of rare and exceptional occurrences in large-scale databases has become an important practice in the fields of knowledge discovery and information retrieval. Many databases include large amounts of noise or irrelevant data, whose distribution often overlaps with that of the subsets of exceptional data containing useful knowledge. This paper addresses the problem of finding a small subset of "minority" data whose distribution overlaps with, but is exceptional to or inconsistent with, that of the majority of the database. In such cases, conventional distance-based or density-based approaches to outlier detection are ineffective because they depend on the structure of the majority or require critical parameters as a prerequisite. We formalize the task as the estimation of a model of the minority subset that provides a simple description of the subset yet maintains divergence from the model of the majority. This estimation is formalized as a minimization problem within the information-theoretic framework of rate-distortion theory. We further introduce conditions on the majority to derive an objective function that factorizes the properties of the minority and the dependence on the structure of the majority. The proposed method shows improvements over conventional approaches on artificial data and a promising result on a document retrieval problem.
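
For orientation, the generic rate-distortion trade-off that the abstract refers to is commonly written as a Lagrangian minimization over soft assignments. The form below is the textbook formulation only, not the objective derived in the paper. The symbols X (data), T (compressed representation or model index), d (distortion measure), and β (trade-off multiplier) are assumed here for illustration, and the paper's additional terms for divergence from the majority are not reproduced.

```latex
\min_{p(t \mid x)} \; \mathcal{F}\bigl[p(t \mid x)\bigr]
  = I(X;T) + \beta \, \mathbb{E}_{p(x)\,p(t \mid x)}\bigl[d(x,t)\bigr],
\qquad
I(X;T) = \sum_{x,t} p(x)\, p(t \mid x) \log \frac{p(t \mid x)}{p(t)}
```

In this generic reading, a small I(X;T) corresponds to a simple description of the data, while the expected-distortion term penalizes descriptions that fit the data poorly; β sets the balance between the two.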

Original language: English
Title of host publication: Proceedings - Sixth International Conference on Data Mining, ICDM 2006
Pages: 11-20
Number of pages: 10
DOI: https://doi.org/10.1109/ICDM.2006.19
Publication status: Published - Dec 1 2006
Event: 6th International Conference on Data Mining, ICDM 2006 - Hong Kong, China
Duration: Dec 18 2006 to Dec 22 2006

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Other

Other: 6th International Conference on Data Mining, ICDM 2006
Country: China
City: Hong Kong
Period: 12/18/06 to 12/22/06

Fingerprint

Set theory
Information retrieval
Data mining

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Cite this

Ando, S., & Suzuki, E. (2006). An information theoretic approach to detection of minority subsets in database. In Proceedings - Sixth International Conference on Data Mining, ICDM 2006 (pp. 11-20). [4053030] (Proceedings - IEEE International Conference on Data Mining, ICDM). https://doi.org/10.1109/ICDM.2006.19

@inproceedings{2522eb4de32d4a16a8b5da4ffc2ce769,
title = "An information theoretic approach to detection of minority subsets in database",
author = "Shin Ando and Einoshin Suzuki",
year = "2006",
month = "12",
day = "1",
doi = "10.1109/ICDM.2006.19",
language = "English",
isbn = "0769527019",
series = "Proceedings - IEEE International Conference on Data Mining, ICDM",
pages = "11--20",
booktitle = "Proceedings - Sixth International Conference on Data Mining, ICDM 2006",

}

TY - GEN

T1 - An information theoretic approach to detection of minority subsets in database

AU - Ando, Shin

AU - Suzuki, Einoshin

PY - 2006/12/1

Y1 - 2006/12/1

UR - http://www.scopus.com/inward/record.url?scp=34548055703&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34548055703&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2006.19

DO - 10.1109/ICDM.2006.19

M3 - Conference contribution

AN - SCOPUS:34548055703

SN - 0769527019

SN - 9780769527017

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 11

EP - 20

BT - Proceedings - Sixth International Conference on Data Mining, ICDM 2006

ER -