Unsupervised spam detection by document probability estimation with maximal overlap method

Takashi Uemura, Daisuke Ikeda, Takuya Kida, Hiroki Arimura

Research output: Contribution to journal (Article)

Abstract

In this paper, we study content-based detection of spam generated by copying a seed document with random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is an idealized measure, like Kolmogorov complexity, we use an estimated occurrence probability of each document in its place. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically.
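The idea in the abstract, that a document with many near-copies in the collection has a high estimated occurrence probability and therefore a low complexity, can be illustrated with a rough sketch. The code below is not the paper's maximal overlap method and does not run in linear time; it substitutes a simple leave-one-out character n-gram model with add-one smoothing as a stand-in estimator. The function names, the n-gram length, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram across the given documents."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def doc_surprisal(docs, n=4):
    """Average bits per n-gram of each document under a model built from
    the OTHER documents (leave-one-out, add-one smoothing).

    A low value means the rest of the collection predicts the document
    well, i.e. similar documents exist. This is a hypothetical stand-in
    for the estimated document probability described in the abstract.
    """
    total = ngram_counts(docs, n)
    grand = sum(total.values())
    vocab = len(total)
    scores = []
    for doc in docs:
        own = ngram_counts([doc], n)
        m = sum(own.values())
        if m == 0:
            # Document shorter than n: no evidence, treat as maximally complex.
            scores.append(float("inf"))
            continue
        rest = grand - m  # n-gram occurrences outside this document
        bits = 0.0
        for gram, c in own.items():
            elsewhere = total[gram] - c  # occurrences in the other documents
            p = (elsewhere + 1) / (rest + vocab)
            bits += c * -math.log2(p)
        scores.append(bits / m)
    return scores


# Illustrative collection: three perturbed copies of one seed, two unrelated texts.
docs = [
    "buy cheap watches online today best price guaranteed",
    "buy cheap watches online today best price guaranteex",
    "buy cheap watchez online today best price guaranteed",
    "the meeting is moved to thursday afternoon in room five",
    "please review my pull request for the parser module",
]
scores = doc_surprisal(docs)
for doc, score in zip(docs, scores):
    print(f"{score:6.2f}  {doc}")
```

In this toy collection the three perturbed copies receive lower scores than the two unrelated documents, so sorting by score in ascending order surfaces the copy-generated candidates first. Where to place a cutoff is left open here.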

Original language: English
Pages (from-to): 297-306
Number of pages: 10
Journal: Transactions of the Japanese Society for Artificial Intelligence
Volume: 26
Issue number: 1
DOI: 10.1527/tjsai.26.297
Publication status: Published - Jan 13 2011

Fingerprint

Copying
Seed
Entropy

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

Unsupervised spam detection by document probability estimation with maximal overlap method. / Uemura, Takashi; Ikeda, Daisuke; Kida, Takuya; Arimura, Hiroki.

In: Transactions of the Japanese Society for Artificial Intelligence, Vol. 26, No. 1, 13.01.2011, p. 297-306.


@article{856c9c0042f54b51b727d601d58fdc4a,
title = "Unsupervised spam detection by document probability estimation with maximal overlap method",
abstract = "In this paper, we study content-based detection of spam generated by copying a seed document with random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is an idealized measure, like Kolmogorov complexity, we use an estimated occurrence probability of each document in its place. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically.",
author = "Takashi Uemura and Daisuke Ikeda and Takuya Kida and Hiroki Arimura",
year = "2011",
month = "1",
day = "13",
doi = "10.1527/tjsai.26.297",
language = "English",
volume = "26",
pages = "297--306",
journal = "Transactions of the Japanese Society for Artificial Intelligence",
issn = "1346-0714",
publisher = "Japanese Society for Artificial Intelligence",
number = "1",

}

TY - JOUR

T1 - Unsupervised spam detection by document probability estimation with maximal overlap method

AU - Uemura, Takashi

AU - Ikeda, Daisuke

AU - Kida, Takuya

AU - Arimura, Hiroki

PY - 2011/1/13

Y1 - 2011/1/13

AB - In this paper, we study content-based detection of spam generated by copying a seed document with random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is an idealized measure, like Kolmogorov complexity, we use an estimated occurrence probability of each document in its place. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically.

UR - http://www.scopus.com/inward/record.url?scp=78651113001&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78651113001&partnerID=8YFLogxK

U2 - 10.1527/tjsai.26.297

DO - 10.1527/tjsai.26.297

M3 - Article

AN - SCOPUS:78651113001

VL - 26

SP - 297

EP - 306

JO - Transactions of the Japanese Society for Artificial Intelligence

JF - Transactions of the Japanese Society for Artificial Intelligence

SN - 1346-0714

IS - 1

ER -