Unsupervised spam detection by document complexity estimation

Takashi Uemura, Daisuke Ikeda, Hiroki Arimura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
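The idea of an entropy-like, leave-one-out score can be illustrated with a small sketch. This is not the DCE algorithm from the paper: it replaces suffix trees with plain character n-gram counts, so it does not have the linear-time guarantee, and the function names `ngrams` and `document_complexities`, the order `n`, and the add-one smoothing are all assumptions made for illustration. Each document is scored by the average number of bits per n-gram needed to encode it under a model built from all the other documents.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Count every length-n character substring of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def document_complexities(docs, n=3):
    """Hypothetical leave-one-out, entropy-like score per document.

    Returns, for each document, the average bits per n-gram under an
    add-one-smoothed n-gram model estimated from the other documents.
    """
    per_doc = [ngrams(d, n) for d in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    # Alphabet size used for add-one smoothing (assumption of this sketch).
    alphabet = len(set("".join(docs))) or 1

    scores = []
    for own in per_doc:
        # Counts from every document except the current one.
        others = total - own
        # How often each (n-1)-character context occurs in the other documents.
        context = Counter()
        for gram, k in others.items():
            context[gram[:-1]] += k
        m = sum(own.values())
        if m == 0:
            scores.append(0.0)  # document shorter than n characters
            continue
        bits = 0.0
        for gram, k in own.items():
            p = (others[gram] + 1) / (context[gram[:-1]] + alphabet)
            bits -= k * math.log2(p)
        scores.append(bits / m)
    return scores


docs = [
    "buy cheap pills now buy cheap pills now",
    "buy cheap pills now buy cheap pills now",
    "the quick brown fox jumps over the lazy dog",
]
for doc, score in zip(docs, document_complexities(docs)):
    print(f"{score:.2f}  {doc}")
```

In this toy collection the two identical documents receive the same score, which is lower than that of the unrelated sentence, because each copy is highly predictable from the other. Whether low or high scores indicate spam, and where to place a threshold, depends on the definitions in the paper and is not settled by this sketch.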

Original language: English
Title of host publication: Discovery Science - 11th International Conference, DS 2008, Proceedings
Pages: 319-331
Number of pages: 13
DOI: https://doi.org/10.1007/978-3-540-88411-8-30
Publication status: Published - Dec 1 2008
Event: 11th International Conference on Discovery Science, DS 2008 - Budapest, Hungary
Duration: Oct 13 2008 to Oct 16 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5255 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 11th International Conference on Discovery Science, DS 2008
Country: Hungary
City: Budapest
Period: 10/13/08 to 10/16/08

Fingerprint

Spam
Bulletin boards
Suffix Tree
Blogs
Linear Time
Entropy
Experimental Results

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Uemura, T., Ikeda, D., & Arimura, H. (2008). Unsupervised spam detection by document complexity estimation. In Discovery Science - 11th International Conference, DS 2008, Proceedings (pp. 319-331). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5255 LNAI). https://doi.org/10.1007/978-3-540-88411-8-30

@inproceedings{e034d15c9ae145a3ae2620e839532600,
title = "Unsupervised spam detection by document complexity estimation",
abstract = "In this paper, we study a content-based spam detection for a specific type of spams, called blog and bulletin board spams. We develop an efficient unsupervised algorithm DCE that detects spam documents from a mixture of spam and non-spam documents using an entropy-like measure, called the document complexity. Using suffix trees, the algorithm computes the document complexity for all documents in linear time w.r.t. the total length of input documents. Experimental results showed that our algorithm especially works well for detecting word salad spams, which are believed to be difficult to detect automatically.",
author = "Takashi Uemura and Daisuke Ikeda and Hiroki Arimura",
year = "2008",
month = "12",
day = "1",
doi = "10.1007/978-3-540-88411-8-30",
language = "English",
isbn = "3540884106",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "319--331",
booktitle = "Discovery Science - 11th International Conference, DS 2008, Proceedings",

}

TY - GEN

T1 - Unsupervised spam detection by document complexity estimation

AU - Uemura, Takashi

AU - Ikeda, Daisuke

AU - Arimura, Hiroki

PY - 2008/12/1

Y1 - 2008/12/1

AB - In this paper, we study a content-based spam detection for a specific type of spams, called blog and bulletin board spams. We develop an efficient unsupervised algorithm DCE that detects spam documents from a mixture of spam and non-spam documents using an entropy-like measure, called the document complexity. Using suffix trees, the algorithm computes the document complexity for all documents in linear time w.r.t. the total length of input documents. Experimental results showed that our algorithm especially works well for detecting word salad spams, which are believed to be difficult to detect automatically.

UR - http://www.scopus.com/inward/record.url?scp=56749179442&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=56749179442&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-88411-8-30

DO - 10.1007/978-3-540-88411-8-30

M3 - Conference contribution

SN - 3540884106

SN - 9783540884101

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 319

EP - 331

BT - Discovery Science - 11th International Conference, DS 2008, Proceedings

ER -