Unsupervised spam detection by document complexity estimation

Takashi Uemura, Daisuke Ikeda, Hiroki Arimura

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

13 Citations (Scopus)

Abstract

In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
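The abstract does not define the document complexity, so the following is only a rough illustration of scoring documents with an entropy-like measure and no labels. It is not the authors' DCE algorithm and does not use suffix trees: it substitutes the Shannon entropy of each document's character n-gram distribution, and the names `ngram_entropy` and `rank_by_complexity` as well as the choice of n are assumptions made for this sketch.

```python
import math
from collections import Counter


def ngram_entropy(text: str, n: int = 3) -> float:
    """Shannon entropy (in bits) of the character n-gram distribution of text.

    Assumption: this is a stand-in for an entropy-like complexity score,
    not the document complexity defined in the paper.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0  # text shorter than n: no n-grams to measure
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_by_complexity(docs: list[str], n: int = 3) -> list[tuple[float, int]]:
    """Return (score, document index) pairs sorted from lowest to highest score."""
    return sorted((ngram_entropy(doc, n), i) for i, doc in enumerate(docs))


if __name__ == "__main__":
    docs = ["aaaaaaaa", "abababab", "abcdefgh"]
    for score, index in rank_by_complexity(docs, n=1):
        print(index, round(score, 3))
```

Under this stand-in measure, highly repetitive text scores near zero while text with many distinct n-grams scores high. How scores at either extreme relate to spam is a question the paper addresses with its own measure, and this sketch makes no claim about it.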

Original language: English
Title of host publication: Discovery Science - 11th International Conference, DS 2008, Proceedings
Pages: 319-331
Number of pages: 13
DOI: https://doi.org/10.1007/978-3-540-88411-8-30
Publication status: Published - Dec 1 2008
Event: 11th International Conference on Discovery Science, DS 2008 - Budapest, Hungary
Duration: Oct 13 2008 to Oct 16 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5255 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 11th International Conference on Discovery Science, DS 2008
Country: Hungary
City: Budapest
Period: 10/13/08 to 10/16/08

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

  • Cite this

    Uemura, T., Ikeda, D., & Arimura, H. (2008). Unsupervised spam detection by document complexity estimation. In Discovery Science - 11th International Conference, DS 2008, Proceedings (pp. 319-331). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5255 LNAI). https://doi.org/10.1007/978-3-540-88411-8-30