Unsupervised spam detection based on string alienness measures

Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano, Masayuki Takeda

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)


We propose an unsupervised method for detecting spam documents from a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness (i.e., how different they are from others) of substrings within the documents. A document is then classified as spam if it contains a substring that is in an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments conducted on data collected from Japanese web forums show that the method successfully discovers spam.
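The classification rule in the abstract (group substrings into equivalence classes, score each class, flag documents that contain a high-scoring class) can be illustrated with a toy sketch. The abstract does not define the equivalence relation or the three alienness measures, so everything below is an assumption made for illustration only: substrings are restricted to fixed-length character n-grams, two n-grams are treated as equivalent when they occur in exactly the same set of documents, and the score (shared document count times class size) is a hypothetical stand-in, not one of the paper's measures. The names `char_ngrams`, `flag_spam`, `n`, and `threshold` are likewise invented here.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return the set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def flag_spam(docs, n=8, threshold=10):
    """Return sorted indices of documents flagged by the toy rule."""
    # Map each n-gram to the set of documents that contain it.
    occurrences = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in char_ngrams(text, n):
            occurrences[gram].add(doc_id)

    # Assumed equivalence relation: n-grams with the same document set
    # fall into the same class.
    classes = defaultdict(set)
    for gram, doc_ids in occurrences.items():
        classes[frozenset(doc_ids)].add(gram)

    # Hypothetical stand-in score: classes shared by several documents
    # and containing many n-grams (long repeated text) score high;
    # classes confined to one document score zero.
    flagged = set()
    for doc_ids, grams in classes.items():
        score = (len(doc_ids) - 1) * len(grams)
        if score >= threshold:
            flagged |= doc_ids
    return sorted(flagged)


docs = [
    "hello everyone, nice weather today",
    "buy cheap pills now at example dot com!!",
    "does anyone know a good ramen place",
    "buy cheap pills now at example dot com!! hurry",
    "I think the new album is great",
]
print(flag_spam(docs))
```

Because the sketch works on raw characters and uses no labels or dictionaries, it shares the unsupervised and language-independent flavor described in the abstract, but it makes no attempt to reproduce the paper's scalability or its actual measures.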

Original language: English
Title of host publication: Discovery Science - 10th International Conference, DS 2007, Proceedings
Publisher: Springer Verlag
Number of pages: 12
ISBN (Print): 9783540754879
Publication status: Published - 2007
Event: 10th International Conference on Discovery Science, DS 2007 - Sendai, Japan
Duration: Oct 1 2007 to Oct 4 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4755 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 10th International Conference on Discovery Science, DS 2007

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

