Unsupervised spam detection by document complexity estimation

Takashi Uemura, Daisuke Ikeda, Hiroki Arimura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)

Abstract

In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
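The abstract does not define the document complexity, so the following is only a rough illustration of what an entropy-like, unsupervised score over a document collection can look like. It is not the DCE algorithm and does not use suffix trees. It assumes a character n-gram model with add-one smoothing, estimated for each document from all the other documents. The function names, the n-gram order, and the alphabet size are hypothetical choices made for this sketch.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every length-n substring of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def context_counts(grams):
    """Sum n-gram counts by their (n-1)-character prefix."""
    ctx = Counter()
    for gram, count in grams.items():
        ctx[gram[:-1]] += count
    return ctx


def document_complexities(docs, n=3, alphabet_size=256):
    """Return bits per n-gram for each document under a smoothed n-gram
    model estimated from all *other* documents (leave-one-out).

    Illustrative stand-in only; not the measure defined in the paper.
    """
    grams = [ngram_counts(d, n) for d in docs]
    ctxs = [context_counts(g) for g in grams]
    total_grams, total_ctx = Counter(), Counter()
    for g, c in zip(grams, ctxs):
        total_grams.update(g)
        total_ctx.update(c)

    scores = []
    for g, c in zip(grams, ctxs):
        m = sum(g.values())
        if m == 0:  # document shorter than n characters
            scores.append(0.0)
            continue
        bits = 0.0
        for gram, count in g.items():
            # Remove this document's own contribution from the totals.
            seen = total_grams[gram] - count
            seen_ctx = total_ctx[gram[:-1]] - c[gram[:-1]]
            p = (seen + 1) / (seen_ctx + alphabet_size)
            bits -= count * math.log2(p)
        scores.append(bits / m)
    return scores


docs = [
    "buy cheap pills now buy cheap pills",
    "buy cheap pills now buy cheap pills today",
    "the quick brown fox jumps over the lazy dog",
]
scores = document_complexities(docs)
```

In this toy collection the two near-duplicate documents receive lower scores than the unrelated one, because most of their substrings also occur elsewhere in the collection. Whether low or high scores should be flagged as spam, and at what threshold, is left open here; the abstract does not say.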

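Original language: English
Title of host publication: Discovery Science - 11th International Conference, DS 2008, Proceedings
Pages: 319-331
Number of pages: 13
DOI: https://doi.org/10.1007/978-3-540-88411-8-30
Publication status: Published - Dec 1 2008
Event: 11th International Conference on Discovery Science, DS 2008 - Budapest, Hungary
Duration: Oct 13 2008 to Oct 16 2008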

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5255 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 11th International Conference on Discovery Science, DS 2008
Country: Hungary
City: Budapest
Period: 10/13/08 to 10/16/08

Fingerprint

Spam
Bulletin boards
Suffix Tree
Blogs
Linear Time
Entropy
Experimental Results

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Uemura, T., Ikeda, D., & Arimura, H. (2008). Unsupervised spam detection by document complexity estimation. In Discovery Science - 11th International Conference, DS 2008, Proceedings (pp. 319-331). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5255 LNAI). https://doi.org/10.1007/978-3-540-88411-8-30

Unsupervised spam detection by document complexity estimation. / Uemura, Takashi; Ikeda, Daisuke; Arimura, Hiroki.

Discovery Science - 11th International Conference, DS 2008, Proceedings. 2008. p. 319-331 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5255 LNAI).

Uemura, T, Ikeda, D & Arimura, H 2008, Unsupervised spam detection by document complexity estimation. in Discovery Science - 11th International Conference, DS 2008, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5255 LNAI, pp. 319-331, 11th International Conference on Discovery Science, DS 2008, Budapest, Hungary, 10/13/08. https://doi.org/10.1007/978-3-540-88411-8-30
Uemura T, Ikeda D, Arimura H. Unsupervised spam detection by document complexity estimation. In Discovery Science - 11th International Conference, DS 2008, Proceedings. 2008. p. 319-331. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-540-88411-8-30
Uemura, Takashi ; Ikeda, Daisuke ; Arimura, Hiroki. / Unsupervised spam detection by document complexity estimation. Discovery Science - 11th International Conference, DS 2008, Proceedings. 2008. pp. 319-331 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{e034d15c9ae145a3ae2620e839532600,
title = "Unsupervised spam detection by document complexity estimation",
abstract = "In this paper, we study a content-based spam detection for a specific type of spams, called blog and bulletin board spams. We develop an efficient unsupervised algorithm DCE that detects spam documents from a mixture of spam and non-spam documents using an entropy-like measure, called the document complexity. Using suffix trees, the algorithm computes the document complexity for all documents in linear time w.r.t. the total length of input documents. Experimental results showed that our algorithm especially works well for detecting word salad spams, which are believed to be difficult to detect automatically.",
author = "Takashi Uemura and Daisuke Ikeda and Hiroki Arimura",
year = "2008",
month = "12",
day = "1",
doi = "10.1007/978-3-540-88411-8-30",
language = "English",
isbn = "3540884106",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "319--331",
booktitle = "Discovery Science - 11th International Conference, DS 2008, Proceedings",

}

TY  - GEN
T1  - Unsupervised spam detection by document complexity estimation
AU  - Uemura, Takashi
AU  - Ikeda, Daisuke
AU  - Arimura, Hiroki
PY  - 2008/12/1
Y1  - 2008/12/1
N2  - In this paper, we study a content-based spam detection for a specific type of spams, called blog and bulletin board spams. We develop an efficient unsupervised algorithm DCE that detects spam documents from a mixture of spam and non-spam documents using an entropy-like measure, called the document complexity. Using suffix trees, the algorithm computes the document complexity for all documents in linear time w.r.t. the total length of input documents. Experimental results showed that our algorithm especially works well for detecting word salad spams, which are believed to be difficult to detect automatically.
AB  - In this paper, we study a content-based spam detection for a specific type of spams, called blog and bulletin board spams. We develop an efficient unsupervised algorithm DCE that detects spam documents from a mixture of spam and non-spam documents using an entropy-like measure, called the document complexity. Using suffix trees, the algorithm computes the document complexity for all documents in linear time w.r.t. the total length of input documents. Experimental results showed that our algorithm especially works well for detecting word salad spams, which are believed to be difficult to detect automatically.
UR  - http://www.scopus.com/inward/record.url?scp=56749179442&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=56749179442&partnerID=8YFLogxK
U2  - 10.1007/978-3-540-88411-8-30
DO  - 10.1007/978-3-540-88411-8-30
M3  - Conference contribution
SN  - 3540884106
SN  - 9783540884101
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 319
EP  - 331
BT  - Discovery Science - 11th International Conference, DS 2008, Proceedings
ER  -