TY - GEN
T1 - Unsupervised spam detection based on string alienness measures
AU - Narisawa, Kazuyuki
AU - Bannai, Hideo
AU - Hatano, Kohei
AU - Takeda, Masayuki
PY - 2007
Y1 - 2007
N2 - We propose an unsupervised method for detecting spam documents in a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness (i.e., how different they are from the others) of substrings within the documents. A document is then classified as spam if it contains a substring that is in an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments conducted on data collected from Japanese web forums show that the method successfully discovers spam.
AB - We propose an unsupervised method for detecting spam documents in a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness (i.e., how different they are from the others) of substrings within the documents. A document is then classified as spam if it contains a substring that is in an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments conducted on data collected from Japanese web forums show that the method successfully discovers spam.
UR - http://www.scopus.com/inward/record.url?scp=38149047603&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38149047603&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-75488-6_16
DO - 10.1007/978-3-540-75488-6_16
M3 - Conference contribution
AN - SCOPUS:38149047603
SN - 9783540754879
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 161
EP - 172
BT - Discovery Science - 10th International Conference, DS 2007, Proceedings
PB - Springer Verlag
T2 - 10th International Conference on Discovery Science, DS 2007
Y2 - 1 October 2007 through 4 October 2007
ER -