TY - GEN
T1 - A fast algorithm for constructing phylogenetic trees with application to IoT malware clustering
AU - He, Tianxiang
AU - Han, Chansu
AU - Isawa, Ryoichi
AU - Takahashi, Takeshi
AU - Kijima, Shuji
AU - Takeuchi, Jun’ichi
AU - Nakao, Koji
N1 - Funding Information:
Acknowledgment. The authors wish to thank the IoTPOT team from Yokohama National University for providing the dataset. This research was partially supported by JSPS KAKENHI Grant Number 18H03291.
PY - 2019
Y1 - 2019
N2 - For efficiently handling thousands of malware specimens, we aim to quickly and automatically categorize those into malware families. A solution for this could be the neighbor-joining method using NCD (Normalized Compression Distance) as similarity of malware. It creates a phylogenetic tree of malware based on the NCDs between malware binaries for clustering. However, it is frustratingly slow because it requires (N^2+N)/2 compression attempts for the NCDs, where N is the number of given specimens. For fast clustering, this paper presents an algorithm for efficiently constructing a phylogenetic tree by greatly reducing compression attempts. The key idea to do so is not to construct a tree of N specimens all at once. Instead, it divides N specimens into temporal clusters in advance, constructs a small tree for each temporal cluster, and joins the trees as a united tree. Intuitively, separately constructing small trees requires a much smaller number of compression attempts than (N^2+N)/2. With experiments using 4,109 in-the-wild malware specimens, we confirm that our algorithm achieved clustering 22 times faster than the neighbor-joining method with a good accuracy of 97%.
AB - For efficiently handling thousands of malware specimens, we aim to quickly and automatically categorize those into malware families. A solution for this could be the neighbor-joining method using NCD (Normalized Compression Distance) as similarity of malware. It creates a phylogenetic tree of malware based on the NCDs between malware binaries for clustering. However, it is frustratingly slow because it requires (N^2+N)/2 compression attempts for the NCDs, where N is the number of given specimens. For fast clustering, this paper presents an algorithm for efficiently constructing a phylogenetic tree by greatly reducing compression attempts. The key idea to do so is not to construct a tree of N specimens all at once. Instead, it divides N specimens into temporal clusters in advance, constructs a small tree for each temporal cluster, and joins the trees as a united tree. Intuitively, separately constructing small trees requires a much smaller number of compression attempts than (N^2+N)/2. With experiments using 4,109 in-the-wild malware specimens, we confirm that our algorithm achieved clustering 22 times faster than the neighbor-joining method with a good accuracy of 97%.
UR - http://www.scopus.com/inward/record.url?scp=85077508217&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077508217&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-36708-4_63
DO - 10.1007/978-3-030-36708-4_63
M3 - Conference contribution
AN - SCOPUS:85077508217
SN - 9783030367077
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 766
EP - 778
BT - Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings
A2 - Gedeon, Tom
A2 - Wong, Kok Wai
A2 - Lee, Minho
PB - Springer
T2 - 26th International Conference on Neural Information Processing, ICONIP 2019
Y2 - 12 December 2019 through 15 December 2019
ER -