Document clustering is a practical and powerful data mining technique for analyzing large collections of text or hypertext documents. However, it also risks leaking sensitive information, especially when it is executed in a distributed environment. In this paper, we propose a cryptography-based framework for privacy-preserving document clustering in a distributed environment: two parties, each holding a private document database, want to collaboratively execute agglomerative document clustering without disclosing their private contents. We provide two implementations of this framework. One offers higher precision and stronger security but requires more computational resources; the other is a simplified version with lower computational complexity and higher processing speed. Additionally, we provide security proofs and an experimental analysis of the precision and scalability of our proposal.
All Science Journal Classification (ASJC) codes
- Information Systems
- Computer Networks and Communications