TY - GEN
T1 - Challenges in classifying privacy policies by machine learning with word-based features
AU - Fukushima, Keishiro
AU - Ikeda, Daisuke
AU - Nakamura, Toru
AU - Kiyomoto, Shinsaku
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/3/16
Y1 - 2018/3/16
AB - In this paper, we discuss the challenges of automatically classifying privacy policies using machine learning with words as features. Because privacy policies are difficult for the general public to understand, readers need support in interpreting them. We believe machine learning is a promising approach because users can grasp the meaning of a policy through the output of a machine learning algorithm. Our final goal is to develop a system that automatically translates privacy policies into privacy labels [1]. Toward this goal, we classify sentences in privacy policies with category labels using popular machine learning algorithms, such as a naive Bayes classifier. We choose these algorithms because their trained classifiers can be used to evaluate which keywords are appropriate for privacy labels; for this reason, we adopt words as the features. Experimental results show about 85% accuracy, and we think much higher accuracy is necessary to achieve our final goal. By changing the learning settings, we identified one reason for the low accuracy: privacy policies include many sentences that do not directly describe information about the categories. Such sentences seem redundant, but they may be essential in legal documents to prevent misinterpretation. Thus, it is important for machine learning algorithms to handle these redundant sentences appropriately.
UR - http://www.scopus.com/inward/record.url?scp=85052022874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052022874&partnerID=8YFLogxK
U2 - 10.1145/3199478.3199486
DO - 10.1145/3199478.3199486
M3 - Conference contribution
AN - SCOPUS:85052022874
T3 - ACM International Conference Proceeding Series
SP - 62
EP - 66
BT - Proceedings of 2018 the 2nd International Conference on Cryptography, Security and Privacy, ICCSP 2018
PB - Association for Computing Machinery
T2 - 2nd International Conference on Cryptography, Security and Privacy, ICCSP 2018
Y2 - 16 March 2018 through 18 March 2018
ER -