TY - GEN
T1 - Learning Curves for Automating Content Analysis
T2 - 4th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2015
AU - Ishita, Emi
AU - Oard, Douglas W.
AU - Fleischmann, Kenneth R.
AU - Tomiura, Yoichi
AU - Takayama, Yasuhiro
AU - Cheng, An Shou
N1 - Publisher Copyright: © 2015 IEEE. Copyright 2016 Elsevier B.V., All rights reserved.
PY - 2016/1/6
Y1 - 2016/1/6
N2 - In this paper, we explore the potential for reducing human effort when coding text segments for use in content analysis. The key idea is to do some coding by hand, to use the results of that initial effort as training data, and then to code the remainder of the content automatically. The test collection includes 102 written prepared statements about Net neutrality from public hearings held by the U.S. Congress and the U.S. Federal Communications Commission (FCC). Six categories were used in this analysis: wealth, social order, justice, freedom, innovation, and honor. A support vector machine (SVM) classifier and a Naïve Bayes (NB) classifier were trained on manually annotated sentences from between one and 51 documents and tested on a held-out set of 51 documents. The results show that the inflection point for a standard measure of classifier accuracy (F1) occurs early: with only 30 training documents, the SVM classifier reaches at least 85% of its best achievable result and the NB classifier reaches at least 88% of its best achievable result. With the exception of honor, the results show that machine classification could reasonably be scaled up to larger collections of similar documents without additional human annotation effort.
UR - http://www.scopus.com/inward/record.url?scp=84964330694&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84964330694&partnerID=8YFLogxK
U2 - 10.1109/IIAI-AAI.2015.295
DO - 10.1109/IIAI-AAI.2015.295
M3 - Conference contribution
AN - SCOPUS:84964330694
T3 - Proceedings - 2015 IIAI 4th International Congress on Advanced Applied Informatics, IIAI-AAI 2015
SP - 171
EP - 176
BT - Proceedings - 2015 IIAI 4th International Congress on Advanced Applied Informatics, IIAI-AAI 2015
A2 - Hirokawa, Sachio
A2 - Hashimoto, Kiyota
A2 - Matsuo, Tokuro
A2 - Mine, Tsunenori
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 12 July 2015 through 16 July 2015
ER -