TY - JOUR
T1 - Quantitative Evaluation System for Online Meetings Based on Multimodal Microbehavior Analysis
AU - Chen, Chenhao
AU - Arakawa, Yutaka
AU - Watanabe, Ko
AU - Ishimaru, Shoya
N1 - Publisher Copyright:
© 2022 MYU Scientific Publishing Division. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Maintaining a positive interaction is the key to a healthy and efficient meeting. Aiming to improve the quality of online meetings, we present an end-to-end neural-network-based system, named MeetingPipe, which is capable of quantitative microbehavior detection (smiling, nodding, and speaking) from recorded meeting videos. For smile detection, we build a neural network framework that consists of an 18-layer residual network for feature representation and a self-attention layer to explore the correlations among receptive fields. To perform nodding detection, we obtain head rotation data as the key nodding feature. Then we use a gated recurrent unit followed by a squeeze-and-excitation mechanism to capture the temporal information of nodding patterns from head pitch angles. In addition, we utilize TalkNet, an active speaker detection model, which can effectively recognize active speakers from videos. Experiments demonstrate that with K-fold cross-validation, the F1 scores of smile, nodding, and speaking detection are 97.34, 81.26, and 94.90%, respectively. The processing can be accelerated with multiple GPUs owing to the multithreaded design. The code is available at https://github.com/humanophilic/MeetingPipe.
AB - Maintaining a positive interaction is the key to a healthy and efficient meeting. Aiming to improve the quality of online meetings, we present an end-to-end neural-network-based system, named MeetingPipe, which is capable of quantitative microbehavior detection (smiling, nodding, and speaking) from recorded meeting videos. For smile detection, we build a neural network framework that consists of an 18-layer residual network for feature representation and a self-attention layer to explore the correlations among receptive fields. To perform nodding detection, we obtain head rotation data as the key nodding feature. Then we use a gated recurrent unit followed by a squeeze-and-excitation mechanism to capture the temporal information of nodding patterns from head pitch angles. In addition, we utilize TalkNet, an active speaker detection model, which can effectively recognize active speakers from videos. Experiments demonstrate that with K-fold cross-validation, the F1 scores of smile, nodding, and speaking detection are 97.34, 81.26, and 94.90%, respectively. The processing can be accelerated with multiple GPUs owing to the multithreaded design. The code is available at https://github.com/humanophilic/MeetingPipe.
UR - http://www.scopus.com/inward/record.url?scp=85135719679&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85135719679&partnerID=8YFLogxK
U2 - 10.18494/SAM3959
DO - 10.18494/SAM3959
M3 - Article
AN - SCOPUS:85135719679
VL - 34
SP - 3017
EP - 3027
JO - Sensors and Materials
JF - Sensors and Materials
SN - 0914-4935
IS - 8
ER -