TY - GEN
T1 - Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
AU - Oura, Soichiro
AU - Matsukawa, Tetsu
AU - Suzuki, Einoshin
N1 - Funding Information:
A part of this research was supported by Grants-in-Aid for Scientific Research JP15K12100 and JP18H03290 from the Japan Society for the Promotion of Science (JSPS).
Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/10
Y1 - 2018/10/10
N2 - In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve the problem and demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is quite natural given the challenging nature of learning relationships between visual and textual contents. A possible reason is that the video caption datasets were still too small for the purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions due to its capability of learning alignments between text fragments and image fragments. Experiments using two video caption datasets, MSVD and MSRVTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, MDNNiSF achieved a METEOR score of 0.344 on MSVD, which is 21.5% higher than S2VT.
AB - In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve the problem and demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is quite natural given the challenging nature of learning relationships between visual and textual contents. A possible reason is that the video caption datasets were still too small for the purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions due to its capability of learning alignments between text fragments and image fragments. Experiments using two video caption datasets, MSVD and MSRVTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, MDNNiSF achieved a METEOR score of 0.344 on MSVD, which is 21.5% higher than S2VT.
UR - http://www.scopus.com/inward/record.url?scp=85056521515&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85056521515&partnerID=8YFLogxK
U2 - 10.1109/IJCNN.2018.8489668
DO - 10.1109/IJCNN.2018.8489668
M3 - Conference contribution
AN - SCOPUS:85056521515
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Joint Conference on Neural Networks, IJCNN 2018
Y2 - 8 July 2018 through 13 July 2018
ER -