Multimodal Deep Neural Network with Image Sequence Features for Video Captioning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve this problem and achieves a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is natural given the challenging nature of learning the relationships between visual and textual content. A possible reason is that the available video caption datasets are still too small for the purpose. We circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions thanks to its capability of learning alignments between text fragments and image fragments. Experiments on two video caption datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, MDNNiSF achieved a METEOR score of 0.344 on MSVD, which is 21.5% higher than S2VT.
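The abstract reports MDNNiSF's METEOR score (0.344) and its relative gain over S2VT (21.5%) but not the baseline score itself. The baseline implied by those two figures can be backed out with a little arithmetic; note that the value below is derived from the abstract's numbers, not quoted from the paper:

```python
# Sanity-check the reported METEOR improvement on MSVD.
# Given MDNNiSF's score and the stated relative gain over S2VT,
# back out the implied S2VT baseline.

mdnnisf_meteor = 0.344   # reported MDNNiSF METEOR on MSVD
relative_gain = 0.215    # reported improvement over S2VT (21.5%)

implied_s2vt = mdnnisf_meteor / (1 + relative_gain)
print(f"implied S2VT METEOR on MSVD: {implied_s2vt:.3f}")  # ~0.283
```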

Original language: English
Title of host publication: 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509060146
DOIs: 10.1109/IJCNN.2018.8489668
Publication status: Published - Oct 10 2018
Event: 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Rio de Janeiro, Brazil
Duration: Jul 8 2018 - Jul 13 2018

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks
Volume: 2018-July

Other

Other: 2018 International Joint Conference on Neural Networks, IJCNN 2018
Country: Brazil
City: Rio de Janeiro
Period: 7/8/18 - 7/13/18

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

Oura, S., Matsukawa, T., & Suzuki, E. (2018). Multimodal Deep Neural Network with Image Sequence Features for Video Captioning. In 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings [8489668] (Proceedings of the International Joint Conference on Neural Networks; Vol. 2018-July). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/IJCNN.2018.8489668

Multimodal Deep Neural Network with Image Sequence Features for Video Captioning. / Oura, Soichiro; Matsukawa, Tetsu; Suzuki, Einoshin.

2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2018. 8489668 (Proceedings of the International Joint Conference on Neural Networks; Vol. 2018-July).

Oura, S, Matsukawa, T & Suzuki, E 2018, Multimodal Deep Neural Network with Image Sequence Features for Video Captioning. in 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings., 8489668, Proceedings of the International Joint Conference on Neural Networks, vol. 2018-July, Institute of Electrical and Electronics Engineers Inc., 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, 7/8/18. https://doi.org/10.1109/IJCNN.2018.8489668

Oura S, Matsukawa T, Suzuki E. Multimodal Deep Neural Network with Image Sequence Features for Video Captioning. In 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2018. 8489668. (Proceedings of the International Joint Conference on Neural Networks). https://doi.org/10.1109/IJCNN.2018.8489668

Oura, Soichiro ; Matsukawa, Tetsu ; Suzuki, Einoshin. / Multimodal Deep Neural Network with Image Sequence Features for Video Captioning. 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2018. (Proceedings of the International Joint Conference on Neural Networks).
@inproceedings{554f5a6217bf4242afea88e1b3ccf820,
title = "Multimodal Deep Neural Network with Image Sequence Features for Video Captioning",
abstract = "In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve the problem and demonstrated high METEOR. However, experiments show that S2VT sometimes produces inaccurate sentences, which is quite natural due to the challenging nature of learning relationships between visual and textual contents. A possible reason is that the video caption data were still small for the purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, which is for image captioning and known to generate an accurate description due to its capability of learning alignments between text fragments to image fragments. Experiments using two video caption data, MSVD and MSRVTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, MDNNiSF achieved METEOR 0.344, which is 21.5{\%} higher than S2VT, with MSVD.",
author = "Soichiro Oura and Tetsu Matsukawa and Einoshin Suzuki",
year = "2018",
month = "10",
day = "10",
doi = "10.1109/IJCNN.2018.8489668",
language = "English",
series = "Proceedings of the International Joint Conference on Neural Networks",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings",
address = "United States",
volume = "2018-July",
isbn = "9781509060146",

}

TY  - GEN
T1  - Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
AU  - Oura, Soichiro
AU  - Matsukawa, Tetsu
AU  - Suzuki, Einoshin
PY  - 2018/10/10
Y1  - 2018/10/10
N2  - In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve the problem and demonstrated high METEOR. However, experiments show that S2VT sometimes produces inaccurate sentences, which is quite natural due to the challenging nature of learning relationships between visual and textual contents. A possible reason is that the video caption data were still small for the purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, which is for image captioning and known to generate an accurate description due to its capability of learning alignments between text fragments to image fragments. Experiments using two video caption data, MSVD and MSRVTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, MDNNiSF achieved METEOR 0.344, which is 21.5% higher than S2VT, with MSVD.
AB  - In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve the problem and demonstrated high METEOR. However, experiments show that S2VT sometimes produces inaccurate sentences, which is quite natural due to the challenging nature of learning relationships between visual and textual contents. A possible reason is that the video caption data were still small for the purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, which is for image captioning and known to generate an accurate description due to its capability of learning alignments between text fragments to image fragments. Experiments using two video caption data, MSVD and MSRVTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, MDNNiSF achieved METEOR 0.344, which is 21.5% higher than S2VT, with MSVD.
UR  - http://www.scopus.com/inward/record.url?scp=85056521515&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85056521515&partnerID=8YFLogxK
U2  - 10.1109/IJCNN.2018.8489668
DO  - 10.1109/IJCNN.2018.8489668
M3  - Conference contribution
AN  - SCOPUS:85056521515
T3  - Proceedings of the International Joint Conference on Neural Networks
BT  - 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  - 