Multimodal Deep Neural Network with Image Sequence Features for Video Captioning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve the problem and has demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is unsurprising given how challenging it is to learn relationships between visual and textual content. A possible reason is that the available video caption data are still too small for the purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions thanks to its ability to learn alignments between text fragments and image fragments. Experiments on two video caption datasets, MSVD and MSRVTT, demonstrate the effectiveness of MDNNiSF over S2VT. For example, on MSVD, MDNNiSF achieved a METEOR score of 0.344, which is 21.5% higher than that of S2VT.
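
The MSVD figures in the abstract can be checked with a small calculation. The sketch below assumes that "21.5% higher" means a relative improvement over the S2VT score; the abstract does not state the baseline value, so the number derived here (about 0.283) is an inference under that assumption, not a reported result. The function and variable names are illustrative only.

```python
def implied_baseline(score: float, relative_gain: float) -> float:
    """Return the baseline b such that score == b * (1 + relative_gain)."""
    return score / (1.0 + relative_gain)


# METEOR of MDNNiSF on MSVD, as reported in the abstract.
mdnnisf_meteor = 0.344
# "21.5% higher than S2VT", read as a relative gain (assumption).
relative_gain = 0.215

# Baseline S2VT score implied by the two numbers above.
s2vt_meteor = implied_baseline(mdnnisf_meteor, relative_gain)
print(f"Implied S2VT METEOR on MSVD: {s2vt_meteor:.3f}")
```

If the percentage were instead meant as an absolute difference in METEOR points, the implied baseline would be different, so the abstract alone does not settle the exact S2VT score.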

Original language: English
Title of host publication: 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509060146
DOI
Publication status: Published - Oct 10, 2018
Event: 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Rio de Janeiro, Brazil
Duration: Jul 8, 2018 to Jul 13, 2018

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks
2018-July

Other

Other: 2018 International Joint Conference on Neural Networks, IJCNN 2018
Country: Brazil
City: Rio de Janeiro
Period: 7/8/18 to 7/13/18

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence
