TY - GEN
T1 - Fourth-Person Captioning
T2 - 2018 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2018
AU - Nakashima, Kazuto
AU - Iwashita, Yumi
AU - Kawamura, Akihiro
AU - Kurazume, Ryo
N1 - Funding Information:
This work was partially supported by JSPS KAKENHI Grant Number JP26249029 and JST CREST Grant Number JPMJCR17A5, Japan.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/1/16
Y1 - 2019/1/16
N2 - We aim to develop a support system that enhances a human's short-term visual memory in an intelligent space where the human and a service robot coexist. In particular, this paper focuses on how diverse and complex life events can be interpreted and recorded on behalf of humans from a multi-perspective viewpoint. We propose a novel method named 'fourth-person captioning', which generates natural-language descriptions by complementarily summarizing visual contexts from three types of cameras corresponding to the first-, second-, and third-person viewpoints. We first extend a state-of-the-art image captioning technique and design a new model that generates a sequence of words given the multiple images. We then provide an effective training strategy that requires only annotations supervising single-viewpoint images in a general caption dataset, together with unsupervised triplet instances collected in the intelligent space. As the three types of cameras, we select a wearable camera on the human, a robot-mounted camera, and an embedded camera, which we define as the first-, second-, and third-person viewpoints, respectively. We hope our work will accelerate cross-modal interaction bridging human egocentric cognition and multi-perspective intelligence.
AB - We aim to develop a support system that enhances a human's short-term visual memory in an intelligent space where the human and a service robot coexist. In particular, this paper focuses on how diverse and complex life events can be interpreted and recorded on behalf of humans from a multi-perspective viewpoint. We propose a novel method named 'fourth-person captioning', which generates natural-language descriptions by complementarily summarizing visual contexts from three types of cameras corresponding to the first-, second-, and third-person viewpoints. We first extend a state-of-the-art image captioning technique and design a new model that generates a sequence of words given the multiple images. We then provide an effective training strategy that requires only annotations supervising single-viewpoint images in a general caption dataset, together with unsupervised triplet instances collected in the intelligent space. As the three types of cameras, we select a wearable camera on the human, a robot-mounted camera, and an embedded camera, which we define as the first-, second-, and third-person viewpoints, respectively. We hope our work will accelerate cross-modal interaction bridging human egocentric cognition and multi-perspective intelligence.
UR - http://www.scopus.com/inward/record.url?scp=85062231854&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062231854&partnerID=8YFLogxK
U2 - 10.1109/SMC.2018.00365
DO - 10.1109/SMC.2018.00365
M3 - Conference contribution
AN - SCOPUS:85062231854
T3 - Proceedings - 2018 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2018
SP - 2122
EP - 2127
BT - Proceedings - 2018 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 7 October 2018 through 10 October 2018
ER -
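N1 - Editor's note: the abstract describes a captioning model that fuses first-, second-, and third-person images into a single word sequence. The sketch below is not the authors' implementation; it is a minimal, hedged illustration of that idea under assumed choices (pre-extracted CNN features, fusion by concatenation, an LSTM decoder, and all layer sizes and names are illustrative).

```python
# Minimal sketch of multi-view captioning as described in the abstract.
# All module names, dimensions, and the concatenation-based fusion are assumptions,
# not the authors' architecture.
import torch
import torch.nn as nn


class MultiViewCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project each viewpoint's pre-extracted image feature, then fuse by concatenation.
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.fuse = nn.Linear(3 * embed_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, first_feat, second_feat, third_feat, captions):
        # Each *_feat: (batch, feat_dim); captions: (batch, seq_len) word indices.
        views = [torch.relu(self.proj(f)) for f in (first_feat, second_feat, third_feat)]
        h0 = torch.tanh(self.fuse(torch.cat(views, dim=1))).unsqueeze(0)  # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)  # (batch, seq_len, vocab_size) word logits


if __name__ == "__main__":
    model = MultiViewCaptioner()
    f1, f2, f3 = (torch.randn(2, 2048) for _ in range(3))
    caps = torch.randint(0, 10000, (2, 12))
    print(model(f1, f2, f3, caps).shape)  # torch.Size([2, 12, 10000])
```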