We aim to develop a support system that enhances humans' short-term visual memory in an intelligent space where a human and a service robot coexist. In particular, this paper focuses on how diverse and complex life events can be interpreted and recorded on behalf of humans from a multi-perspective viewpoint. We propose a novel method named 'fourth-person captioning', which generates natural language descriptions by complementarily summarizing visual contexts from three types of cameras corresponding to the first-, second-, and third-person viewpoints. We first extend a state-of-the-art image captioning technique and design a new model that generates a sequence of words given multiple images. We then provide an effective training strategy that requires only annotations supervising single-viewpoint images in a general caption dataset, together with unsupervised triplet instances collected in the intelligent space. As the three types of cameras, we select a wearable camera on the human, a robot-mounted camera, and a camera embedded in the environment, which we define as the first-, second-, and third-person viewpoints, respectively. We hope our work will accelerate cross-modal interaction bridging humans' egocentric cognition and multi-perspective intelligence.
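To make the model described above concrete, the following is a minimal sketch of a multi-view captioner in PyTorch. The fusion-by-averaging scheme, the LSTM decoder, and all layer sizes are illustrative assumptions for exposition, not the paper's confirmed architecture.

```python
import torch
import torch.nn as nn

class FourthPersonCaptioner(nn.Module):
    """Illustrative sketch: fuse visual features from the first-,
    second-, and third-person views, then decode a single caption.
    Average fusion and an LSTM decoder are assumptions, not the
    authors' exact design."""

    def __init__(self, feat_dim=2048, embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.project = nn.Linear(feat_dim, embed_dim)  # per-view projection
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, view_feats, captions):
        # view_feats: (batch, 3, feat_dim) -- e.g. pretrained CNN
        # features of the wearable, robot-mounted, and embedded views
        fused = self.project(view_feats).mean(dim=1, keepdim=True)
        tokens = self.embed(captions)            # (batch, T, embed_dim)
        # Prepend the fused visual token, then decode word by word
        inputs = torch.cat([fused, tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                  # per-step word logits
```

Under this sketch, the single-viewpoint caption data would train the decoder with one view's features duplicated across the three slots, while the unsupervised triplets from the intelligent space would supply genuinely multi-view inputs; how the paper combines these two signals is specified in its training strategy, not here.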