Understanding human actions from first-person videos poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. (2) To train the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. On egocentric video captioning, EgoInstructor achieves significant improvements by leveraging third-person videos as references.
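The EgoExoNCE objective can be pictured as a contrastive (InfoNCE-style) loss in which the ego and exo videos of a pseudo pair are aligned to one shared text anchor. The sketch below is a minimal illustration of that idea, assuming batched, L2-normalized ego, exo, and text features; the function name, the symmetric video-text terms, and the temperature value are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ego_exo_nce(ego_feat, exo_feat, text_feat, temperature=0.07):
    """Illustrative EgoExoNCE-style loss: ego and exo video features of a
    pseudo pair are both contrasted against the same shared text feature,
    which pulls the two views toward a common text anchor.
    All inputs are assumed to be (B, D) and L2-normalized."""
    logits_ego = ego_feat @ text_feat.t() / temperature   # (B, B) ego-to-text similarities
    logits_exo = exo_feat @ text_feat.t() / temperature   # (B, B) exo-to-text similarities
    targets = torch.arange(ego_feat.size(0), device=ego_feat.device)
    # symmetric video-to-text and text-to-video cross-entropy for each view
    loss_ego = (F.cross_entropy(logits_ego, targets) +
                F.cross_entropy(logits_ego.t(), targets)) / 2
    loss_exo = (F.cross_entropy(logits_exo, targets) +
                F.cross_entropy(logits_exo.t(), targets)) / 2
    return (loss_ego + loss_exo) / 2
```

Because both views are matched against the same text, minimizing this loss also draws paired ego and exo features toward each other without contrasting them directly.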
Overall framework of our model. Given an egocentric video, we first retrieve relevant exocentric instructional videos using a frozen cross-view retrieval module pre-trained on automatically generated pseudo ego-exo pairs. The multimodal captioning model (consisting of a visual encoder, a perceiver resampler, and a text decoder) takes the egocentric video as input, uses the retrieved exocentric videos and their captions as references, and generates the caption of the egocentric video.
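At inference time, the frozen cross-view retrieval step can be pictured as nearest-neighbour search by cosine similarity in the module's shared embedding space. The snippet below is a hedged sketch under that assumption; how the features are extracted and the choice of k are not specified here.

```python
import torch
import torch.nn.functional as F

def retrieve_topk_exo(ego_feat, exo_feats, k=3):
    """Illustrative retrieval step: return the indices of the k exocentric
    videos whose embeddings are most similar to the egocentric query,
    assuming features were pre-extracted by the frozen retrieval module."""
    ego_feat = F.normalize(ego_feat, dim=-1)      # (D,) query embedding
    exo_feats = F.normalize(exo_feats, dim=-1)    # (N, D) exocentric gallery
    sims = exo_feats @ ego_feat                   # (N,) cosine similarities
    return sims.topk(k).indices                   # indices of reference videos
```

The retrieved videos and their captions are then passed, together with the egocentric video, to the captioning model as references.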
In this work, we train the cross-view retrieval module by constructing pseudo-paired egocentric (Ego4D) and exocentric (HowTo100M) videos. This is achieved by (1) refining the HowTo100M ASR transcripts into a descriptive style with an LLM, and (2) pairing videos based on the similarity of their narrations.
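A rough sketch of step (2) is given below, assuming narration embeddings have already been extracted by some text encoder; the best-match rule and the similarity threshold are illustrative assumptions rather than the paper's exact pairing criterion.

```python
import torch
import torch.nn.functional as F

def mine_ego_exo_pairs(ego_text_emb, exo_text_emb, threshold=0.6):
    """Illustrative pair mining: match each egocentric narration to its most
    similar (LLM-refined) exocentric narration and keep matches whose cosine
    similarity exceeds a threshold as pseudo ego-exo pairs."""
    ego = F.normalize(ego_text_emb, dim=-1)   # (N_ego, D) ego narration embeddings
    exo = F.normalize(exo_text_emb, dim=-1)   # (N_exo, D) refined exo narration embeddings
    sims = ego @ exo.t()                      # (N_ego, N_exo) cosine similarities
    best_sim, best_idx = sims.max(dim=1)      # best exo match per ego narration
    keep = best_sim >= threshold
    return [(i, best_idx[i].item())
            for i in torch.nonzero(keep).flatten().tolist()]
```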
@article{xu2024retrieval,
  title={Retrieval-Augmented Egocentric Video Captioning},
  author={Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  journal={arXiv preprint arXiv:2401.00789},
  year={2024}
}