Published January 1, 2023 | Version v1
Conference paper · Open Access

Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers

  • 1. Baskent University, Department of Computer Engineering, Ankara, Türkiye

Description

Video captioning aims to generate natural language sentences that describe an input video. Producing coherent sentences is challenging due to the complex nature of video content: it requires object and scene understanding, extraction of object- and event-specific auditory information, and modeling of the relationships among objects. In this study, we address the efficient modeling of object interactions in scenes, as these interactions carry crucial information about the events in the visual scene. To this end, we propose to use object features together with auditory information to better model the audio-visual scene appearing in the video. Specifically, we extract object features with Faster R-CNN and auditory features with VGGish, and design a transformer encoder-decoder architecture in a multimodal setup. Experiments on MSR-VTT show encouraging results: object features, combined with auditory information, model object interactions better than ResNet features.
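The fusion step described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes object features (e.g., 2048-d Faster R-CNN region features) and auditory features (e.g., 128-d VGGish embeddings) are linearly projected into a shared model dimension, tagged with modality embeddings, and concatenated into a single token sequence for a transformer encoder. All dimensions, weights, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative only).
D_MODEL = 512             # shared transformer dimension
N_OBJ, D_OBJ = 20, 2048   # object tokens per video (Faster R-CNN RoI features)
N_AUD, D_AUD = 10, 128    # audio tokens per video (VGGish embeddings)

# Linear projections into the shared space (randomly initialised here;
# in a real model these would be learned parameters).
W_obj = rng.standard_normal((D_OBJ, D_MODEL)) * 0.02
W_aud = rng.standard_normal((D_AUD, D_MODEL)) * 0.02

# Modality embeddings let the encoder distinguish the two streams.
emb_obj = rng.standard_normal(D_MODEL) * 0.02
emb_aud = rng.standard_normal(D_MODEL) * 0.02

def fuse(obj_feats, aud_feats):
    """Project each modality, add its modality embedding, concatenate."""
    obj_tok = obj_feats @ W_obj + emb_obj                  # (N_OBJ, D_MODEL)
    aud_tok = aud_feats @ W_aud + emb_aud                  # (N_AUD, D_MODEL)
    return np.concatenate([obj_tok, aud_tok], axis=0)      # (N_OBJ+N_AUD, D_MODEL)

obj_feats = rng.standard_normal((N_OBJ, D_OBJ))
aud_feats = rng.standard_normal((N_AUD, D_AUD))
tokens = fuse(obj_feats, aud_feats)
print(tokens.shape)  # (30, 512)
```

The resulting fused sequence would then be consumed by a standard transformer encoder, with a decoder attending over it to emit the caption tokens.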
