Published January 1, 2023 | Version v1
Conference paper · Open Access

Multimodal Video Captioning using Object-Auditory Information Fusion with Transformers

  • 1. Baskent University, Department of Computer Engineering, Ankara, Türkiye

Description

Video captioning aims to generate natural language sentences that describe an input video. Producing coherent sentences is challenging due to the complex nature of video content: it requires object and scene understanding, extraction of object- and event-specific auditory information, and modeling of the relationships among objects. In this study, we address the efficient modeling of object interactions in scenes, as these interactions carry crucial information about the events in the visual scene. To this end, we propose to use object features together with auditory information to better model the audio-visual scene appearing in the video. Specifically, we extract object features with Faster R-CNN and auditory features with VGGish, and design a transformer encoder-decoder architecture in a multimodal setup. Experiments on MSR-VTT show encouraging results: object features, combined with auditory information, model object interactions better than ResNet features.
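The fusion step described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes object features (e.g., 2048-d Faster R-CNN region features) and auditory features (e.g., 128-d VGGish embeddings) are linearly projected into a shared model dimension, tagged with modality embeddings, and concatenated into a single token sequence for a transformer encoder. All dimensions, weights, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative only).
D_MODEL = 512             # shared transformer dimension
N_OBJ, D_OBJ = 20, 2048   # object tokens per video (Faster R-CNN RoI features)
N_AUD, D_AUD = 10, 128    # audio tokens per video (VGGish embeddings)

# Linear projections into the shared space (randomly initialised here;
# in a real model these would be learned parameters).
W_obj = rng.standard_normal((D_OBJ, D_MODEL)) * 0.02
W_aud = rng.standard_normal((D_AUD, D_MODEL)) * 0.02

# Modality embeddings let the encoder distinguish the two streams.
emb_obj = rng.standard_normal(D_MODEL) * 0.02
emb_aud = rng.standard_normal(D_MODEL) * 0.02

def fuse(obj_feats, aud_feats):
    """Project each modality, add its modality embedding, concatenate."""
    obj_tok = obj_feats @ W_obj + emb_obj                  # (N_OBJ, D_MODEL)
    aud_tok = aud_feats @ W_aud + emb_aud                  # (N_AUD, D_MODEL)
    return np.concatenate([obj_tok, aud_tok], axis=0)      # (N_OBJ+N_AUD, D_MODEL)

obj_feats = rng.standard_normal((N_OBJ, D_OBJ))
aud_feats = rng.standard_normal((N_AUD, D_AUD))
tokens = fuse(obj_feats, aud_feats)
print(tokens.shape)  # (30, 512)
```

The resulting fused sequence would then be consumed by a standard transformer encoder, with a decoder attending over it to emit the caption tokens.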
