Published January 1, 2024 | Version v1
Conference paper Open

Turkish Image Captioning with Vision Transformer Based Encoders and Text Decoders

  • 1. Istanbul University, Department of Computer Engineering, Istanbul, Türkiye

Description

Image captioning is defined as the process by which computer systems automatically describe images, so that visual information about the content of an image is expressed in textual form. This paper presents a deep learning-based Turkish image captioning study implemented using vision transformers and text decoders. In the proposed approach, images are first encoded with a vision transformer-based module. The encoded image features are then normalized by passing them through a feature projection module. In the final stage, image captions are generated by a text decoder block. To evaluate the performance of the Turkish image captioning system presented in this paper, TasvirEt, a benchmark dataset of Turkish image captions, was used. The tests yielded promising results: a BLEU-1 score of 0.3406, a BLEU-2 score of 0.2110, a BLEU-3 score of 0.1253, a BLEU-4 score of 0.0690, a METEOR score of 0.1610, a ROUGE-L score of 0.3145, and a CIDEr score of 0.3879.
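The feature projection step described above (normalizing the encoder's features before they reach the text decoder) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the layer-norm-plus-linear design, the patch count (196), and the dimensions (768-dimensional ViT features projected to an assumed decoder width of 512) are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean / unit variance
    # along its last axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feature_projection(features, W, b):
    # Project encoder features into the decoder's embedding space,
    # then normalize them -- one common way to bridge a ViT encoder
    # and a text decoder.
    return layer_norm(features @ W + b)

rng = np.random.default_rng(0)
enc = rng.normal(size=(196, 768))            # stand-in for ViT patch features
W = rng.normal(scale=0.02, size=(768, 512))  # assumed projection weights
b = np.zeros(512)

proj = feature_projection(enc, W, b)
print(proj.shape)  # prints (196, 512)
```

After this bridge, each of the 196 projected feature vectors has zero mean and unit variance, giving the decoder inputs on a consistent scale regardless of the encoder's output statistics.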

Files

bib-0f582afd-172f-44eb-ab13-ea2589e6d1f6.txt
