Conference paper, Open Access

Automatic Transcription of Ottoman Documents Using Deep Learning

   Bilgin Taşdemir, Esma F.; Tandoğan, Zeynep; Akansu, S. Doğan; Kızılırmak, Fırat; Şen, Umut; Akca, Aysu; Kuru, Mehmet; Yanıkoğlu, Berrin

With the accelerated pace of digitization, a vast collection of Ottoman documents has become accessible to researchers and the general public. However, most users interested in these documents are unable to read them, because the text is Turkish written in the Arabic-Persian script. Manually transcribing such a large volume of documents is also beyond the capacity of human experts. Building on advances in deep learning, we provide a solution to the long-standing problem of automatic transcription of printed Ottoman documents. We evaluated three decoding strategies, including Word Beam Search, which allows a recognition lexicon and n-gram statistics to be used during the decoding phase. We also evaluated the effect of lexicon size and coverage, and of language modelling via character or word n-grams. Using a large general-purpose lexicon of the Ottoman era (260K words and 86% test coverage), the system achieves a 6.59% character error rate and a 28.46% word error rate on a test set of 6,828 text lines.
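For readers unfamiliar with the reported metrics, the following is a minimal sketch of how character error rate, word error rate, and lexicon coverage are commonly computed. It is not the authors' evaluation code. It assumes that edit distances are pooled over all lines before dividing by the total reference length, that words are split on whitespace, and that coverage is counted over word tokens; the abstract does not state these details. The function names and the sample lines are hypothetical.

```python
from typing import Sequence, Set, List


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: List[str], hyps: List[str]) -> float:
    """Character error rate: pooled character edits / reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def wer(refs: List[str], hyps: List[str]) -> float:
    """Word error rate: pooled word edits / reference words.

    Assumption: words are whitespace-separated tokens.
    """
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / sum(len(r.split()) for r in refs)


def lexicon_coverage(refs: List[str], lexicon: Set[str]) -> float:
    """Fraction of reference word tokens found in the lexicon.

    Assumption: coverage is token-based; a type-based count is also plausible.
    """
    words = [w for r in refs for w in r.split()]
    return sum(w in lexicon for w in words) / len(words)


# Hypothetical transcribed lines, for illustration only.
refs = ["bu bir kitap", "eski metin"]
hyps = ["bu bir kitab", "eski metin"]
print(f"CER = {cer(refs, hyps):.4f}, WER = {wer(refs, hyps):.4f}")
print(f"coverage = {lexicon_coverage(refs, {'bu', 'bir', 'eski', 'metin'}):.2f}")
```

A single wrong character makes the whole word count as an error, which is why a word error rate is typically several times larger than the corresponding character error rate.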

Files (643.4 kB)
File name: DAS-2024.pdf
md5:7b5baefc21ac43df114ee607aa2748b4
Size: 643.4 kB