Published January 1, 2012 | Version v1
Journal article | Open access

Performance Analysis and Improvement of Turkish Broadcast News Retrieval

  • 1. Rutgers State Univ, Dept Elect & Comp Engn, Piscataway, NJ 08854 USA
  • 2. Bogazici Univ, Dept Elect & Elect Engn, TR-34342 Istanbul, Turkey

Description

This paper presents our work on the retrieval of spoken information in Turkish. Traditional speech retrieval systems perform indexing and retrieval over automatic speech recognition (ASR) transcripts, which include errors either because of out-of-vocabulary (OOV) words or ASR inaccuracy. We use subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors. Performance of such methods is evaluated on our Turkish Broadcast News Corpus with two types of speech retrieval systems: a spoken term detection (STD) and a spoken document retrieval (SDR) system. To evaluate the SDR system, we also build a spoken information retrieval (IR) collection, which is the first for Turkish. Experiments showed that word segmentation algorithms are quite useful for both tasks. SDR performance is observed to be less dependent on the ASR component, whereas any performance change in ASR directly affects STD. We also present extensive analysis of retrieval performance depending on query length, and propose length-based index combination and thresholding strategies for the STD task. Finally, a new approach, which depends on the detection of stems instead of complete terms, is tried for STD and observed to give promising results. Although evaluations were performed in Turkish, we expect the proposed methods to be effective for similar languages as well.
