Published January 1, 2012 | Version v1
Journal article
Open
Performance Analysis and Improvement of Turkish Broadcast News Retrieval
Creators
- 1. Rutgers State Univ, Dept Elect & Comp Engn, Piscataway, NJ 08854 USA
- 2. Bogazici Univ, Dept Elect & Elect Engn, TR-34342 Istanbul, Turkey
Description
This paper presents our work on the retrieval of spoken information in Turkish. Traditional speech retrieval systems perform indexing and retrieval over automatic speech recognition (ASR) transcripts, which include errors either because of out-of-vocabulary (OOV) words or ASR inaccuracy. We use subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors. Performance of such methods is evaluated on our Turkish Broadcast News Corpus with two types of speech retrieval systems: a spoken term detection (STD) and a spoken document retrieval (SDR) system. To evaluate the SDR system, we also build a spoken information retrieval (IR) collection, which is the first for Turkish. Experiments showed that word segmentation algorithms are quite useful for both tasks. SDR performance is observed to be less dependent on the ASR component, whereas any performance change in ASR directly affects STD. We also present extensive analysis of retrieval performance depending on query length, and propose length-based index combination and thresholding strategies for the STD task. Finally, a new approach, which depends on the detection of stems instead of complete terms, is tried for STD and observed to give promising results. Although evaluations were performed in Turkish, we expect the proposed methods to be effective for similar languages as well.
Files
| Name | Size |
|---|---|
| bib-89774668-1761-4e63-a60a-baead5c78e99.txt (md5:987164ee3e18db82a92b7be504c6d021) | 182 Bytes |