Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages

Bolucu, Necva; Can, Burcu

doi:10.1145/3292398

Published January 1, 2019 | Version v1

Journal article, open access

1. Hacettepe Univ, Ankara, Turkey

The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems along with PoS tags simultaneously. Therefore, we aim to overcome the sparsity problem by reducing word forms into their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and stem. We use the semantic similarity as prior information to discover the actual stem of a word since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.
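The idea of using embedding similarity to favor the true stem can be illustrated with a minimal sketch. This is not the authors' Bayesian model, which uses the similarity as a prior inside an HMM; it only shows how cosine similarity between a word form and its candidate prefixes can separate a meaning-preserving stem from an unrelated one. The function names, the `threshold` parameter, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_stem(word, embeddings, min_len=2, threshold=0.5):
    """Return the proper prefix of `word` most similar to it, else `word` itself.

    Hypothetical helper: only prefixes that have an embedding are considered,
    and a prefix must exceed `threshold` (an assumed cutoff) to be accepted.
    """
    if word not in embeddings:
        return word
    best, best_sim = word, threshold
    for end in range(min_len, len(word)):
        prefix = word[:end]
        if prefix in embeddings:
            sim = cosine(embeddings[word], embeddings[prefix])
            if sim > best_sim:
                best, best_sim = prefix, sim
    return best


# Toy, hand-made 2-d vectors (hypothetical, not trained embeddings).
TOY_EMBEDDINGS = {
    "elmalar": [0.9, 0.1],  # Turkish "apples"
    "elma": [1.0, 0.0],     # "apple"
    "el": [0.0, 1.0],       # "hand": a valid prefix, but unrelated in meaning
}

print(best_stem("elmalar", TOY_EMBEDDINGS))
```

With these toy vectors the sketch selects "elma" rather than the shorter prefix "el", because only "elma" is close to "elmalar" in the embedding space. A word with no embedding is returned unchanged.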

Files

Name	Size	Checksum
bib-3a93277f-a100-496a-aeb9-5a392ec269e6.txt	184 Bytes	md5:7f0c56c398f431e346bec58bf3883b3d

	All versions	This version
Views	72	72
Downloads	58	58
Data volume	10.7 kB	10.7 kB


TÜBİTAK ULAKBİM
