Using English Baits to Catch Serbian Multi-Word Terminology


Type
Paper in conference proceedings
Paper version
published version
Language
English
Creator
Cvetana Krstev, Branislava Šandrih, Ranka Stanković
Source
Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018
Editor
Nicoletta Calzolari et al.
Publisher
European Language Resources Association (ELRA)
Publication date
2018
Abstract
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if domain terminology exists for a source language, together with a domain corpus aligned between the source and a target language, then it is possible to extract the terminology for the target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for the source language, a terminology extractor for the target language, and a tool for word and chunk alignment. In this first experiment the source language is English, the target language is Serbian, and the domain is Library and Information Science, for which a bilingual terminological dictionary exists. Our term extractor is based on e-dictionaries and shallow parsing, and for word alignment we use GIZA++. At the end of the procedure we included a supervised binary classifier that decides whether an extracted term is a valid domain term. The classifier was evaluated in a 5-fold cross-validation setting on a slightly unbalanced dataset, achieving an average F-score of 89%. In the experiment our system extracted 846 different Serbian domain phrases, including 515 Serbian phrases that were not present in the existing domain terminology.
ISBN
979-10-95546-00-9
Subject
aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection
Broader category of paper
M30
Narrower category of paper
M33
Rights
Open access
License
Creative Commons – Attribution-NonCommercial-No Derivative Works 4.0 International
Format
.pdf
Object collections
Ранка Станковић
Radovi istraživača
Media
pdf

Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
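
The hypothesis stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `catch_terms` and `split_known_new`, the representation of alignments as (English chunk, Serbian chunk) pairs, and all toy data are assumptions made for illustration only. The sketch keeps a Serbian candidate phrase when it is aligned with a known English domain term (the "bait"), then separates phrases already in the existing terminology from new ones. The supervised classifier mentioned in the abstract is not modelled here.

```python
def catch_terms(english_terms, aligned_chunks, serbian_candidates):
    """Map each English bait term to the Serbian candidates aligned with it.

    aligned_chunks is an iterable of (english_chunk, serbian_chunk) pairs,
    a hypothetical simplification of word/chunk alignment output.
    """
    baits = {t.lower() for t in english_terms}
    candidates = {c.lower() for c in serbian_candidates}
    caught = {}
    for en_chunk, sr_chunk in aligned_chunks:
        en, sr = en_chunk.lower(), sr_chunk.lower()
        # Keep the pair only if both sides are recognized:
        # the English side as a known term, the Serbian side as a candidate.
        if en in baits and sr in candidates:
            caught.setdefault(en, set()).add(sr)
    return caught


def split_known_new(caught, existing_serbian_terms):
    """Split caught Serbian phrases into already-known and new ones."""
    known = {t.lower() for t in existing_serbian_terms}
    all_caught = set().union(*caught.values())
    return all_caught & known, all_caught - known


# Toy data, invented for illustration (not taken from the paper).
english_terms = ["library catalogue", "information retrieval"]
aligned_chunks = [
    ("library catalogue", "bibliotečki katalog"),
    ("information retrieval", "pretraživanje informacija"),
    ("the user", "korisnik"),  # English side is not a bait, so it is skipped
]
serbian_candidates = ["bibliotečki katalog", "pretraživanje informacija", "korisnik"]
existing_serbian_terms = ["bibliotečki katalog"]

caught = catch_terms(english_terms, aligned_chunks, serbian_candidates)
known, new = split_known_new(caught, existing_serbian_terms)
print(sorted(known), sorted(new))
```

In the terms of the abstract, `new` corresponds to the kind of phrases counted among the 515 that were absent from the existing domain terminology, while `known` corresponds to the remainder of the 846 extracted phrases.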