Search
12 items
-
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies on several sophisticated textual and lexical resources that were developed for most Balkan languages. These resources are based on several de facto standards in natural language processing.... Bulgarian Grammar dictionary (DELAS dictionary) consists of 127,000 lemmas distributed as follows: approx. 85,000 simple lemmas belong to general lexis, approx. 6,000 lemmas represent domain-specific lexis and approx. 36,000 lemmas are simple proper names. The corresponding DELAF dictionary consists of approx. ...
... Linguistique) under the guidance of Maurice Gross. The format of a DELAS-type dictionary basically consists of simple word lemmas accompanied by inflectional class codes, which enable the production of a DELAF-type dictionary consisting of all inflectional forms with their grammatical information ...
... responsible for generation of all inflectional forms of each DELAS lemma corresponds to each inflectional class code. The Serbian morphological dictionary of simple words contains 121,000 lemmas which yield the production of approximately 1,450,000 different lexical words. Close to 87,000 simple ... Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
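The snippets above describe how a DELAS lemma plus its inflectional class code yields all DELAF forms. A minimal Python sketch of that idea follows. The class codes (`N_S`, `N_Y`, `V_ED`), the suffix rules, and the English words are hypothetical illustrations, not data from the Serbian or Bulgarian dictionaries, and the output line shape is only loosely modeled on the `form,lemma.CODE:features` convention.

```python
# Hypothetical inflection classes (assumption, for illustration only):
# class code -> list of (chars to strip from lemma end, suffix to add, tag)
TOY_CLASSES = {
    "N_S": [(0, "", "sg"), (0, "s", "pl")],
    "N_Y": [(0, "", "sg"), (1, "ies", "pl")],
    "V_ED": [(0, "", "inf"), (0, "s", "3sg"), (0, "ed", "past")],
}


def expand(lemma, code, classes=TOY_CLASSES):
    """Return (form, lemma, code, tag) tuples for one lemma entry."""
    forms = []
    for strip, suffix, tag in classes[code]:
        stem = lemma[: len(lemma) - strip] if strip else lemma
        forms.append((stem + suffix, lemma, code, tag))
    return forms


def to_form_lines(lemma_entries, classes=TOY_CLASSES):
    """Expand a list of (lemma, code) pairs into 'form,lemma.CODE:tag' lines."""
    lines = []
    for lemma, code in lemma_entries:
        for form, lem, c, tag in expand(lemma, code, classes):
            lines.append(f"{form},{lem}.{c}:{tag}")
    return lines


if __name__ == "__main__":
    for line in to_form_lines([("book", "N_S"), ("city", "N_Y")]):
        print(line)
```

In the real resources each class code is tied to its own generation mechanism rather than to a flat suffix table, so this sketch only illustrates the lemma-to-forms relationship.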
-
The Nooj System as Module within an Integrated Language Processing Environment
... and using external Perl, Awk, and XSLT scripts. [Figure: package diagram of WS4LR modules (Conversion, Dictionary Management, WordNet Development, Exploitation of Aligned Texts); Dictionary Management covers simple words manipulation, compound words management, Nooj dictionaries management ...]
... and multilingual dictionaries in wordnet development, we will now briefly describe the basic features of the dictionary management module. The lemma in a morphological dictionary of simple words has the following format: lemma.Knnn [+SinSem]*, where lemma is the word form usually used in ...
... are organized in modular fashion, in several sub-dictionaries as separate files. Without going into details of dictionary management, we will just point out that the dictionary management module enables the user to modify or delete all the information attached to a lemma, or the lemma itself ...Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
-
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ... ... graphs and their use are presented: preprocessing graphs, graphs for the inflection of e-dictionary lemmas and graphs for enhancement of e-dictionaries (for word forms regularly derived from lemmas already in e-dictionaries). 8. The use of contexts in graphs that shift grammars modelled by regular ...
... texts. 3. The concept of e-dictionaries is introduced. The specificities of e-dictionaries developed to be used by applications and not humans are stressed and contrasted to those of “traditional dictionaries” (being either in paper or digital form). The content of e-dictionaries for Serbian ...
... Narodna biblioteka Srbije: Belgrade. p. 117-122. [10] Stanković, R., et al., Building terminological resources in an e-learning environment, in The third International Conference on e-Learning. Belgrade, Serbia. p. 114-119. [11] Paumier, S., Unitex 3.1 Beta: User Manual. 2015, Paris: Université ... Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2015)
-
The Dictionary of the Serbian Academy: from the Text to the Lexical Database
In this paper we discuss the project of digitization of the Dictionary of the Serbo-Croatian Standard and Vernacular Language. Scanning and character recognition were a particular challenge, since various non-standard character set encodings were used in the course of the almost 60-year-long production of the dictionary. The first aim of the project was to formalize the micro-structure of the dictionary articles in order to parse the digitized text and transform it into structured data stored in a relational lexical database. This approach ... ... in Global Contexts The Dictionary of the Serbian Academy: from the Text to the Lexical Database Ranka Stanković1, Rada Stijović2, Duško Vitas1, Cvetana Krstev1, Olga Sabo2 1University of Belgrade, 2Institute for Serbian Language, Serbian Academy of Sciences and Arts E-mail: ranka.stankovic@rgf ...
... Figure 1: The microstructure of dictionary articles. 4 The transformation from the dictionary article text form to the lexical database The guidelines for dictionary writing were used to define the rules for the segmentation of the dictionary articles, the pattern recognition, and the ...
... Keywords: computer lexicography, lexical database, language resources, dictionary, Serbian language 1 Introduction The first volume of the Dictionary of the Serbo-Croatian Standard and Vernacular Language (referred to as the Dictionary of Serbian Academy or DSA), prepared and compiled by the Institute ... Ranka Stanković, Rada Stijović, Duško Vitas, Cvetana Krstev, Olga Sabo. "The Dictionary of the Serbian Academy: from the Text to the Lexical Database" in Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts, Ljubljana : Ljubljana University Press, Faculty of Arts (2018)
-
Knowledge and Rule-Based Diacritic Restoration in Serbian
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%). ... Analytics The main stages of thesaurus-based document processing include: • Tokenization and lemmatization, that is, the transfer of word forms to dictionary forms (lemmas); • Matching with the thesaurus based on the lemma representation of the document. Multiword terms from a thesaurus are matched with ...
... Lenat, D., Miller, G., and Yokoi, T. (1995). Cyc, wordnet, and edr: critiques and responses. Communications of the ACM, 38(11):45–48. Lipscomb, C. E. (2000). Medical subject headings (mesh). Bulletin of the Medical Library Association, 88(3):265. Loukachevitch, N. and Dobrov, B. (2014). Ruthes linguistic ...Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51
-
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment ... ... The SR BASIC annotated dataset will also be published. Keywords: Part-of-Speech tagging, lemmatization, corpus, evaluation, Serbian, morphological dictionary 1. Introduction The task of assigning to each token its Part-of-Speech category (noun, verb, adjective, etc.) is a common Natural Language ...
... especially for the Novels test set. This comes as no surprise, due to the fact that it is a very specific text, which is fully covered by the new dictionary used for the TT19 model. Figure 3: Precision of lemmatization per token, obtained by two TreeTagger based taggers 3959 sentences tokens words ...
... y Serbian. INFOtheca, 12(2):36a–47a, December. 8. Language Resource References Cvetana Krstev, Duško Vitas. (2015). Serbian Morphological Dictionary - SMD. University of Belgrade, HLT Group and Jerteh, Lexical resource, 2.0. Duško Vitas, Cvetana Krstev, Ranka Stanković, Miloš Utvić. (2019) ... Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France, European Language Resources Association (2020)
-
SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian
Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, Aleksandra Marković (2019) In this paper we present a model for selecting good examples for a dictionary of Serbian and the development of the model's initial components. The method used is based on a detailed analysis of various lexical and syntactic features in a corpus composed of examples from five digitized volumes of the SASA Dictionary. The initial set of features was inspired by a similar approach for other languages. The distribution of features of examples from this corpus is compared with the feature distribution of sentence samples excerpted from corpora containing various texts. The analysis showed that ... Serbian, good dictionary examples, automation of dictionary-making, feature extraction, machine learning ... different goals: speeding up the dictionary-making process, but also the development of a lexical database as the source for building new dictionaries of Serbian. In the e-lexicography era, with the imperatives of faster dictionary-making and “smart lexicography” ...
... material of the Dictionary of the SANU - the needs and possibilities of digitization in the light of contemporary approaches (in Cyrillic)]. Kilgarriff, A., Husák, M., McAdam, K., Rundell, M. & Rychlý, P. (2008). GDEX: Automatically Finding Good Dictionary Examples in a Corpus. In E. Bernal & J. ...
... selection of dictionary examples from corpora, and the presented approach supports the selection of dictionary examples making the process of dictionary development faster and more productive. 1.2 The role of dictionary examples Dictionary examples play an important role in dictionary entries and ... Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, Aleksandra Marković. "SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian" in Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference, Lexical Computing CZ, s.r.o. (2019)
-
Automatic construction of a morphological dictionary of multi-word units
The development of a comprehensive morphological dictionary of multi-word units for Serbian is a very demanding task, due to the complexity of Serbian morphology. Manual production of such a dictionary proved to be extremely time-consuming. In this paper we present a procedure that automatically produces dictionary lemmas for a given list of multi-word units. To accomplish this task the procedure relies on data in e-dictionaries of Serbian simple words, which are already well developed. We also offer an evaluation ... electronic dictionary, Serbian, morphology, inflection, multi-word units, noun phrases, query expansion ... Namely, these MWUs cannot be described by FSTs - they have to be listed in an e-dictionary in a similar way and for similar reasons as simple words. That means that some regular form, or a lemma, has to be listed in a DELAC dictionary, together with some additional information that would enable the generation ...
... source software distributed under the terms of LGPL, we easily incorporated its modules in LeXimir for many tasks that involve manipulation of e-dictionaries, including dictionary look-up used in the module for (automated) production of ...
... be extremely time-consuming. In this paper we present a procedure that automatically produces dictionary lemmas for a given list of multi-word units. To accomplish this task the procedure relies on data in e-dictionaries of Serbian simple words, which are already well developed. We also offer an evaluation ... Cvetana Krstev, Ranka Stanković, Ivan Obradović, Duško Vitas, Miloš Utvić. "Automatic construction of a morphological dictionary of multi-word units" in Lecture Notes in Computer Science 6233, Advances in Natural Language Processing, Proceedings of the 7th International Conference on NLP, IceTAL 2010, Reykjavik, Iceland, August 2010, Springer (2010): 226-237. https://doi.org/10.1007/978-3-642-14770-8_26
-
Production of morphological dictionaries of multi-word units using a multipurpose tool
The development of a comprehensive morphological dictionary of multi-word units for Serbian is a very demanding task, due to the complexity of Serbian morphology. Manual production of such a dictionary proved to be extremely time-consuming. In this paper we present a procedure that automatically produces dictionary lemmas for a given list of multi-word units. To accomplish this task the procedure relies on data in e-dictionaries of Serbian simple words, which are already well developed. We also offer an evaluation ... electronic dictionary, Serbian, morphology, inflection, multi-word units, noun phrases, query expansion ... presented for French in [1]. E-dictionaries in the same format have been produced for many other languages. This format can be briefly described in the following way: in a dictionary of lemmas (DELAS) every lemma is described in full detail so that a dictionary of forms containing all necessary ...
... (DELAF) can be generated from it. The dictionary of forms is used in NLP tasks. Two corpus processing systems that support work with this dictionary format were developed, Unitex [2] and Nooj [3], both of which are based on the use of finite-state technology. Serbian e-dictionaries of simple forms have ...
... key-words to Google. The tool relies on Serbian e-dictionaries, inflection transducers for simple words and MWUs, and uses Unitex and Multiflex modules for inflection and dictionary look-up. As for the free phrases that are not in the MWU dictionary, VeBrana relies on its built-in strategy, and always ...Ranka Stanković, Ivan Obradović, Cvetana Krstev, Duško Vitas. "Production of morphological dictionaries of multi-word units using a multipurpose tool" in Proceedings of the Computational Linguistics-Applications Conference, October 2011, Jachranka, Poland, Jachranka, Poland : PTI - Polish Information Processing Society (2011)
-
Чији је пример? Анализа лексичких обележја на примерима Речника САНУ
In this paper we ask: can the author of a text be determined by analyzing only its lexical features? To try to answer this question, we examined the examples within the dictionary articles of individual lexemes in the SASA Dictionary, recorded in five volumes (I, II, XVIII, XIX and XX). Each example is taken from a source, indicated by the abbreviations given in parentheses. Of the more than 5,000 available sources, we chose ... ... GDEX: Automatically Finding Good Dictionary Examples in a Corpus, In E. Bernal & J. DeCesaris (eds.). Proceedings of the XIII EURALEX International Congress, Barcelona: Universitat Pompeu Fabra, 425–432. Kosem 2017: Iztok Kosem, Dictionary examples, In Dictionary of Modern Slovene: Problems and ...
... get an answer, we observed examples that support lexical entries listed in five of the total of twenty volumes of the Dictionary of Serbian Academy of Science and Arts. Each dictionary example is documented with its author, so we decided to examine only examples that originate from twelve great names in ...
... et al. 2018: Iztok Kosem, Kristina Koppel, Tanara Zingano Kuhn, Jan Michelfeit & Carole Tiberius, Identification and Automatic Extraction of Good Dictionary Examples: the Case(s) of GDEX, International Journal of Lexicography. ... Бранислава Б. Шандрих, Ранка М. Станковић, Мирјана С. Гочанин. "Чији је пример? Анализа лексичких обележја на примерима Речника САНУ" in Српски језик и његови ресурси, Међународни славистички центар, Филолошки факултет, Универзитет у Београду (2019). https://doi.org/10.18485/msc.2019.48.3.ch13
-
Old or New, We Repair, Adjust and Alter (Texts)
Cvetana Krstev, Ranka Stanković (2020) In this paper we present how e-dictionaries and cascades of finite-state transducers implemented in the Unitex tool can be used to solve three text transformation problems: correcting texts after OCR, restoring diacritics, and switching between different language variants. text correction, OCR errors, diacritic restoration, language variants, electronic dictionary, finite-state transducers ... variant containing e. The problem, though different in nature, has similarities with problems of OCR error correction and diacritic restoration: – Like in the case of diacritic omission, “errors” are limited to a small number of letters and/or syllables, which implies that a dictionary solution might ...
... multiple candidates for znaci (a form of znak ‘sign’ and značiti ‘to mean’); – A dictionary of multi-word units (MWU) (nouns, adjectives, adverbs, pronouns, conjunctions and interjections) obtained from a dictionary of more than 18,000 MWU lemmas; for instance, Dobro vece ⇒ Dobro veče ‘Good evening’ ...
... that contains letters c, s, z or digraphs dj, dz, a list of zero or more candidates obtained from the dictionary SMD_DR, or one candidate obtained from lists of trigrams or bigrams, or a dictionary of MWUs for a sequence of words. The result of the application of the procedure to a sample text is given ... Cvetana Krstev, Ranka Stanković. "Old or New, We Repair, Adjust and Alter (Texts)" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.3
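The candidate-generation step mentioned in the snippets above (words containing c, s, z or the digraphs dj, dz are mapped to dictionary forms with diacritics) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the substitution table reflects the standard Serbian Latin letters, the lexicon is a toy set standing in for a resource such as SMD_DR, only lowercase input is handled, and the disambiguation with trigrams, bigrams, and MWU dictionaries is not modeled.

```python
from itertools import product

# Degraded unit -> possible restored spellings (the unchanged spelling is kept).
SUBSTITUTIONS = {
    "dj": ["dj", "đ"],
    "dz": ["dz", "dž"],
    "c": ["c", "č", "ć"],
    "s": ["s", "š"],
    "z": ["z", "ž"],
}


def split_units(word):
    """Split a lowercase word into letters, keeping 'dj' and 'dz' as one unit."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("dj", "dz"):
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def restoration_candidates(word, lexicon):
    """Return all spellings reachable by restoring diacritics that are in the lexicon."""
    options = [SUBSTITUTIONS.get(unit, [unit]) for unit in split_units(word)]
    variants = {"".join(combo) for combo in product(*options)}
    return sorted(variants & set(lexicon))


if __name__ == "__main__":
    toy_lexicon = {"znaci", "znači", "veče"}  # hypothetical stand-in lexicon
    print(restoration_candidates("znaci", toy_lexicon))
    print(restoration_candidates("vece", toy_lexicon))
```

A word with several surviving candidates, such as znaci, is exactly the ambiguous case that the described procedure resolves with additional context; a word with none is left unchanged.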
-
A Data Driven Approach for Raw Material Terminology
Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja (2021) The research presented in this paper aims at creating a bilingual (sr-en), easily searchable, hypertext, born-digital, corpus-based terminological database of raw material terminology for dictionary production. The approach is based on linking dictionaries related to the raw material domain, both digitally born and printed, into a lexicon structure, aligning terminology from different dictionaries as much as possible. This paper presents the main features of this approach, data used for compilation of the terminological database, the procedure by which it has ... raw materials, mining, terminology, dictionary, terminological application, mobile application, digitization, lexical data, corpora, linked open data ... (abbreviation, 77), etc. Figure 2 presents the entry ‘accessory plate’ with five senses, marked by letters a-e. Two senses (a and e) are related to other dictionary terms (a to ‘quartz wedge’ by CF, and e to three synonyms and two other terms by CF), and two senses (b and c) are followed by their source (Pryor) ...
... adding domain terms to general purpose morphological e-dictionaries and extraction of bilingual lists. The process of terminology compilation, from the perspective of monolingual and bilingual extraction, as well as the web and mobile form of the dictionary are given in Section 4. The last section Appl ...
... (UBFMG), and it is the main dictionary covering mining terminology in English in our approach. The online version of the dictionary is published on the Edumine platform, which provides professional development training for people in the mining industry [13]. A multilingual “Mining dictionary: Serbo-Croatian: English: ... Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja. "A Data Driven Approach for Raw Material Terminology" in Applied Sciences, MDPI AG (2021). https://doi.org/10.3390/app11072892