54 items
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if for a source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ...aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection... Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL 2006 Workshop on Multi-word expressions in a multilingual context, pages 33–40. Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational linguistics, 29(1):19–51 ...
... tool for word and chunk alignment. In this first experiment a source language is English, a target language is Serbian, a domain is Library and Information Science for which a bilingual terminological dictionary exists. Our term extractor is based on e-dictionaries and shallow parsing, and for word alignment ...
... different Serbian domain phrases, containing 515 Serbian phrases that were not present in the existing domain terminology. Keywords: aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection 1. Motivation Terminology is rapidly developing in many research and ...Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment
Sina Ahmadi, John P McCrae, Sanni Nimb, Fahad Khan, Monica Monachini, Bolette S Pedersen, Thierry Declerck, Tanja Wissik, Andrea Bellandi, Irene Pisani, [...] Ranka Stanković and others (2020) Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages ...... monolingual word sense alignment. Different dictionaries and related resources such as wordnets and encyclopedia have significant differences in structure and heterogeneity in content, which makes aligning information across resources and languages a challenging task. Word sense alignment (WSA) is ...
... Gurevych, I. (2013). Dijkstra-WSA: A graph-based approach to word sense alignment. Transactions of the Association for Computational Linguistics, 1:151–164. Matuschek, M. and Gurevych, I. (2014). High performance word sense alignment by joint modeling of sense distance and gloss similarity. In ...
... further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA. Keywords: lexical semantic resources, sense alignment, lexicography, language ...Sina Ahmadi, John P McCrae, Sanni Nimb, Fahad Khan, Monica Monachini, Bolette S Pedersen, Thierry Declerck, Tanja Wissik, Andrea Bellandi, Irene Pisani, [...] Ranka Stanković and others. "A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment" in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, European Language Resources Association (ELRA) (2020)
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...... distinct multi-word forms were evaluated as proper multi-word units, and among them 97% were associated with correct lemmas. Keywords: term extraction, terminology, multi-word units, lemmatization, finite-state transducers 1. Motivation Various approaches have been proposed for multi-word term (MWT) ...
... method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is ...
... this path rejects nouns that are homographous with other PoS word forms in order to avoid false recognitions. Given the high level of homography of word forms in Serbian it is possible that two or more graphs recognize the same word sequence where only one of them is correct. In the case of ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23--28 May 2016, European Language Resources Association (2016)
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
Resource-based WordNet Augmentation and Enrichment
In this paper we present an approach to support production of synsets for Serbian WordNet (SerWN) by adjusting Princeton WordNet (PWN) synsets using several bilingual English-Serbian resources. PWN synset definitions were automatically translated and post-edited, if needed, while candidate literals for Serbian synsets were obtained automatically from a list of translational equivalents compiled from bilingual resources. Preliminary results obtained from a set of 1248 selected PWN synsets show that the produced Serbian synsets contain 4024 literals, out of which 2278 were offered by the system we present in this paper, whereas experts added the remaining 1746. Approximately one half of ...... the two wordnets, such as the ILI. Automatic alignment of synsets belonging to different languages is closely related to the task of pairing their word senses. This approach was followed by Matuschek and Gurevych (2013) who solved the word sense alignment (WSA) task by pairing senses with the same ...
... were to identify the most plausible point for placing each of the word senses from this set in PWN, either by merging it into an existing synset, or adding it as a new hyponym synset. Keywords: WordNet, bilingual resources, term alignment, parallel lists ... Five teams submitted 13 systems, with all ...
... aligned wordnets. The English part of each corpus was semantically tagged, after which the process of wordnet creation was transformed into a word alignment problem, where wordnet synsets in the English part of the corpus were aligned with in the target language part of the corpus. The obtained precision ...Ranka Stanković, Miljana Mladenović, Ivan Obradović, Marko Vitas, Cvetana Krstev. "Resource-based WordNet Augmentation and Enrichment" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018)
Towards translation of educational resources using GIZA++
... translation variants in large parallel corpora [17]. Volk et al. argue that automatic word alignment allows for major innovations in searching parallel corpora. Some online query systems already employ word alignment for sorting translation variants, but they describe the system for efficiently searching ...
... The corpus was lemmatized and the method applied on lemmas of word forms from the corpus, by extracting candidate translational equivalents through a ranking based on lemma frequencies. Similar experiments with the alignment on the word level were performed also on the Intera English/Serbian corpus ...
... bin/build_binary \ edX.arpa.en \ edX.blm.en Finally, we came to the main event - training the translation model. To do this, we ran word-alignment (using GIZA++), phrase extraction and scoring, created lexicalised reordering tables and Moses configuration file, all with a single command ...Ivan Obradović, Dalibor Vorkapić, Ranka Stanković, Nikola Vulović, Miladin Kotorčević. "Towards translation of educational resources using GIZA++" in The Seventh International Conference on e-Learning (eLearning-2016), September 2016, Belgrade : Metropolitan University (2016)
The shear strength evaluation of rough and infilled joints and its indications for stability of rock cutting in schist rock mass
Construction of the E75 highway section through the Grdelica gorge was one of the most demanding projects realized in recent Serbian history. The alignment, approximately 25 km long, consists of several tens of bridges, two tunnels, three galleries and cuts with a total length of 6 km. The alignment passes through a highly anisotropic Palaeozoic schist rock formation of different weathering grades. This study focuses on shear strength properties of discontinuities, which are found to be the critical feature contributing to the occurrence ...Dušan Berisavljević, Zoran Berisavljević, Svetlana Melentijević. "The shear strength evaluation of rough and infilled joints and its indications for stability of rock cutting in schist rock mass" in Bulletin of engineering geology and the environment, Springer (2022). https://doi.org/10.1007/s10064-022-02580-8
Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++
Branislava Šandrih, Ranka Stanković (2020) In science, industry and many research fields, terminology develops rapidly. Most often, the language that serves as the "lingua franca" for most of these fields is English. As a consequence, in many fields domain terms are coined in English and later translated into other languages. In this paper we present an approach to the automatic extraction of bilingual terminology for the English-Serbian language pair that relies on an aligned bilingual domain corpus, a terminology extractor for the target language and a tool for chunk alignment. We examine the performance of the method on the domain ...... The alignment of chunks began with pre-processing using MOSES (Koehn et al., 2007) to perform tokenisation, truecasing and cleaning. In the next step a 3-gram translation model was built using KenLM (Heafield, 2011), followed by the training of this translation model. For the purpose of word-alignment ...
... tools for the extraction of English MWTs (Eng-TE) and Serbian MWEs (Serb-TE) implemented in LeXimir (Stanković et al., 2016) and on GIZA++ for word alignment, while all other components are newly developed. In our experiments we combined each of the three following parameters, all related to the preparation ...
... Stanković. “Two Approaches to Compilation of Bilingual Multi-Word Terminology Lists from Lexical Resources”. Natural Language Engineering, 2019 Xu, Yan, Luoxin Chen, Junsheng Wei, Sophia Ananiadou, Yubo Fan et al. “Bilingual Term Alignment from Comparable Corpora in English Discharge Summary and Chinese ...Branislava Šandrih, Ranka Stanković. "Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.6
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses a term extraction tool. For both approaches, four experiments were performed with two parameters being ...Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Two approaches to compilation of bilingual multi-word terminology lists from lexical resources" in Natural Language Engineering, Cambridge University Press (CUP) (2020). https://doi.org/10.1017/S1351324919000615
Integrisano okruženje za pripremu paralelizovanog korpusa
The development of aligned corpora requires the preparation of parallel texts for their integration into an aligned corpus. This is a complex task that can be solved in different ways and that must be carried out in several steps. In this paper we first outline the procedure for preparing parallel texts for the aligned corpus that is used by the Language Technology Group of the University of Belgrade. We then give a brief overview of the programs (XAlign, Concordancier, WS4LR), that is, the software tools used in the process. The lack of a comfortable environment ...... files in TMX format into files for individual languages • verticalization of text. All the listed functions are available through the menus Alignment, Tools and TMX. The Alignment menu provides a GUI for the alignment software packages of the Loria laboratory. The individual menu items enable the use of each ...
... Encoding Initiative) consortium recommendations, and their alignment is performed at the level of paragraphs and sentences. We then give an overview of the software, namely programs (XAlign, Concordancier, WS4LR) that are used for alignment. The absence of a comfortable environment with a graphical ...
... construction of this environment we chose the C# programming language. Among other things, ACIDE provides a graphical user interface (GUI) for alignment and visualization of aligned texts, their control and correction, as well as generation of files in TMX format. ACIDE also enables the decomposition ...Ivan Obradović, Ranka Stanković, Miloš Utvić. "Integrisano okruženje za pripremu paralelizovanog korpusa" in Zbornik radova međunarodnog simpozijuma Razlike između bosanskog/bošnjačkog, hrvatskog i srpskog jezika, Graz, Austria, April 2007, - (2007)
Keyword-Based Search on Bilingual Digital Libraries
This paper outlines the main features of Biblisha, a tool that offers various possibilities of enhancing queries submitted to large collections of aligned parallel text residing in a bilingual digital library. Biblisha supports keyword queries as an intuitive way of specifying information needs. The keyword queries initiated, in Serbian or English, can be expanded semantically, morphologically and into the other language, using different supporting monolingual and bilingual resources. Terminological and lexical resources are of various types, such as wordnets, electronic ...Ranka Stanković, Cvetana Krstev, Duško Vitas, Nikola Vulović, Olivera Kitanović. "Keyword-Based Search on Bilingual Digital Libraries" in Semantic Keyword-Based Search on Structured Data Sources - Second COST Action IC1302 International KEYSTONE Conference, IKC 2016, Springer (2017). https://doi.org/10.1007/978-3-319-53640-8_10
A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals
This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed ...... two alignment tools developed by LORIA (Laboratoire lorrain de recherche en informatique et ses applications), one for automatic sentence alignment of texts (Xalign, http://led.loria.fr/outils/ALIGN/align.html), and another for alignment visualization and manual correction of alignment errors ...
... for Serbian and English. Sr Tokens Types Words Word types Sentences total 200,694 68,501 159,031 65,488 7,986 min 746 369 609 342 27 max 12,214 3,260 8,440 3,105 414 avg 4,561 1,557 3,614 1,488 182 En Tokens Types Words Word types Sentences total 220,120 50,747 178,269 47 ...
... the system recall without negative effects on precision. Moreover, for multi-word units not found in dictionaries, there exists a rule-based strategy, which attempts to recognize syntactic structure of the multi-word and how it should be inflected [Stankovic et al., 2011]. Another type of resources ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Aleksandra Trtovac, Miloš Utvić. "A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals" in Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, May 2012, Istanbul, Turkey, Istanbul, Turkey : European Language Resources Association (2012)
A bilingual digital library for academic and entrepreneurial knowledge management
A generic knowledge management process of organization, storage and retrieval of knowledge can suitably be fitted in a digital library. In the digital and knowledge age digital libraries can be used in knowledge management to handle intellectual assets and support knowledge creation. A multilingual digital library either stores content in more than one language or provides multilingual query access to monolingual content. In Serbia 18 of 308 scientific journals regularly published are bilingual, with papers simultaneously being in English ...... Concordancier, developed in Loria laboratory in France (Laboratoire Lorrain de Recherche en Informatique et ses Applications) are used for alignment. The alignment method is based on the number of characters (length of the segment). Utvić reports that this approach is very successful (as much as 96% ...
... native XML DBMS database to enterprise NoSQL. In one platform, it combines a database, search engine and application services. The preliminary alignment phase consists of preparing an XML document (eXtensible Markup Language) according to TEI (Text Encoding Initiative) guidelines.2 Practically, ...
... formedness checking and validation according to a DTD (Document Type Definition) or XML Scheme can be used for that purpose. The next key step is the alignment itself: the task is to establish relations between translation equivalents in both texts. In this case, segments are paired that usually represent ...Ranka Stanković, Cvetana Krstev, Biljana Lazić, Dalibor Vorkapić. "A bilingual digital library for academic and entrepreneurial knowledge management" in Proceeding of 10th International Forum on Knowledge Asset Dynamics — IFKAD 2015: Culture, Innovation and Entrepreneurship: connecting the knowledge dots, Bari, Italy, 10-12 June 2015, Bari : IFKAD (2015)
The Dictionary of the Serbian Academy: from the Text to the Lexical Database
In this paper we discuss the project of digitization of the Dictionary of the Serbo-Croatian Standard and Vernacular Language. Scanning and character recognition were a particular challenge, since various non-standard character set encodings were used in the course of the almost 60-year long production of the dictionary. The first aim of the project was to formalize the micro-structure of the dictionary articles in order to parse the digitized text and transform it into structured data stored in a relational lexical database. This approach ...... Similarly, comparison and partial alignment of the DSA tag set was done with Ontolex and LexInfo, but a more precise and detailed alignment is envisaged. The dictionary article ...
... of lexicographic leaflets is described in Stijović (2018). Out of 19 volumes, two were available as MS Word files, two as PDF files, and the others only in paper form. Unfortunately, neither MS Word nor PDF files could be used without further preprocessing, since non-Unicode character sets were used ...
... guidelines for dictionary writing were used to define the rules for the segmentation of the dictionary articles, the pattern recognition, and the alignment of the recognized markers with the predefined categories, as described in the previous section. The dictionary article units that were recognized ...Ranka Stanković, Rada Stijović, Duško Vitas, Cvetana Krstev, Olga Sabo. "The Dictionary of the Serbian Academy: from the Text to the Lexical Database" in Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts, Ljubljana : Ljubljana University Press, Faculty of Arts (2018)
A Data Driven Approach for Raw Material Terminology
Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja (2021) The research presented in this paper aims at creating a bilingual (sr-en), easily searchable, hypertext, born-digital, corpus-based terminological database of raw material terminology for dictionary production. The approach is based on linking dictionaries related to the raw material domain, both digitally born and printed, into a lexicon structure, aligning terminology from different dictionaries as much as possible. This paper presents the main features of this approach, data used for compilation of the terminological database, the procedure by which it has ...raw materials, mining, terminology, dictionary, terminological application, mobile application, digitization, lexical data, corpora, linked open data... a total of 2285 English and 308 Serbian terms. The GIZA++ [43] and Moses toolkit [44] for statistical machine translation (SMT) were used for word alignment. Aligned chunks, presented in the so-called phrase table, are obtained as output from Moses, together with their phrase translation scores. After ...
... lingual dictionaries and Serbian part of bilingual dictionaries. Translation equivalents are retrieved from bilingual dictionaries and within the word alignment phase (more in Section 4.2), keeping again information about the original dictionary source. Extracted terms were also subject to a labeling procedure ...
... http://rudonto.rgf.bg.ac.rs/ (accessed on 12 February 2020). 27. Andonovski, J.; Šandrih, B.; Kitanović, O. Bilingual lexical extraction based on word alignment for improving corpus search. Electron. Libr. 2019, 37, 722–739. [CrossRef] 28. Radojičić, M.; Obradović, I.; Stanković, R.; Utvić, M.; Kaplar ...Olivera Kitanović, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić, Ivan Babić, Ljiljana Kolonja. "A Data Driven Approach for Raw Material Terminology" in Applied Sciences, MDPI AG (2021). https://doi.org/10.3390/app11072892
WS4LR - a Workstation for Lexical Resources
... lemmas and of inflected forms. In LADL format, all the entries in the dictionary of simple word lemmas, the so called DELAS, have the following form: lemma.Knnn [+SinSem]* where lemma is the simple word, in general in the form usually used in traditional dictionaries, K is the part of speech mark ...
... inflectional properties are used to produce the morphological dictionary of word forms, the so called DELAF. All the entries in this dictionary have the following form: form,lemma[:categories]* where form is a simple word form of a lemma that is represented by its DELAS entry form, and :categories ...
... 1 Term multi-word unit is sometimes used. 2 Intex homepage: http://msh.univ-fcomte.fr/intex/ 3 Unitex homepage: http://www-igm.univ-mlv.fr/~unitex/ 4 Nooj homepage: http://www.nooj4nlp.net ... the possible grammatical categories of the word form, each category represented ...Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović. "WS4LR - a Workstation for Lexical Resources" in Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006, ELRA - European Language Resources Association (2006)
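The DELAF entry shape quoted in the snippets above, form,lemma[:categories]*, can be illustrated with a minimal parsing sketch. It assumes only what the snippet states: a comma separates the word form from the lemma part, and zero or more colon-prefixed category strings follow. The names DelafEntry and parse_delaf_line and the sample line are hypothetical, invented for illustration, and are not taken from WS4LR, Unitex or any real dictionary; escaped commas and other details of the real format are not handled.

```python
from typing import List, NamedTuple


class DelafEntry(NamedTuple):
    """One dictionary line of the shape form,lemma[:categories]* (hypothetical type)."""
    form: str
    lemma: str              # kept verbatim, including any code after the dot
    categories: List[str]   # zero or more category strings


def parse_delaf_line(line: str) -> DelafEntry:
    """Split a line at the first comma, then split the remainder on colons."""
    form, sep, rest = line.strip().partition(",")
    if not sep or not form or not rest:
        raise ValueError(f"not a form,lemma[:categories]* line: {line!r}")
    lemma, *categories = rest.split(":")
    return DelafEntry(form, lemma, categories)


# Invented sample line; the codes "N1", "p" and "s" are placeholders.
entry = parse_delaf_line("houses,house.N1:p:s")
print(entry.form, entry.lemma, entry.categories)
```

The lemma part is kept as one string because the snippet says only that it is represented by its DELAS entry form; splitting it further would require details that the excerpt elides.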
Softverski alati za korišćenje resursa za srpski jezik
Ivan Obradović, Ranka Stanković (2008) ... smaller, such as words. The second step is the alignment of segmented parallel texts by means of one of the available alignment methods. The goal is to connect equivalent segments in two or more parallel texts. The method usually used for alignment at the sentence level, which is the most common ...
... production of all morphological forms of the word. The generated inflectional forms of the word are stored in the morphological dictionary of simple word forms called DELAF (Dictionnaire électronique des formes fléchies – electronic dictionary of word forms), where the main format of data is the ...
... the morphological dictionary of simple word, named DELAS (Dictionnaire électronique des mots simples – electronic dictionary of simple words), are in the following form: lema.Knnn [+SinSem]* where lema in general stands for the word form of a simple word used in traditional dictionaries. Knnn ...Ivan Obradović, Ranka Stanković. "Softverski alati za korišćenje resursa za srpski jezik" in INFOteka: časopis za informatiku i bibliotekarstvo, Belgrade, Serbia : Zajednica biblioteka univerziteta u Srbiji (2008)
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies for its work on several sophisticated textual and lexical resources that were developed for most of Balkan languages. These resources are based on several de facto standards in natural language processing.... text logical layout. At the beginning of the alignment process all segments coincided with sentences automatically tagged by Unitex. The XAlign system [1] was used for the alignment process. Starting from the French version, the goal of the alignment was to establish 1:1 relations on the segment ...
... second language, presented in Figure 6. Two Bulgarian literals thus obtained are плавателен съд and малък кораб which are multi-word units. Since inflection of multi-word units for Bulgarian is not yet integrated in WS4LR, as will be explained in the final section, a user can choose to delete it ...
... example of Serbian and Greek 5. Further Work Our main concern for the future work is adequate processing of multi-word units. That is, we would like our tool to treat multi-word units in the same way as simple words and to inflect them correctly upon request. The first version of this approach ...Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
Knowledge and Rule-Based Diacritic Restoration in Serbian
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).... that is precisely formulated terms referring to implied concepts. If an unambiguous and clear name in form of an existing word or a phrase cannot be found, then an ambiguous word can be used for naming and supplied with a “relator” (a brief note in parentheses). The RuThes concepts are not divided ...
... Processing for Text Analytics The main stages of thesaurus-based document processing include: • Tokenization and lemmatization, that is, the transfer of word forms to dictionary forms (lemmas); • Matching with the thesaurus based on the lemma representation of the document. Multiword terms from a thesaurus ...
... Disambiguation of ambiguous text entries. Brown and blue boxes on Fig. 3 highlight ambiguous terms, which were automatically resolved. For example, Russian word demografiya (demography) can mean demographic situation or demographic science. The quality of the disambiguation procedure was previously evaluated ...Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51
Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was further used to prepare training sets for four different levels of annotation, which were further used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with a rule- and lexicon-based system were evaluated on ...... the same) or weighted, where partial overlapping is taken into account, but with some weighted value to measure overlapping segment. To indicate alignment type, one can choose among the two options: the first option is greedyMatching, where the matching of annotations in the first and second files ...
... 2 × 3 × 4 evaluation rounds: two test sets, three NERs and four models per each. All trials were run with strict matching type and maxMatching alignment type. To indicate the chosen score type to evaluate the correspondence between one annotation from the first file and one annotation from the second ...
... entities, classes, attributes per document and collection; Gemini tool allows comparison of two text annotation files and provides different alignment scores. It is possible to compare a pair of XML files, a pair of files in BRAT format and one XML file against a file in BRAT format. The first ...Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names" in Proceedings - Natural Language Processing in a Deep Learning World, Incoma Ltd., Shoumen, Bulgaria (2019). https://doi.org/10.26615/978-954-452-056-4_122