Search
88 items
-
On the compatibility of lexical resources for NooJ
Lexical resources for many languages are provided for the NooJ linguistic development environment. Meta-data descriptions of morphosyntactic and semantic properties of these languages and their resources are a mandatory part of each language module. In this paper we analyze how well the meta-data actually describe resources for a chosen subset of languages and to what extent they are compatible across languages to support multilingual processing. We show that there is room for improvement in both directions.
... applied to texts of Verne’s novel and a linguistic analysis of the results obtained was performed. The analyzed texts were in XML format in compliance with TEI, and their alignment was performed at the sentence level using the ACIDE system (Obradović et al. 2008), which can handle aligned texts in various ...
... http://www.meta-net.eu/projects/cesar/ ... texts of Jules Verne’s novel “Around the world in eighty days” in the same languages was performed. These seven languages were selected due to the fact that both NooJ resources and aligned versions of this novel were available for them. The resources ...
... Dictionary Properties’ Definition files. The section that follows outlines the results of lexical analysis of the application of NooJ resources to aligned texts. Finally, a section is dedicated to some related issues of compatibility and standardization. The paper ends with concluding remarks. Comparison ...
Ranka Stanković, Miloš Utvić, Duško Vitas, Cvetana Krstev, Ivan Obradović. "On the compatibility of lexical resources for NooJ" in Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the 2011 International Nooj Conference, Cambridge Scholars Publishing (2012): 96-108
-
Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian
Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek (2021)
Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek. "Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian" in Language and Text: Data, models, information and applications, John Benjamins Publishing Company (2021). https://doi.org/10.1075/cilt.356.04ruj
-
Managing mining project documentation using human language technology
Purpose: This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach: The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings: The ...
Digital libraries, Information retrieval, Data mining, Human language technologies, Project documentation
Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović, Božo Kolonja. "Managing mining project documentation using human language technology" in The Electronic Library (2018). https://doi.org/10.1108/EL-11-2017-0239
-
Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking
The paper presents the results of research related to the preparation of parallel corpora, focusing on their transformation into RDF graphs using the NLP Interchange Format (NIF) for linguistic annotation. We give an overview of the parallel corpus used in this case study, as well as the process of part-of-speech tagging, lemmatization, and named entity recognition (NER). We then describe named entity linking (NEL), the conversion of data to RDF, and the incorporation of NIF annotations. The produced NIF files were evaluated by exploring the triplestore using SPARQL queries. Finally, the linking of Linked ...
parallel corpora, named entity linking, named entity recognition, NER, NEL, linked data, NIF, Wikidata
Ranka Stanković, Milica Ikonić Nešić, Olja Perisic, Mihailo Škorić, Olivera Kitanović. "Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking" in Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, Turin, 20-25 May 2024, ELRA and ICCL (2024)
-
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ...
... queries in form of regular expressions and graphs; text transformations; processing of monolingual and bilingual texts (bitexts in which basic segments are aligned). Unitex is freely distributed under the terms of the Lesser General Public License (LGPL). This means that everyone ...
... language utterances, as well as enabling various forms of human-machine interaction. It becomes very important in view of the rising amount of texts and data on the web. The term NLP is also used to describe the function of software or hardware components in a computer system which analyze ...
... various languages, but mainly in Serbian, both in video and audio format, but also in written form as parallel (multilingual) corpora of lessons and texts, supported by electronic terminological resources[10], services, and functionalities for searching and browsing of terminological resources and ...
Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2015)
-
A Tel Platform Blending Academic And Entrepreneurial Knowledge
... language support system also handles aligned texts or bitexts, pairs of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.). Aligned texts in BAEKTEL enable better understanding ...
... understanding of OER and follow the standard format for representing aligned texts, the Translation Memory eXchange format (TMX) that is XML-compliant. It should finally be mentioned that due to the complex Serbian grammar the language support system also features grammars implemented ...
... the multilingual approach, the BAEKTEL platform provides electronic terminological resources, parallel (multilingual) corpora of lessons and texts in written form, and functionalities for searching and browsing of terminological resources and using them for text annotation. The contents of ...
Ivan Obradović, Ranka Stanković, Jelena Prodanović, Olivera Kitanović. "A Tel Platform Blending Academic And Entrepreneurial Knowledge" in Proceedings of the Fourth International Conference on e-Learning (eLearning-2013), September 2013, Belgrade, Serbia, Belgrade, Serbia : Belgrade Metropolitan University (2013)
-
Knowledge and Rule-Based Diacritic Restoration in Serbian
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).
... concepts, not between terms (Clarke and Zeng, 2012). However, information-retrieval thesauri are not intended for use in automatic processing of texts: they should be used in manual indexing by human experts for improvement of information retrieval in physical or digital libraries. Thus, there exist ...
... news flows because they are, in fact, lists of selected keywords, denoting the most significant concepts of the domain, with low coverage of real texts (Mdivani, 2013). There are also several Russian versions of international information-retrieval thesauri or controlled vocabularies (Lipscomb, 2000) ...
... are more similar to information-retrieval thesauri guidelines (NISO, 2005). Each concept is linked with words and phrases conveying the concept in texts (text entries). Detailed description of lexical units (words in specific senses), representation of senses of ambiguous words are closer to wordnets ...
Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51
-
Using technology for knowledge transfer between academia and enterprises
Ivan Obradović, Ranka Stanković (2014)
... In addition to that, textual resources feature aligned texts and corpora. Aligned texts are pairs of texts in different languages, mainly an original and its translation, aligned on some structural level, most often the sentence. Aligned texts in LSS are in the standard, Translation Memory eXchange ...
... eXchange (TMX) format, which is XML-compliant. Corpora are large and structured sets of texts, both monolingual and multilingual, the latter often composed of aligned texts. Finally the web itself represents a textual resource that LSS makes use of. Specific features of Serbian grammar need c ...
... Serbian Wordnet. Romanian Journal of Information Science and Technology, 7(1-2), 147-161. Krstev C., (2008). Processing of Serbian – Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade. Lee, W. O. (2008). The repositioning of high education from ...
Ivan Obradović, Ranka Stanković. "Using technology for knowledge transfer between academia and enterprises" in Knowledge and Management Models for Sustainable Growth, Proc. of IFKAD 2014, 9th International Forum on Knowledge Asset Dynamics, 11-13 June 2013, Matera, Italy, Bari : IFKAD (2014)
-
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...
... pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely ...
... (MWT) extraction as this problem has been gaining in importance in the field of Natural Language Processing. Initially, MWT extraction from domain texts has been tackled mainly using the statistical approach based on different statistical measures, following the seminal work of Kenneth Church and ...
... documents (Chen et al., 2006). Statistical measures of co-occurrence (MI3 – mutual information) were used for finding MWT candidates in Croatian texts (Tadić & Šojat, 2003). Although the statistical approach has been steadily pursued by a number of researchers, development of lexical resources ...
Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23-28 May 2016, European Language Resources Association (2016)
-
SrpELTeC: A Serbian Literary Corpus for Distant Reading
This article presents SrpELTeC, a corpus developed within the COST Action Distant Reading for European Literary History (CA16204). All novels in SrpELTeC were selected, prepared, and annotated using common principles established for all language collections in the European Literary Text Collection (ELTeC). The challenges and solutions in preparing SrpELTeC from scratch are described. All novels were manually encoded in TEI with rich metadata and structural annotations. Automatic annotation included POS tagging, lemmatization, and named entities, relying on resources for processing ...
digital humanities, Serbian literature, text corpora, distant reading, linked data, named entity recognition, text analytics
Ranka Stanković, Cvetana Krstev, Duško Vitas. "SrpELTeC: A Serbian Literary Corpus for Distant Reading" in Primerjalna književnost, Research Centre of the Slovenian Academy of Sciences and Arts (2024). https://doi.org/10.3986/pkn.v47.i2.03
-
A Mathematical Learning Environment Based on Serbian Language Resources
In recent years, in line with the ever-growing use of information technology, learning environments have been changing. The amount of available learning materials in various forms has increased. These new environments demand comprehensive learning systems, which enable management of the learning corpus with special attention paid to relevant lexical resources. In this paper we present the concept of a Mathematical Learning Environment in Serbian (MLES), which is based on a corpus of mathematical materials and various lexical resources, enabling ...
... challenge to corpus processing results from the use of two alphabets: Latin and Cyrillic, with different coding schemas and formats of source texts, as well as from various ways of expressing mathematical content. In order to resolve the problem of two alphabets, the entire corpus is tran ...
... real life problems from engineering practice based on mathematical concepts (Figure 3). Results of the third component are annotated and linked texts, where every mathematical term in the text is linked to the appropriate dictionary entry or relevant corpus content related to that term. This ...
... GOALS AND CHALLENGES Searching and processing mathematical materials is a complex problem. Standard text processors cannot recognize mathematical texts in a proper way. There is thus a need for developing new and adapting existing processors for that purpose. Processing of mathematical content ...
Radojičić Marija, Obradović Ivan, Stanković Ranka, Utvić Miloš, Kaplar Sebastijan. "A Mathematical Learning Environment Based on Serbian Language Resources" in Proceedings of the 7th International Scientific Conference Technics and Informatics in Education, Faculty of Technical Sciences, Čačak (2018)
-
Integrisano okruženje za pripremu paralelizovanog korpusa [An integrated environment for the preparation of an aligned corpus]
The development of aligned corpora requires the preparation of parallel texts for their integration into an aligned corpus. This is a complex task that can be solved in different ways and that must be carried out in several steps. This paper first outlines the procedure for preparing parallel texts for an aligned corpus, as used in the Human Language Technology Group at the University of Belgrade. It then gives a brief overview of the programs (XAlign, Concordancier, WS4LR), that is, the software tools used in the process. The lack of a comfortable environment ...
... the IJS-ELAN Parallel Corpus. Informatica, 26(3), pp. 299-307, 2002. SUMMARY The development of aligned corpora requires a preparation of parallel texts for their integration into aligned corpora. This is a very complex task, which can be solved in different ways, and which has to be realized ...
... steps. At the beginning of this paper we outline the procedure for preparation of parallel texts for aligned corpora which is being used in the Human Language Technology Group at the University of Belgrade. Texts are marked using XML tags, in accordance with the TEI (Text Encoding Initiative) consortium ...
... environment for the preparation of aligned corpora, under the name of ACIDE. For the construction of this environment we chose the C# programming language. Among other things, ACIDE provides a graphical user interface (GUI) for alignment and visualization of aligned texts, their control and correction ...
Ivan Obradović, Ranka Stanković, Miloš Utvić. "Integrisano okruženje za pripremu paralelizovanog korpusa" in Zbornik radova međunarodnog simpozijuma Razlike između bosanskog/bošnjačkog, hrvatskog i srpskog jezika, Graz, Austria, April 2007 (2007)
-
Building learning capacity by blending different sources of knowledge
... of storing specific textual resources, such as aligned texts and corpora. Aligned texts are pairs of texts in different languages, mainly an original and its translation, aligned on some structural level, most often the sentence. Aligned texts in BMP are in the standard, Translation Memory eXchange ...
... eXchange (TMX) format, which is XML-compliant. Corpora are large and structured sets of texts, both monolingual and multilingual, the latter often composed of aligned texts. Finally the World Wide Web itself represents a textual resource that BMP language support system makes use of. The ...
... In Digital Repositories: Practices and Perspectives, D-Lib Magazine, Volume 16, Number 1/2. Krstev C., (2008). Processing of Serbian – Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade. Lee, W. O. (2008). The repositioning of high education from ...
Ivan Obradović, Ranka Stanković, Olivera Kitanović, Dalibor Vorkapić. "Building learning capacity by blending different sources of knowledge" in International Journal of Learning and Intellectual Capital (2016). https://doi.org/10.1504/IJLIC.2016.075698
-
Towards Automatic Definition Extraction for Serbian
The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in the Serbian language, with the aim of accelerating dictionary development. Definitions in the Serbian Academy of Sciences and Arts (SASA) dictionary were used to model different types of definitions (descriptive, grammatical, referential, and synonymous) that have different syntactic and lexical characteristics. The research corpus consists of 61,213 noun definitions, which were analyzed using morphological e-dictionaries and local grammars implemented as finite-state transducers in a corpus-processing suite of open ...
... should be used for extraction from unstructured texts than are necessary when modelling dictionary definitions. 5 Conclusion The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in the Serbian language, with the aim of a ...
... Serbia Abstract The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in the Serbian language with the aim of accelerating dictionary development. Definitions in the Serbian Academy of Sciences and Arts (SASA) dictionary ...
... (2019) associate a detailed annotation scheme with the corpus in order to explore diverse structures of term definitions in free and semi-structured texts. In addition to the basic concept (Term) and its main definitions (Definition), sentence segments containing pseudonyms or additional names (Alias ...
Ranka Stanković, Cvetana Krstev, Rada Stijović, Mirjana Gočanin, Mihailo Škorić. "Towards Automatic Definition Extraction for Serbian" in Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Democritus University of Thrace (2021)
-
Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction
Velislava Stoykova, Ranka Stanković (2018)
Velislava Stoykova, Ranka Stanković. "Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction" in Advances in Intelligent Systems and Computing, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-91189-2_16
-
An Approach to Development of Bilingual Lexical Resources
... [Information Storage and Retrieval]: Digital Libraries – Collection General Terms Documentation, Languages Keywords Digital libraries, aligned parallel texts, TMX document collections, multilingual lexical resources, bilingual search 1. INTRODUCTION Multilingual information exchange is growing ...
... collection was generated from INFOtheca articles using another of our tools, named ACIDE, an integrated development environment for generating aligned parallel texts [Obradović et al., 2008]. As for available lexical resources, we had at our disposal Serbian morphological e-dictionaries [Krstev, 2008] ...
... wordnets connected via the interlingual index, and a bilingual Dictionary of Librarianship, as well as on a TMX document collection generated from aligned Serbian-English journal articles published in INFOtheca, a scientific journal in the area of Library and Information Sciences. The aim of the new ...
Stanković Ranka, Obradović Ivan, Trtovac Aleksandra. "An Approach to Development of Bilingual Lexical Resources" in Proceedings of the Fifth Balkan Conference in Informatics BCI 2012, Workshop on Computational Linguistics and Natural Language Processing of Balkan Languages – CLoBL 2012, September 2012, Novi Sad : BCI (2012)
-
Improvement of geodatabase queries within GeolISS
Ranka Stanković (2008)
... handles aligned texts. A pair of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.) is known as an aligned text or bitext. The standard format for representing aligned texts ...
... is the Translation Memory eXchange format (TMX) that is XML-compliant [13]. Expanded query can be applied on TMX documents in order to retrieve aligned segments that correspond to search criteria in the source and target language. A filtered TMX document is transformed into XML, TXT and HTML output ...
... Developer network (http://edn.esri.com) [8] Vitas D., G. Pavlović-Lažetić, C. Krstev, Lj. Popović, I. Obradović (2003): „Processing Serbian Written Texts: An Overview of Resources and Basic Tools“, Proceedings of the International Workshop on Balkan Language Resources and Tools, Thessaloniki, Greece ...
Ranka Stanković. "Improvement of geodatabase queries within GeolISS" in Review of the National Center for Digitization, Beograd : Faculty of Mathematics, Belgrade (2008)
-
Transformer-Based Composite Language Models for Text Evaluation and Classification
Parallel natural language processing systems were previously successfully tested on the tasks of part-of-speech tagging and authorship attribution through mini-language modeling, for which they achieved significantly better results than independent methods in the cases of seven European languages. The aim of this paper is to present the advantages of using composite language models in the processing and evaluation of texts written in arbitrary highly inflective and morphology-rich natural language, particularly Serbian. A perplexity-based dataset, the main asset for the ...
Mihailo Škorić, Miloš Utvić, Ranka Stanković. "Transformer-Based Composite Language Models for Text Evaluation and Classification" in Mathematics, MDPI AG (2023). https://doi.org/10.3390/math11224660
-
Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution
This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in 7 different languages, totaling 7051 unique 10,000-token chunks from 700 PoS and lemma annotated documents. We used these documents to produce four document embedding models using the Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these ...
Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, Maciej Eder. "Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution" in Mathematics, MDPI AG (2022). https://doi.org/10.3390/math10050838
-
Towards translation of educational resources using GIZA++
... of semantically related word pairs, ideally translational equivalents, is presented, from aligned texts in SELFEH, a Serbian-English corpus of texts related to education, finance, health and law, aligned at the sentence level within Intera project. The corpus was lemmatized and the method applied ...
... attached to each aligned sentence (element) in order to establish a direct relation to metadata and the original (pdf, edX, docx,…) form of resource document, article, course or other resource. Image 2 presents one part from the TMX document with ID: 1.2010.1.4. From aligned TMX documents ...
... needs several reviews before publishing or preparation for voice recording. [10] A Computer Aided Translation (CAT) Tool is based on collection of aligned sentence pairs in the form of Translation Memory, which facilitates and speeds up the translator's work. Main key functions of a CAT tool that speed ...
Ivan Obradović, Dalibor Vorkapić, Ranka Stanković, Nikola Vulović, Miladin Kotorčević. "Towards translation of educational resources using GIZA++" in The Seventh International Conference on e-Learning (eLearning-2016), September 2016, Belgrade : Metropolitan University (2016)