Search
219 items
-
An Italian-Serbian Sentence Aligned Parallel Literary Corpus
This article presents the construction and relevance of an Italian-Serbian sentence-aligned parallel corpus, delving into the aligned sentences in order to facilitate effective translation between the two languages. The parallel corpus serves as a valuable resource for language experts, researchers, and language enthusiasts, fostering a deeper understanding of linguistic nuances and cultural expressions. By bridging the gap between Serbian and Italian, this corpus opens new avenues for cross-cultural communication and collaboration, and ultimately contributes to the improvement of language-related ...Saša Moderc, Ranka Stanković, Aleksandra Tomašević, Mihailo Škorić. "An Italian-Serbian Sentence Aligned Parallel Literary Corpus" in Review of the National Center for Digitization, Belgrade : Faculty of Mathematics, University of Belgrade (2023). https://doi.org/10.5281/zenodo.11203388
-
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies for its work on several sophisticated textual and lexical resources that were developed for most of Balkan languages. These resources are based on several de facto standards in natural language processing.... stems from the fact that it presents the sample text for the French distribution of the Unitex system [15]. Versions of the novel in fifteen languages have been acquired, but not all of these texts have yet been aligned; Among already aligned texts are French original and translations in English ...
... visualization of aligned texts by applying appropriate XSLT transformations. Thus visualized texts user can freely browse. One such visualization is represented in Figure 1. Browsing, however, is not a particularly successful form of text exploration. WS4LR module for aligned texts offers users ...
... the Prolex database since we plan to use it in a translation environment [14]. It is our wish to work in a future with a true aligned Balkan text – that is, a text originally written in some Balkan language and translated to other Balkan languages. Figure 11. Results of a query bilingually ...Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
-
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if for a source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ...aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection... Serbian-English (in this text referred to as ‘Dictionary’) (Ljiljana Kovačević, 2014) has started in 2001 at the National Library of Serbia, with the aim of presenting the librarianship terminology on different media (Kovačević et al., 2004). This resource was first used on aligned texts in query ex- ...
... an analysis of several Serbian terminological dictionaries and Serbian e-dictionary of MWEs. This system was applied to the Serbian part of the aligned text (presented in 4.1.) and the results of its work are presented in Table 1.7 For each class the syntactic structure it recognizes is output as well ...
... 43 text features (also referred as “linguistic” features in (Ebert, 2017; Repar and Pollak, 2017)) from original (GIZA_SRP_ORIG) and lemmatized (GIZA_SRP_LEMM) form of Serbian chunk obtained from GIZA++, corresponding extracted Serbian term (SRP_EXTRACTED) and from the English part of the aligned chunk ...Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
-
Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons
Mihailo Škorić (2017) The goal of this paper is to draw attention to the possibility of using emoticon-riddled text on the web in language-neutral sentiment analysis. It introduces several innovations in the existing framework of research and tests their effectiveness. It also presents a software tool especially made for that purpose, explains how it builds a database with sentimental value of terms and offers the user manual. Finally, it presents a software tool that tests the new database and gives some examples ...... meaning of written text, but only the grammar of the language that text is written on, which enables wider application. – Software that has a deeper understanding of the meaning of the text, often limited to one or a small number of areas. This type of software is predominantly used for text classification ...
... message does not contain text, and its determiner must refer to previous message. 3. if the message contains both the determiner and the text, and the following message contains determiner but not text – determiners from both messages will refer to the message that contains text. Example: A: I missed the ...
... g and analysis: understanding of written text and text queries, analysis of moods in the text, processing of digital linguistic resources such as automatic parallelization and automation of any operation that requires a deep understanding of the written text. – Artificial intelligence: automated co ...Mihailo Škorić. "Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons" in Infotheca, Faculty of Philology, University of Belgrade (2017). https://doi.org/10.18485/infotheca.2017.17.1.4
-
Old or New, We Repair, Adjust and Alter (Texts)
Cvetana Krstev, Ranka Stanković (2020) In this paper we present how e-dictionaries and cascades of finite-state transducers implemented in the Unitex tool can be used to solve three text transformation problems: correcting texts after OCR, restoring diacritics, and switching between different language variants. Keywords: text correction, OCR errors, diacritic restoration, language variants, electronic dictionary, finite-state transducers ... Mining and Geology ranka.stankovic@rgf.bg.ac.rs Belgrade, Serbia 1 Text mending – introduction to problems Text mending is one of the simplest text transformation problems, when compared to speech recognition and generation, text summarization and machine translation. It is also one of the first problems ...
... character recognition (OCR) is applied. A text that fully corresponds to the original is rarely obtained since OCR is prone to errors. The quality of the resulting text depends on various factors: the software used, quality of the paper and print of the original text, and its language and alphabet. OCR software ...
... to a clean text.5 A text after OCR - Е. *нпjе него броћ! Тебе ће неко *еад *пптатн шта ти хоћеш, а *пгга нећеш! Него. кажи ти мени. jе ли теби *бнла позната моjа наредба, коjом се забрањуjе тумарање по турским кућама? — *Нпjе. — Jа где си ти *бно за ово месец дана — У *болннци. A text after automatic ...Cvetana Krstev, Ranka Stanković. "Old or New, We Repair, Adjust and Alter (Texts)" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.3
-
Knowledge and Rule-Based Diacritic Restoration in Serbian
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).... the text using lemma sequences. Fig. 3 shows the term coverage of news text ”Kudrin’s experts named the main demographic threats for Russia”4), according to the Security thesaurus. Fig. 4 shows the coverage of matching the same text with RuThes text entries; • Disambiguation of ambiguous text entries ...
... etrieval and NLP applications (Loukachevitch and Dobrov, 2014). RuThes was successfully evaluated in text summarization (Mani et al., 2002), text clustering (Loukachevitch et al., 2017), text categorization (Loukachevitch and Dobrov, 2014), detecting Russian paraphrases (Loukachevitch et al., 2017) ...
... Economy dependence, Energy dependence, Import substitution, Imported goods, and Import. The lower right form shows text entries for a related concept. The low right form of Fig. 1 describes text entries of Import substitution concept. Fig. 2 shows a fragment from the Security thesaurus. The visible list ...Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51
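The dictionary-based candidate step mentioned in the abstract of this entry can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `VARIANTS` covers only single letters (the digraph "dj" for "đ" is left out for brevity), `LEXICON` is a toy stand-in for the morphological e-dictionaries, its frequencies are made up, and picking the most frequent candidate is a simplification, not the corpus and local-grammar decision procedure the paper describes.

```python
from itertools import product

# Degraded Latin letter -> letters it may stand for in standard Serbian Latin.
# Assumption: lowercase input only; the digraph "dj" (for "đ") is not handled.
VARIANTS = {"c": "cčć", "s": "sš", "z": "zž"}

# Hypothetical toy lexicon with made-up frequencies (illustration only).
LEXICON = {"kuca": 5, "kuća": 120, "sto": 80, "što": 300, "reč": 90}


def candidates(word):
    """Return every lexicon word that the degraded form could stand for."""
    options = [VARIANTS.get(ch, ch) for ch in word]
    forms = ("".join(combo) for combo in product(*options))
    return sorted(form for form in forms if form in LEXICON)


def restore(word):
    """Pick the most frequent candidate; leave unknown words unchanged."""
    cands = candidates(word)
    if not cands:
        return word
    return max(cands, key=lambda form: LEXICON[form])
```

With this toy lexicon, `restore("rec")` has a single candidate, while `restore("kuca")` is ambiguous and is resolved by frequency alone; a realistic system would use context at exactly that point.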
-
Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis
This paper presents a model that enables the collection, preparation, metadata description, management and exploitation, including full-text search, of documents from the domain of criminalistics written in Serbian. The proposed approach is applied on a web portal that gathers various texts from the journal of the Academy of Criminalistic and Police Studies, the Criminal Code of Serbia, the "Tara" and "Reiss" conferences, as well as from some doctoral dissertations related to this field of research. After text processing, a corpus containing over 5500 pages of plain text is created and ...... research. After text processing, a corpus containing over 5500 pages of plain text is created and prepared for publication as an online resource for full text search using Omeka, an open source content management system for on line digital library development. Search capabilities, both full text and metadata ...
... research. The text that is not in Serbian language was removed, as well as tables, figures, references and links, as usual preparation for corpus processing. After this preparation, the text collection contained 5,500 pages of plain text, in A4 format, which was used for further text analysis and ...
... Forensic Text Analysis Dalibor Vorkapić, Aleksandra Tomašević, Miljana Mladenović, Ranka Stanković, Nikola Vulović Дигитални репозиторијум Рударско-геолошког факултета Универзитета у Београду [ДР РГФ] Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis ...Dalibor Vorkapić, Aleksandra Tomašević, Miljana Mladenović, Ranka Stanković, Nikola Vulović. "Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis" in International Scientific Conference “Archibald Reiss Days” Thematic Conference Proceedings Of International Significance, Belgrade, 7-9 November 2017, Academy Of Criminalistic And Police Studies Belgrade (2017)
-
Keyword-Based Search on Bilingual Digital Libraries
This paper outlines the main features of Biblisha, a tool that offers various possibilities of enhancing queries submitted to large collections of aligned parallel text residing in a bilingual digital library. Biblisha supports keyword queries as an intuitive way of specifying information needs. The keyword queries initiated, in Serbian or English, can be expanded semantically, morphologically and into the other language, using different supporting monolingual and bilingual resources. Terminological and lexical resources are of various types, such as wordnets, electronic ...Ranka Stanković, Cvetana Krstev, Duško Vitas, Nikola Vulović, Olivera Kitanović. "Keyword-Based Search on Bilingual Digital Libraries" in Semantic Keyword-Based Search on Structured Data Sources - Second COST Action IC1302 International KEYSTONE Conference, IKC 2016, Springer (2017). https://doi.org/10.1007/978-3-319-53640-8_10
-
On the compatibility of lexical resources for NooJ
Lexical resources for many languages are provided for the NooJ linguistic development environment. Meta-data descriptions of morphosyntactic and semantic properties of these languages and their resources are a mandatory part of each language module. In this paper we analyze how well the meta-data actually describe resources for a chosen subset of languages and to what extent are they compatible across languages to support multilingual processing. We show that there is place for improvement in both directions.... (Obradović et al 2008), which can handle aligned texts in various formats (TEI, TMX, html, Vanilla). During the alignment process, the texts were segmented in such a way as to establish a one-to-one correspondence between the aligned segments and the original text in French. An example follows, showing ...
... http://www.meta-net.eu/projects/cesar/ texts of Jules Verne’s novel “Around the world in eighty days” in the same languages was performed. These seven languages were selected due to the fact that both NooJ resources and aligned versions of this novel were available for them. ...
... character strings: kilometer, meter, foot, feet, pond, mile, or as patterns, . An example of aligned concordances obtained by the application of these three graphs on the text is: En: … embraces fourteen hundred thousand square miles, upon which is spread unequally a population ... Ranka Stanković, Miloš Utvić, Duško Vitas, Cvetana Krstev, Ivan Obradović. "On the compatibility of lexical resources for NooJ" in Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the 2011 International Nooj Conference, Cambridge Scholars Publishing (2012): 96-108
-
Advancing Sentiment Analysis in Serbian Literature: A Zero and Few-Shot Learning Approach Using the Mistral Model
This study presents a sentiment analysis of old Serbian novels from the period 1840-1920, using the Mistral large language model (LLM) with zero-shot and few-shot learning techniques. The main approach innovates by devising research prompts that include text with classification instructions, both without examples and with a few examples, enabling the language model to classify sentiment into positive, negative, or objective categories. This methodology aims to simplify sentiment analysis by constraining responses, thereby increasing precision ...Milica Ikonić Nešić, Saša Petalinkar, Mihailo Škorić, Ranka Stanković, Biljana Rujević. "Advancing Sentiment Analysis in Serbian Literature: A Zero and Few-Shot Learning Approach Using the Mistral Model" in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Sofia, Bulgaria, 9-10 September 2024, LREC | COLING (2024)
-
Keyword Extraction from Parallel Abstracts of Scientific Publications
... During the period of 2004–2012, the journal published 55 papers bilingually, in Serbian and in English. These papers are available online as aligned parallel text in the Biblisha1 digital library, as well as separate documents. The Biblisha digital library contains scientific publications from other journals ...
... Scientific Publications 47 2.2 Text Preprocessing Tools Serbian is a highly inflectional Slavic language. Although we use the keyword extraction method designed with light or no linguistic knowledge, some text pre-processing is needed and includes the conversion of the input text to lowercase, the removal ...
... previous research, all the texts were joined and the entire collection was treated as a single text, while for the research presented in this paper, the text processing and analysis is performed per each text document in the collection. SBKE method does not include calculation of C-Value, T-Score, LLR ...Slobodan Beliga, Olivera Kitanović, Ranka Stanković, Sanda Martinčić-Ipšić. "Keyword Extraction from Parallel Abstracts of Scientific Publications" in Semantic Keyword-Based Search on Structured Data Sources - Third International KEYSTONE Conference, IKC 2017 Gdańsk, Poland, September 11–12, 2017 Revised Selected Papers and COST Action IC1302 Reports, Springer (2017)
-
Towards translation of educational resources using GIZA++
... Tool is based on collection of aligned sentence pairs in the form of Translation Memory, which facilitates and speeds up the translator's work. Main key functions of a CAT tool that speed up and improve translation are: [11] A CAT tool segments the source text in segments, usually sentences ...
... Preparation For our research we used five text collections, three of them being scientific journals and two resources produced within international projects. Total number of documents is 299 in English and the same number in Serbian, while the total of aligned sentences is 67,206. Haddow et al. [16] ...
... in Parallel Corpus Search Tools. In LREC (pp. 3172-3178), 2014 [19] I. Obradović, “A Method for Extracting Translational Equivalents from Aligned Text”, “Methods and applications of quantitative linguistics”: selected papers of the 8th International Conference on quantitative linguistics (QUALICO) ...Ivan Obradović, Dalibor Vorkapić, Ranka Stanković, Nikola Vulović, Miladin Kotorčević. "Towards translation of educational resources using GIZA++" in The Seventh International Conference on e-Learning (eLearning-2016), September 2016, Belgrade : Metropolitan University (2016)
-
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ...... speech tagging and information extraction, question answering, text summarization, collocations and information retrieval, sentiment analysis and semantics, discourse, machine translation, regular expressions, language models, text classification, and name entity recognition. All of them combine ...
... them. Text analyses can be performed at the levels of strings, morphology, and syntax. Some of the functions are: developing and applying electronic dictionaries of simple words and multi-word units; pattern matching with queries in form of regular expressions and graphs; text tra ...
... are introduced as well as operations on them. 11. Organization of graphs in cascades enables complex text transformation. Each graph in a cascade is a transducer that transforms a text. A graph that follows works on this transformed input. A full Named Entity Recognition System for Serbian ...Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2015)
-
A bilingual digital library for academic and entrepreneurial knowledge management
A generic knowledge management process of organization, storage and retrieval of knowledge can suitably be fitted in a digital library. In the digital and knowledge age digital libraries can be used in knowledge management to handle intellectual assets and support knowledge creation. A multilingual digital library either stores content in more than one language or provides multilingual query access to monolingual content. In Serbia 18 of 308 scientific journals regularly published are bi-lingual, with papers simultaneously being in English ...... search and the browsing of aligned bilingual text collections. Design/methodology/approach – The approach to the development of the presented digital library was to store its content in a NoSQL-database, with a web tool to enable the use of rich information in the stored text collections. The library ...
... The result of the metadata search is presented as a list of documents matching the metadata query with links to full-text articles in a PDF format, as well as to a TMX aligned text in an HTML format. Figure 5 shows the result of the previous query for the selected language (English), but selection ...
... documents are provided with the usual metadata (article's author(s), publication date, title, etc.) and are aligned at the sentence level. Besides searching by metadata, Bibliša offers a full-text search by keywords of the user’s choice. A user’s original query, which can be issued in each of the two ...Ranka Stanković, Cvetana Krstev, Biljana Lazić, Dalibor Vorkapić. "A bilingual digital library for academic and entrepreneurial knowledge management" in Proceeding of 10th International Forum on Knowledge Asset Dynamics — IFKAD 2015: Culture, Innovation and Entrepreneurship: connecting the knowledge dots, Bari, Italy, 10-12 June 2015, Bari : IFKAD (2015)
-
The Dictionary of the Serbian Academy: from the Text to the Lexical Database
In this paper we discuss the project of digitization of the Dictionary of the Serbo-Croatian Standard and Vernacular Language. Scanning and character recognition were a particular challenge, since various non-standard character set encoding was used in the course of the almost 60-year long production of the dictionary. The first aim of the project was to formalize the micro-structure of the dictionary articles in order to parse the digitized text of and transform it into structured data stored in relational lexical database. This approach ...... Database 3.1 Formalization of the structure of dictionary articles The first phase of the conversion of the DSA from the text form (unstructured text) into the lexical base (structured text) consisted of a thorough analysis of formatting conventions that were used for typesetting dictionary entries as well ...
... the Serbian Academy: from the Text to the Lexical Database Ranka Stanković, Rada Stijović, Duško Vitas, Cvetana Krstev, Olga Sabo Дигитални репозиторијум Рударско-геолошког факултета Универзитета у Београду [ДР РГФ] The Dictionary of the Serbian Academy: from the Text to the Lexical Database | Ranka ...
... of the Oxford English Dictionary (Berg et al., 1988), the transformation from an unstructured to a structured text was recognized as the main task of such endeavors. To do this, the text of a dictionary has to be parsed and the structure of the articles has to be formalized. For the representation ...Ranka Stanković, Rada Stijović, Duško Vitas, Cvetana Krstev, Olga Sabo. "The Dictionary of the Serbian Academy: from the Text to the Lexical Database" in Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts, Ljubljana : Ljubljana University Press, Faculty of Arts (2018)
-
It-Sr-NER: Web Services for Recognizing and Linking Named Entities in Text and Displaying Them on a Web Map
The paper will present the results of the project “It-Sr-NER: Web services for named entities recognition, linking and mapping,” in which teams from the University of Turin and the Society for Language Resources and Technologies JeRTeh participated, and whose goal was the development of the It-Sr-NER web service for named entity annotations in the text and displaying them on the map. Named entities in these services are names of persons, places, organizations, demonyms (ethnicities), events and works of art. Olja Perišić, Ranka Stanković, Milica Ikonić Nešić, Mihailo Škorić. "It-Sr-NER: Web Services for Recognizing and Linking Named Entities in Text and Displaying Them on a Web Map" in Infotheca, Belgrade : Faculty of Philology, University of Belgrade (2023). https://doi.org/10.18485/infotheca.2023.23.1.3
-
Frequency and Length of Syllables in Serbian
Marija Radojičić, Biljana Lazić, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Ján Mačutek, Lívia Leššová (2019) Basic analyses of several properties of syllables (the rank-frequency distribution, the distribution of length, and the relation between length and frequency) in Serbian are presented. The syllabification algorithm used combines the maximum onset principle and the sonority hierarchy. Results indicate that syllables behave similarly to words as far as mathematical models are concerned, but values of parameters in models for syllables are quite different from those for words.... work with a complete novel consisting of 110104 words, it is necessarily a text mixture rather than a homogeneous text (Popescu et al., 2009, set an upper limit - admittedly an arbitrary one - of 10000 words for a homogeneous text, see p.3). Lower language units, such as graphemes, phonemes, or syllables ...
... between spaces. We are aware of problems related to this definition, but it facilitates easy automatic text processing (see e.g. a discussion on this topic in Antić et al., 2006, pp. 118-121). The text under analysis (see Section 3) is pre-processed, so that it does not contain any zero-syllable words ...Marija Radojičić, Biljana Lazić, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Ján Mačutek, Lívia Leššová. "Frequency and Length of Syllables in Serbian" in Glottometrics (2019)
-
From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)
In this paper we present the wikification of the ELTeC (European Literary Text Collection), developed within the COST Action “Distant Reading for European Literary History” (CA16204). ELTeC is a multilingual corpus of novels written in the time period 1840-1920, built to apply distant reading methods and tools to explore the European literary history. We present the pipeline that led to the production of the linked dataset, the novels’ metadata retrieval and named entity recognition, transformation, mapping and Wikidata population, ...Milica Ikonić Nešić, Ranka Stanković, Christof Schöch and Mihailo Škorić. "From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)" in Proceedings of The 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data
This paper describes a case study on generating linked data from annotated text corpora using the NLP Interchange Format (NIF). A subset of the ELTeC corpus, consisting of 900 novels from the period 1840-1920 in 9 European languages, served as the basis for this research. The annotated version of the novels, in the so-called TEI level-2 format, was transformed into NIF, an RDF/OWL-based format that aims to achieve interoperability between natural language processing tools, language resources and ...Ranka Stanković, Christian Chiarcos, Miloš Utvić, Olivera Kitanović. "Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data" in LDK 2023 – 4th Conference on Language, Data and Knowledge, 12-15 September in Vienna, Austria, Lisbon : NOVA FCSH - CLUNL (2023). https://doi.org/10.34619/srmk-injj
-
Indexing of textual databases based on lexical resources: A case study for Serbian
In this paper we describe an approach to improvement of information retrieval results for large textual databases by pre-indexing documents using bag-of-words and Named Entity Recognition. The approach was applied on a database of geological projects financed by the Republic of Serbia in the last half century. Each document within this database is described by metadata, consisting of several fields such as title, domain, keywords, abstract, geographical location and the like. A bag of words was produced from these ...... However, a large number of other forms cannot be found by scanning the text, for example, the form zlata (genitive singular) cannot be aligned with the query keyword key zlato (nominative singular). The disadvantage of the system based on text scanning which affects the precision is especially visible when ...
... improved ranking uses tf idf measure that is based on frequencies of words allocated to the text, text length, and the document frequency [8]. Indexing is performed in following steps: 1. Generating a Di text from several records and fields in the database related to a particular document or project; ...
... Query Language) form. The query generated in such a way searches the text of the subset of attributes in the database that correspond to the selected criteria of search. 4 The Improved Solution One of the problems of full text search in Serbian is its rich morphology, where the keyword for search ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović. "Indexing of textual databases based on lexical resources: A case study for Serbian" in Semantic Keyword-based Search on Structured Data Sources : First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers, Springer (2015). https://doi.org/10.1007/978-3-319-27932-9_15
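The morphology problem raised in the snippets of this entry (the inflected form zlata not matching the keyword zlato) and tf-idf ranking can be illustrated together in a minimal sketch. All names and data here are assumptions made for illustration: `LEMMAS` is a toy stand-in for the Serbian morphological e-dictionaries, `DOCS` is invented sample text, and the plain tf × log(N/df) weighting is one common tf-idf variant, not necessarily the formula used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy lemma dictionary: inflected form -> lemma (illustration only).
LEMMAS = {"zlata": "zlato", "zlatu": "zlato", "rude": "ruda", "rudu": "ruda"}

# Invented sample documents standing in for database records.
DOCS = [
    "istraživanje zlata i rude",
    "geološka karta Srbije",
    "nalazišta rude bakra",
]


def bag_of_lemmas(text):
    """Lowercase, split on whitespace, and map each token to its lemma."""
    return Counter(LEMMAS.get(tok, tok) for tok in text.lower().split())


BAGS = [bag_of_lemmas(doc) for doc in DOCS]


def tf_idf(term, doc_bag, all_bags):
    """Term frequency normalised by text length, times log inverse document frequency."""
    total = sum(doc_bag.values())
    df = sum(1 for bag in all_bags if term in bag)
    if total == 0 or df == 0:
        return 0.0
    return (doc_bag[term] / total) * math.log(len(all_bags) / df)


def search(keyword):
    """Return indices of matching documents, best score first."""
    lemma = LEMMAS.get(keyword.lower(), keyword.lower())
    scored = [(tf_idf(lemma, bag, BAGS), i) for i, bag in enumerate(BAGS)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]
```

Because both documents and queries are reduced to lemmas before matching, `search("zlato")` and `search("zlata")` return the same document, which plain text scanning would miss.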