Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution
Object
- Type
- Journal article
- Version
- published version
- Language
- English
- Creator
- Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, Maciej Eder
- Source
- Mathematics
- Publisher
- MDPI AG
- Publication date
- 2022
- volume
- 10
- issue
- 5
- doi
- 10.3390/math10050838
- issn
- 2227-7390
- Subject
- General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)
- Broader category
- M20
- Narrower category
- М21а
- Rights
- Open access
- License
- All rights reserved
- Format
- Abstract
- This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in 7 different languages, totaling 7051 unique 10,000-token chunks from 700 PoS- and lemma-annotated documents. We used these documents to produce four document embedding models using the Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these embeddings in the form of the average, product, minimum, maximum, and l2 norm of these document embedding matrices and tested them both including and excluding the mBERT-based document embeddings for each language. Finally, we trained several perceptrons on portions of the dataset in order to obtain adequate weights for a weighted combination approach. We tested standalone (two baselines) and composite embeddings for classification accuracy, precision, recall, and weighted-average and macro-averaged F1-score, and compared them with one another. For each language, most of our composition methods outperform the baselines (with a couple of methods outperforming all baselines for all languages), with or without the mBERT inputs, which were found to have no significant positive impact on the results of our methods.
- Media
- mathematics-10-00838.pdf
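The composition step named in the abstract (average, product, minimum, maximum, and l2 norm of several document embedding matrices) can be illustrated with a minimal sketch. This record does not give the paper's exact formulas, so the element-wise reading used here is an assumption, as are the function name `combine_embeddings` and the toy 2x2 matrices; it is not the authors' implementation.

```python
import numpy as np


def combine_embeddings(matrices):
    """Combine same-shaped document embedding matrices element-wise.

    Assumption: each matrix holds one model's document-to-document
    values, and all matrices share the same shape and document order.
    """
    stack = np.stack(matrices)  # shape: (number of models, n, n)
    return {
        "mean": stack.mean(axis=0),
        "product": stack.prod(axis=0),
        "min": stack.min(axis=0),
        "max": stack.max(axis=0),
        # l2 norm taken across the models for each matrix cell
        "l2": np.sqrt((stack ** 2).sum(axis=0)),
    }


# Hypothetical values for two models over two documents
word_based = np.array([[0.0, 3.0], [3.0, 0.0]])
lemma_based = np.array([[0.0, 4.0], [4.0, 0.0]])

combined = combine_embeddings([word_based, lemma_based])
print(combined["l2"])
```

A weighted combination, as in the perceptron-based approach the abstract mentions, would replace the plain mean with a weighted sum over the first axis; the weights themselves would have to come from training and are not part of this sketch.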
Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk, Maciej Eder. "Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution" in Mathematics, MDPI AG (2022). https://doi.org/10.3390/math10050838
This item was submitted on 7 March 2022 by [anonymous user] using the form “Рад у часопису” (Journal article) on the site “Радови” (Papers): https://dr.rgf.bg.ac.rs/s/repo