Search | arXiv e-print repository

How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions

Authors: Bojana Bašaragin, Adela Ljajić, Darija Medvecki, Lorenzo Cassano, Miloš Košprdić, Nikola Milošević

Abstract: Large language models (LLMs) have recently become the leading source of answers for users' questions online. Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge. This is especially true for sensitive domains such as biomedicine, where there is a higher need for factually correct answers. This paper introduces a biomedical retrieval-augme… ▽ More Large language models (LLMs) have recently become the leading source of answers for users' questions online. Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge. This is especially true for sensitive domains such as biomedicine, where there is a higher need for factually correct answers. This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses. The system is based on a fine-tuned LLM for the referenced question-answering, where retrieved relevant abstracts from PubMed are passed to LLM's context as input through a prompt. Its output is an answer based on PubMed abstracts, where each statement is referenced accordingly, allowing the users to verify the answer. Our retrieval system achieves an absolute improvement of 23% compared to the PubMed search engine. Based on the manual evaluation on a small sample, our fine-tuned LLM component achieves comparable results to GPT-4 Turbo in referencing relevant abstracts. We make the dataset used to fine-tune the models and the fine-tuned models based on Mistral-7B-instruct-v0.1 and v0.2 publicly available. △ Less

Submitted 6 July, 2024; originally announced July 2024.

Comments: Accepted at BioNLP Workshop 2024, colocated with ACL 2024

arXiv:2406.03845 [pdf, other]

Open Problem: Active Representation Learning

Authors: Nikola Milosevic, Gesine Müller, Jan Huisken, Nico Scherf

Abstract: In this work, we introduce the concept of Active Representation Learning, a novel class of problems that intertwines exploration and representation learning within partially observable environments. We extend ideas from Active Simultaneous Localization and Map** (active SLAM), and translate them to scientific discovery problems, exemplified by adaptive microscopy. We explore the need for a frame… ▽ More In this work, we introduce the concept of Active Representation Learning, a novel class of problems that intertwines exploration and representation learning within partially observable environments. We extend ideas from Active Simultaneous Localization and Map** (active SLAM), and translate them to scientific discovery problems, exemplified by adaptive microscopy. We explore the need for a framework that derives exploration skills from representations that are in some sense actionable, aiming to enhance the efficiency and effectiveness of data collection and model building in the natural sciences. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2402.18589 [pdf, other]

Verif.ai: Towards an Open-Source Scientific Generative Question-Answering System with Referenced and Verifiable Answers

Authors: Miloš Košprdić, Adela Ljajić, Bojana Bašaragin, Darija Medvecki, Nikola Milošević

Abstract: In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and genera… ▽ More In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and generating answers with references to the papers from which the claim was derived, and (3) a verification engine that cross-checks the generated claim and the abstract or paper from which the claim was derived, verifying whether there may have been any hallucinations in generating the claim. We are reinforcing the generative model by providing the abstract in context, but in addition, an independent set of methods and models are verifying the answer and checking for hallucinations. Therefore, we believe that by using our method, we can make scientists more productive, while building trust in the use of generative language models in scientific environments, where hallucinations and misinformation cannot be tolerated. △ Less

Submitted 9 February, 2024; originally announced February 2024.

Comments: Accepted as a short paper at The Sixteenth International Conference on Evolving Internet (INTERNET 2024)

Journal ref: The Sixteenth International Conference on Evolving Internet (INTERNET 2024)

arXiv:2402.03067 [pdf]

doi 10.1007/978-3-031-50755-7_16

Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian

Authors: Darija Medvecki, Bojana Bašaragin, Adela Ljajić, Nikola Milošević

Abstract: This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologi-cally rich language. We applied BERTopic with three multilingual embed-ding models on two levels of text preprocessing (partial and full) to evalu-ate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and… ▽ More This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologi-cally rich language. We applied BERTopic with three multilingual embed-ding models on two levels of text preprocessing (partial and full) to evalu-ate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that with adequate parameter setting, BERTopic can yield informative topics even when applied to partially pre-processed short text. When the same parameters are applied in both prepro-cessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper can be significant for re-searchers working with other morphologically rich low-resource languages and short text. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Journal ref: Trajanovic, M., Filipovic, N., Zdravkovic, M. (eds) Disruptive Information Technologies for a Smart Society. ICIST 2023. Lecture Notes in Networks and Systems, vol 872. Springer, Cham

arXiv:2312.03736 [pdf]

doi 10.1016/j.artmed.2024.102845

De-identification of clinical free text using natural language processing: A systematic review of current approaches

Authors: Aleksandar Kovačević, Bojana Bašaragin, Nikola Milošević, Goran Nenadić

Abstract: Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in au… ▽ More Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems. In addition, we aim to identify challenges and potential research opportunities in this field. Methods: A systematic search in PubMed, Web of Science and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. Results: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. Majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora. △ Less

Submitted 28 November, 2023; originally announced December 2023.

Comments: Submitted to Artificial Intelligence in Medicine

Journal ref: Artificial Intelligence in Medicine, Volume 151, May 2024

arXiv:2305.04928 [pdf, other]

From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts

Authors: Miloš Košprdić, Nikola Prodanović, Adela Ljajić, Bojana Bašaragin, Nikola Milošević

Abstract: Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomed… ▽ More Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large amount of datasets and biomedical entities, which allow the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or limited number of examples, outperforming previous transformer-based methods, and being comparable to GPT3-based models using models with over 1000 times fewer parameters. We make models and developed code publicly available. △ Less

Submitted 25 January, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

Comments: Collaboration between Bayer Pharma R&D and Serbian Institute for Artificial Intelligence Research and Development

arXiv:2304.05468 [pdf, ps, other]

A Survey of Resources and Methods for Natural Language Processing of Serbian Language

Authors: Ulfeta A. Marovac, Aldina R. Avdić, Nikola Lj. Milošević

Abstract: The Serbian language is a Slavic language spoken by over 12 million speakers and well understood by over 15 million people. In the area of natural language processing, it can be considered a low-resourced language. Also, Serbian is considered a high-inflectional language. The combination of many word inflections and low availability of language resources makes natural language processing of Serbia… ▽ More The Serbian language is a Slavic language spoken by over 12 million speakers and well understood by over 15 million people. In the area of natural language processing, it can be considered a low-resourced language. Also, Serbian is considered a high-inflectional language. The combination of many word inflections and low availability of language resources makes natural language processing of Serbian challenging. Nevertheless, over the past three decades, there have been a number of initiatives to develop resources and methods for natural language processing of Serbian, ranging from develo** a corpus of free text from books and the internet, annotated corpora for classification and named entity recognition tasks to various methods and models performing these tasks. In this paper, we review the initiatives, resources, methods, and their availability. △ Less

Submitted 11 April, 2023; originally announced April 2023.

Comments: 43 pages, submitted to Artificial Intelligence Review Journal

ACM Class: A.1

arXiv:2205.11269 [pdf, other]

Dynamic Split Computing for Efficient Deep Edge Intelligence

Authors: Arian Bakhtiarnia, Nemanja Milošević, Qi Zhang, Dragana Bajović, Alexandros Iosifidis

Abstract: Deploying deep neural networks (DNNs) on IoT and mobile devices is a challenging task due to their limited computational resources. Thus, demanding tasks are often entirely offloaded to edge servers which can accelerate inference, however, it also causes communication cost and evokes privacy concerns. In addition, this approach leaves the computational capacity of end devices unused. Split computi… ▽ More Deploying deep neural networks (DNNs) on IoT and mobile devices is a challenging task due to their limited computational resources. Thus, demanding tasks are often entirely offloaded to edge servers which can accelerate inference, however, it also causes communication cost and evokes privacy concerns. In addition, this approach leaves the computational capacity of end devices unused. Split computing is a paradigm where a DNN is split into two sections; the first section is executed on the end device, and the output is transmitted to the edge server where the final section is executed. Here, we introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel. By using natural bottlenecks that already exist in modern DNN architectures, dynamic split computing avoids retraining and hyperparameter optimization, and does not have any negative impact on the final accuracy of DNNs. Through extensive experiments, we show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time. △ Less

Submitted 17 June, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Accepted by the 2022 International Conference on Machine Learning (ICML 2022) DyNN Workshop

arXiv:2204.02593 [pdf, other]

Nonlinear gradient map**s and stochastic optimization: A general framework with applications to heavy-tail noise

Authors: Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, Soummya Kar, Nemanja Milosevic, Dusan Stamenkovic

Abstract: We introduce a general framework for nonlinear stochastic gradient descent (SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, like clipped, normalized, signed or quantized gradient, but we also consider novel nonlinearity choices. We establish for the considered class of methods strong convergence guarantees assum… ▽ More We introduce a general framework for nonlinear stochastic gradient descent (SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, like clipped, normalized, signed or quantized gradient, but we also consider novel nonlinearity choices. We establish for the considered class of methods strong convergence guarantees assuming a strongly convex cost function with Lipschitz continuous gradients under very general assumptions on the gradient noise. Most notably, we show that, for a nonlinearity with bounded outputs and for the gradient noise that may not have finite moments of order greater than one, the nonlinear SGD's mean squared error (MSE), or equivalently, the expected cost function's optimality gap, converges to zero at rate~$O(1/t^ζ)$, $ζ\in (0,1)$. In contrast, for the same noise setting, the linear SGD generates a sequence with unbounded variances. Furthermore, for the nonlinearities that can be decoupled component wise, like, e.g., sign gradient or component-wise clip**, we show that the nonlinear SGD asymptotically (locally) achieves a $O(1/t)$ rate in the weak convergence sense and explicitly quantify the corresponding asymptotic variance. Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state of the art alternatives on real data sets with heavy tail noises. △ Less

Submitted 6 April, 2022; originally announced April 2022.

Comments: Submitted for publication Nov 2021

arXiv:2201.01647 [pdf, other]

doi 10.1016/j.websem.2022.100756

Comparison of biomedical relationship extraction methods and models for knowledge graph creation

Authors: Nikola Milosevic, Wolfgang Thielemann

Abstract: Biomedical research is growing at such an exponential pace that scientists, researchers, and practitioners are no more able to cope with the amount of published literature in the domain. The knowledge presented in the literature needs to be systematized in such a way that claims and hypotheses can be easily found, accessed, and validated. Knowledge graphs can provide such a framework for semantic… ▽ More Biomedical research is growing at such an exponential pace that scientists, researchers, and practitioners are no more able to cope with the amount of published literature in the domain. The knowledge presented in the literature needs to be systematized in such a way that claims and hypotheses can be easily found, accessed, and validated. Knowledge graphs can provide such a framework for semantic knowledge representation from literature. However, in order to build a knowledge graph, it is necessary to extract knowledge as relationships between biomedical entities and normalize both entities and relationship types. In this paper, we present and compare few rule-based and machine learning-based (Naive Bayes, Random Forests as examples of traditional machine learning methods and DistilBERT, PubMedBERT, T5 and SciFive-based models as examples of modern deep learning transformers) methods for scalable relationship extraction from biomedical literature, and for the integration into the knowledge graphs. We examine how resilient are these various methods to unbalanced and fairly small datasets. Our experiments show that transformer-based models handle well both small (due to pre-training on a large dataset) and unbalanced datasets. The best performing model was the PubMedBERT-based model fine-tuned on balanced data, with a reported F1-score of 0.92. DistilBERT-based model followed with F1-score of 0.89, performing faster and with lower resource requirements. BERT-based models performed better then T5-based generative models. △ Less

Submitted 7 August, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

Comments: Paper submitted to Journal of Semantic Web

ACM Class: E.2; I.7

Journal ref: Nikola Milosevic, Wolfgang Thielemann, Comparison of biomedical relationship extraction methods and models for knowledge graph creation, Journal of Web Semantics, 2022, 100756, ISSN 1570-8268,

arXiv:2005.11687 [pdf, other]

MASK: A flexible framework to facilitate de-identification of clinical texts

Authors: Nikola Milosevic, Gangamma Kalappa, Hesam Dadafarin, Mahmoud Azimaee, Goran Nenadic

Abstract: Medical health records and clinical summaries contain a vast amount of important information in textual form that can help advancing research on treatments, drugs and public health. However, the majority of these information is not shared because they contain private information about patients, their families, or medical staff treating them. Regulations such as HIPPA in the US, PHIPPA in Canada an… ▽ More Medical health records and clinical summaries contain a vast amount of important information in textual form that can help advancing research on treatments, drugs and public health. However, the majority of these information is not shared because they contain private information about patients, their families, or medical staff treating them. Regulations such as HIPPA in the US, PHIPPA in Canada and GDPR regulate the protection, processing and distribution of this information. In case this information is de-identified and personal information are replaced or redacted, they could be distributed to the research community. In this paper, we present MASK, a software package that is designed to perform the de-identification task. The software is able to perform named entity recognition using some of the state-of-the-art techniques and then mask or redact recognized entities. The user is able to select named entity recognition algorithm (currently implemented are two versions of CRF-based techniques and BiLSTM-based neural network with pre-trained GLoVe and ELMo embedding) and masking algorithm (e.g. shift dates, replace names/locations, totally redact entity). △ Less

Submitted 9 October, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

arXiv:1910.10660 [pdf]

Deep learning guided Android malware and anomaly detection

Authors: Nikola Milosevic, Junfan Huang

Abstract: In the past decade, the cyber-crime related to mobile devices has increased. Mobile devices, especially the ones running on Android operating system are particularly interesting to malware creators, as the users often keep the biggest amount of personal information on their mobile devices, such as their contacts, social media profiles, emails, and bank accounts. Both dynamic and static malware ana… ▽ More In the past decade, the cyber-crime related to mobile devices has increased. Mobile devices, especially the ones running on Android operating system are particularly interesting to malware creators, as the users often keep the biggest amount of personal information on their mobile devices, such as their contacts, social media profiles, emails, and bank accounts. Both dynamic and static malware analysis is necessary to prevent and detect malware, as both techniques have their benefits and shortcomings. In this paper, we propose a deep learning technique that relies on LSTM and encoder-decoder neural network architectures for dynamic malware analysis based on CPU, memory and battery usage. The proposed system is able to detect and notify users about anomalies in system that is likely consequence of malware behaviour. The method was implemented as a part of OWASP Seraphimdroids anti-malware mechanism and notifies users about anomalies on their devices. The method proved to perform with an F1-score of 79.2%. △ Less

Submitted 23 October, 2019; originally announced October 2019.

Comments: First (draft) version of the paper

arXiv:1909.10390 [pdf]

GNTeam at 2018 n2c2: Feature-augmented BiLSTM-CRF for drug-related entity recognition in hospital discharge summaries

Authors: Maksim Belousov, Nikola Milosevic, Ghada Alfattni, Haifa Alrdahi, Goran Nenadic

Abstract: Monitoring the administration of drugs and adverse drug reactions are key parts of pharmacovigilance. In this paper, we explore the extraction of drug mentions and drug-related information (reason for taking a drug, route, frequency, dosage, strength, form, duration, and adverse events) from hospital discharge summaries through deep learning that relies on various representations for clinical name… ▽ More Monitoring the administration of drugs and adverse drug reactions are key parts of pharmacovigilance. In this paper, we explore the extraction of drug mentions and drug-related information (reason for taking a drug, route, frequency, dosage, strength, form, duration, and adverse events) from hospital discharge summaries through deep learning that relies on various representations for clinical named entity recognition. This work was officially part of the 2018 n2c2 shared task, and we use the data supplied as part of the task. We developed two deep learning architecture based on recurrent neural networks and pre-trained language models. We also explore the effect of augmenting word representations with semantic features for clinical named entity recognition. Our feature-augmented BiLSTM-CRF model performed with F1-score of 92.67% and ranked 4th for entity extraction sub-task among submitted systems to n2c2 challenge. The recurrent neural networks that use the pre-trained domain-specific word embeddings and a CRF layer for label optimization perform drug, adverse event and related entities extraction with micro-averaged F1-score of over 91%. The augmentation of word vectors with semantic features extracted using available clinical NLP toolkits can further improve the performance. Word embeddings that are pre-trained on a large unannotated corpus of relevant documents and further fine-tuned to the task perform rather well. However, the augmentation of word embeddings with semantic features can help improve the performance (primarily by boosting precision) of drug-related named entity recognition from electronic health records. △ Less

Submitted 23 September, 2019; originally announced September 2019.

arXiv:1905.11716 [pdf, other]

Extracting adverse drug reactions and their context using sequence labelling ensembles in TAC2017

Authors: Maksim Belousov, Nikola Milosevic, William Dixon, Goran Nenadic

Abstract: Adverse drug reactions (ADRs) are unwanted or harmful effects experienced after the administration of a certain drug or a combination of drugs, presenting a challenge for drug development and drug administration. In this paper, we present a set of taggers for extracting adverse drug reactions and related entities, including factors, severity, negations, drug class and animal. The systems used a mi… ▽ More Adverse drug reactions (ADRs) are unwanted or harmful effects experienced after the administration of a certain drug or a combination of drugs, presenting a challenge for drug development and drug administration. In this paper, we present a set of taggers for extracting adverse drug reactions and related entities, including factors, severity, negations, drug class and animal. The systems used a mix of rule-based, machine learning (CRF) and deep learning (BLSTM with word2vec embeddings) methodologies in order to annotate the data. The systems were submitted to adverse drug reaction shared task, organised during Text Analytics Conference in 2017 by National Institute for Standards and Technology, archiving F1-scores of 76.00 and 75.61 respectively. △ Less

Submitted 28 May, 2019; originally announced May 2019.

Comments: Paper describing submission for TAC ADR shared task

Journal ref: Text Analytics Conference 2017

arXiv:1905.09086 [pdf, other]

From web crawled text to project descriptions: automatic summarizing of social innovation projects

Authors: Nikola Milosevic, Dimitar Marinov, Abdullah Gok, Goran Nenadic

Abstract: In the past decade, social innovation projects have gained the attention of policy makers, as they address important social issues in an innovative manner. A database of social innovation is an important source of information that can expand collaboration between social innovators, drive policy and serve as an important resource for research. Such a database needs to have projects described and su… ▽ More In the past decade, social innovation projects have gained the attention of policy makers, as they address important social issues in an innovative manner. A database of social innovation is an important source of information that can expand collaboration between social innovators, drive policy and serve as an important resource for research. Such a database needs to have projects described and summarized. In this paper, we propose and compare several methods (e.g. SVM-based, recurrent neural network based, ensambled) for describing projects based on the text that is available on project websites. We also address and propose a new metric for automated evaluation of summaries based on topic modelling. △ Less

Submitted 22 May, 2019; originally announced May 2019.

Comments: Keywords: Summarization, evaluation metrics, text mining, natural language processing, social innovation, SVM, neural networks Accepted for publication in Proceedings of 24th International Conference on Applications of Natural Language to Information Systems (NLDB2019)

Journal ref: Preceeding of 24th International Conference on Applications of Natural Language to Information Systems (NLDB2019)

arXiv:1902.10031 [pdf]

doi 10.1007/s10032-019-00317-0

A framework for information extraction from tables in biomedical literature

Authors: Nikola Milosevic, Cassie Gregson, Robert Hernandez, Goran Nenadic

Abstract: The scientific literature is growing exponentially, and professionals are no more able to cope with the current amount of publications. Text mining provided in the past methods to retrieve and extract information from text; however, most of these approaches ignored tables and figures. The research done in mining table data still does not have an integrated approach for mining that would consider a… ▽ More The scientific literature is growing exponentially, and professionals are no more able to cope with the current amount of publications. Text mining provided in the past methods to retrieve and extract information from text; however, most of these approaches ignored tables and figures. The research done in mining table data still does not have an integrated approach for mining that would consider all complexities and challenges of a table. Our research is examining the methods for extracting numerical (number of patients, age, gender distribution) and textual (adverse reactions) information from tables in the clinical literature. We present a requirement analysis template and an integral methodology for information extraction from tables in clinical domain that contains 7 steps: (1) table detection, (2) functional processing, (3) structural processing, (4) semantic tagging, (5) pragmatic processing, (6) cell selection and (7) syntactic processing and extraction. Our approach performed with the F-measure ranged between 82 and 92%, depending on the variable, task and its complexity. △ Less

Submitted 26 February, 2019; originally announced February 2019.

Comments: 24 pages

Journal ref: 2019, International Journal on Document Analysis and Recognition (IJDAR)

arXiv:1811.10422 [pdf]

Creating a contemporary corpus of similes in Serbian by using natural language processing

Authors: Nikola Milosevic, Goran Nenadic

Abstract: Simile is a figure of speech that compares two things through the use of connection words, but where comparison is not intended to be taken literally. They are often used in everyday communication, but they are also a part of linguistic cultural heritage. In this paper we present a methodology for semi-automated collection of similes from the World Wide Web using text mining and machine learning t… ▽ More Simile is a figure of speech that compares two things through the use of connection words, but where comparison is not intended to be taken literally. They are often used in everyday communication, but they are also a part of linguistic cultural heritage. In this paper we present a methodology for semi-automated collection of similes from the World Wide Web using text mining and machine learning techniques. We expanded an existing corpus by collecting 442 similes from the internet and adding them to the existing corpus collected by Vuk Stefanovic Karadzic that contained 333 similes. We, also, introduce crowdsourcing to the collection of figures of speech, which helped us to build corpus containing 787 unique similes. △ Less

Submitted 22 November, 2018; originally announced November 2018.

Comments: 15 pages, submitted to journal Slovo, however, later withdrawn to correct. Additional work was not done on it, so it is still waiting to be extended. Output of the system can be seen here: http://ezbirka.starisloveni.com/. arXiv admin note: text overlap with arXiv:1605.06319

arXiv:1605.06319 [pdf, other]

As Cool as a Cucumber: Towards a Corpus of Contemporary Similes in Serbian

Authors: Nikola Milosevic, Goran Nenadic

Abstract: Similes are natural language expressions used to compare unlikely things, where the comparison is not taken literally. They are often used in everyday communication and are an important part of cultural heritage. Having an up-to-date corpus of similes is challenging, as they are constantly coined and/or adapted to the contemporary times. In this paper we present a methodology for semi-automated co… ▽ More Similes are natural language expressions used to compare unlikely things, where the comparison is not taken literally. They are often used in everyday communication and are an important part of cultural heritage. Having an up-to-date corpus of similes is challenging, as they are constantly coined and/or adapted to the contemporary times. In this paper we present a methodology for semi-automated collection of similes from the world wide web using text mining techniques. We expanded an existing corpus of traditional similes (containing 333 similes) by collecting 446 additional expressions. We, also, explore how crowdsourcing can be used to extract and curate new similes. △ Less

Submitted 20 May, 2016; originally announced May 2016.

Comments: Phrase modelling, simile extraction, language resource building, crowdsourcing

arXiv:1603.00751 [pdf]

doi 10.1453/jel.v3i2.750

Equity forecast: Predicting long term stock price movement using machine learning

Authors: Nikola Milosevic

Abstract: Long term investment is one of the major investment strategies. However, calculating intrinsic value of some company and evaluating shares for long term investment is not easy, since analyst have to care about a large number of financial indicators and evaluate them in a right manner. So far, little help in predicting the direction of the company value over the longer period of time has been provi… ▽ More Long term investment is one of the major investment strategies. However, calculating intrinsic value of some company and evaluating shares for long term investment is not easy, since analyst have to care about a large number of financial indicators and evaluate them in a right manner. So far, little help in predicting the direction of the company value over the longer period of time has been provided from the machines. In this paper we present a machine learning aided approach to evaluate the equity's future price over the long time. Our method is able to correctly predict whether some company's value will be 10% higher or not over the period of one year in 76.5% of cases. △ Less

Submitted 22 November, 2018; v1 submitted 2 March, 2016; originally announced March 2016.

Comments: 9 pages, 3 tables, computational finance, algorithmic finance

Journal ref: Journal of Economics Library, 3(2), 2016, 288-294

arXiv:1602.00515 [pdf]

Marvin: Semantic annotation using multiple knowledge sources

Authors: Nikola Milosevic

Abstract: People are producing more written material then anytime in the history. The increase is so high that professionals from the various fields are no more able to cope with this amount of publications. Text mining tools can offer tools to help them and one of the tools that can aid information retrieval and information extraction is semantic text annotation. In this report we present Marvin, a text an… ▽ More People are producing more written material then anytime in the history. The increase is so high that professionals from the various fields are no more able to cope with this amount of publications. Text mining tools can offer tools to help them and one of the tools that can aid information retrieval and information extraction is semantic text annotation. In this report we present Marvin, a text annotator written in Java, which can be used as a command line tool and as a Java library. Marvin is able to annotate text using multiple sources, including WordNet, MetaMap, DBPedia and thesauri represented as SKOS. △ Less

Submitted 2 February, 2016; v1 submitted 1 February, 2016; originally announced February 2016.

Comments: 9 pages, 4 figures, keywords: Semantic annotation, text normalization, semantic web, linked data, information management, text mining, information extraction, data curation

ACM Class: D.3.2; K.2; H.2.4

arXiv:1302.5392 [pdf]

History of malware

Authors: Nikola Milošević

Abstract: In past three decades almost everything has changed in the field of malware and malware analysis. From malware created as proof of some security concept and malware created for financial gain to malware created to sabotage infrastructure. In this work we will focus on history and evolution of malware and describe most important malwares. In past three decades almost everything has changed in the field of malware and malware analysis. From malware created as proof of some security concept and malware created for financial gain to malware created to sabotage infrastructure. In this work we will focus on history and evolution of malware and describe most important malwares. △ Less

Submitted 16 January, 2014; v1 submitted 21 February, 2013; originally announced February 2013.

Comments: 11 pages, 8 figures describing history and evolution of PC malware from first PC malware to Stuxnet, DoQu and Flame. This article has been withdrawed due some errors in text and publication in the jurnal that asked to withdraw article from other sources

MSC Class: 68-03

arXiv:1209.4471 [pdf]

Stemmer for Serbian language

Authors: Nikola Milošević

Abstract: In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form; generally a written word form. In this work is presented suffix strip** stemmer for Serbian language, one of the highly inflectional languages. In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form; generally a written word form. In this work is presented suffix strip** stemmer for Serbian language, one of the highly inflectional languages. △ Less

Submitted 20 September, 2012; originally announced September 2012.

Comments: 16 pages, 8 figures, code included

Showing 1–22 of 22 results for author: Milosevic, N