-
How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Authors:
Bojana Bašaragin,
Adela Ljajić,
Darija Medvecki,
Lorenzo Cassano,
Miloš Košprdić,
Nikola Milošević
Abstract:
Large language models (LLMs) have recently become the leading source of answers for users' questions online. Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge. This is especially true for sensitive domains such as biomedicine, where there is a higher need for factually correct answers. This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses. The system is based on an LLM fine-tuned for referenced question answering, where relevant abstracts retrieved from PubMed are passed to the LLM's context as input through a prompt. Its output is an answer based on PubMed abstracts, where each statement is referenced accordingly, allowing users to verify the answer. Our retrieval system achieves an absolute improvement of 23% compared to the PubMed search engine. Based on a manual evaluation of a small sample, our fine-tuned LLM component achieves results comparable to GPT-4 Turbo in referencing relevant abstracts. We make the dataset used to fine-tune the models and the fine-tuned models based on Mistral-7B-instruct-v0.1 and v0.2 publicly available.
Submitted 6 July, 2024;
originally announced July 2024.
-
Open Problem: Active Representation Learning
Authors:
Nikola Milosevic,
Gesine Müller,
Jan Huisken,
Nico Scherf
Abstract:
In this work, we introduce the concept of Active Representation Learning, a novel class of problems that intertwines exploration and representation learning within partially observable environments. We extend ideas from Active Simultaneous Localization and Mapping (active SLAM), and translate them to scientific discovery problems, exemplified by adaptive microscopy. We explore the need for a framework that derives exploration skills from representations that are in some sense actionable, aiming to enhance the efficiency and effectiveness of data collection and model building in the natural sciences.
Submitted 6 June, 2024;
originally announced June 2024.
-
Verif.ai: Towards an Open-Source Scientific Generative Question-Answering System with Referenced and Verifiable Answers
Authors:
Miloš Košprdić,
Adela Ljajić,
Bojana Bašaragin,
Darija Medvecki,
Nikola Milošević
Abstract:
In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and generating answers with references to the papers from which the claim was derived, and (3) a verification engine that cross-checks the generated claim and the abstract or paper from which the claim was derived, verifying whether there may have been any hallucinations in generating the claim. We are reinforcing the generative model by providing the abstract in context, but in addition, an independent set of methods and models verifies the answer and checks for hallucinations. Therefore, we believe that by using our method, we can make scientists more productive, while building trust in the use of generative language models in scientific environments, where hallucinations and misinformation cannot be tolerated.
Submitted 9 February, 2024;
originally announced February 2024.
-
Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian
Authors:
Darija Medvecki,
Bojana Bašaragin,
Adela Ljajić,
Nikola Milošević
Abstract:
This paper presents the results of the first application of BERTopic, a state-of-the-art topic modeling technique, to short text written in a morphologically rich language. We applied BERTopic with three multilingual embedding models on two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian. We also compared it to LDA and NMF on fully preprocessed text. The experiments were conducted on a dataset of tweets expressing hesitancy toward COVID-19 vaccination. Our results show that with adequate parameter setting, BERTopic can yield informative topics even when applied to partially preprocessed short text. When the same parameters are applied in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. Compared to LDA and NMF, judging by the keywords, BERTopic offers more informative topics and gives novel insights when the number of topics is not limited. The findings of this paper can be significant for researchers working with other morphologically rich low-resource languages and short text.
Submitted 5 February, 2024;
originally announced February 2024.
-
De-identification of clinical free text using natural language processing: A systematic review of current approaches
Authors:
Aleksandar Kovačević,
Bojana Bašaragin,
Nikola Milošević,
Goran Nenadić
Abstract:
Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems. In addition, we aim to identify challenges and potential research opportunities in this field. Methods: A systematic search in PubMed, Web of Science and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. Results: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.
Submitted 28 November, 2023;
originally announced December 2023.
-
From Zero to Hero: Harnessing Transformers for Biomedical Named Entity Recognition in Zero- and Few-shot Contexts
Authors:
Miloš Košprdić,
Nikola Prodanović,
Adela Ljajić,
Bojana Bašaragin,
Nikola Milošević
Abstract:
Supervised named entity recognition (NER) in the biomedical domain depends on large sets of annotated texts with the given named entities. The creation of such datasets can be time-consuming and expensive, while extraction of new entities requires additional annotation tasks and retraining the model. To address these challenges, this paper proposes a method for zero- and few-shot NER in the biomedical domain. The method is based on transforming the task of multi-class token classification into binary token classification and pre-training on a large number of datasets and biomedical entities, which allows the model to learn semantic relations between the given and potentially novel named entity labels. We have achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities with a fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities with no or a limited number of examples, outperforming previous transformer-based methods, and being comparable to GPT3-based models using models with over 1000 times fewer parameters. We make models and developed code publicly available.
Submitted 25 January, 2024; v1 submitted 5 May, 2023;
originally announced May 2023.
-
A Survey of Resources and Methods for Natural Language Processing of Serbian Language
Authors:
Ulfeta A. Marovac,
Aldina R. Avdić,
Nikola Lj. Milošević
Abstract:
The Serbian language is a Slavic language spoken by over 12 million speakers and well understood by over 15 million people. In the area of natural language processing, it can be considered a low-resourced language. Also, Serbian is considered a high-inflectional language. The combination of many word inflections and low availability of language resources makes natural language processing of Serbian challenging. Nevertheless, over the past three decades, there have been a number of initiatives to develop resources and methods for natural language processing of Serbian, ranging from developing a corpus of free text from books and the internet, annotated corpora for classification and named entity recognition tasks to various methods and models performing these tasks. In this paper, we review the initiatives, resources, methods, and their availability.
Submitted 11 April, 2023;
originally announced April 2023.
-
Dynamic Split Computing for Efficient Deep Edge Intelligence
Authors:
Arian Bakhtiarnia,
Nemanja Milošević,
Qi Zhang,
Dragana Bajović,
Alexandros Iosifidis
Abstract:
Deploying deep neural networks (DNNs) on IoT and mobile devices is a challenging task due to their limited computational resources. Thus, demanding tasks are often entirely offloaded to edge servers, which can accelerate inference; however, this also incurs communication cost and raises privacy concerns. In addition, this approach leaves the computational capacity of end devices unused. Split computing is a paradigm where a DNN is split into two sections; the first section is executed on the end device, and the output is transmitted to the edge server where the final section is executed. Here, we introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel. By using natural bottlenecks that already exist in modern DNN architectures, dynamic split computing avoids retraining and hyperparameter optimization, and does not have any negative impact on the final accuracy of DNNs. Through extensive experiments, we show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
Submitted 17 June, 2022; v1 submitted 23 May, 2022;
originally announced May 2022.
-
Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise
Authors:
Dusan Jakovetic,
Dragana Bajovic,
Anit Kumar Sahu,
Soummya Kar,
Nemanja Milosevic,
Dusan Stamenkovic
Abstract:
We introduce a general framework for nonlinear stochastic gradient descent (SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed framework subsumes several popular nonlinearity choices, like clipped, normalized, signed or quantized gradient, but we also consider novel nonlinearity choices. We establish for the considered class of methods strong convergence guarantees assuming a strongly convex cost function with Lipschitz continuous gradients under very general assumptions on the gradient noise. Most notably, we show that, for a nonlinearity with bounded outputs and for the gradient noise that may not have finite moments of order greater than one, the nonlinear SGD's mean squared error (MSE), or equivalently, the expected cost function's optimality gap, converges to zero at rate $O(1/t^ζ)$, $ζ\in (0,1)$. In contrast, for the same noise setting, the linear SGD generates a sequence with unbounded variances. Furthermore, for the nonlinearities that can be decoupled component-wise, e.g., sign gradient or component-wise clipping, we show that the nonlinear SGD asymptotically (locally) achieves a $O(1/t)$ rate in the weak convergence sense and explicitly quantify the corresponding asymptotic variance. Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state-of-the-art alternatives on real data sets with heavy-tail noise.
Submitted 6 April, 2022;
originally announced April 2022.
-
Comparison of biomedical relationship extraction methods and models for knowledge graph creation
Authors:
Nikola Milosevic,
Wolfgang Thielemann
Abstract:
Biomedical research is growing at such an exponential pace that scientists, researchers, and practitioners are no longer able to cope with the amount of published literature in the domain. The knowledge presented in the literature needs to be systematized in such a way that claims and hypotheses can be easily found, accessed, and validated. Knowledge graphs can provide such a framework for semantic knowledge representation from literature. However, in order to build a knowledge graph, it is necessary to extract knowledge as relationships between biomedical entities and normalize both entities and relationship types. In this paper, we present and compare a few rule-based and machine learning-based (Naive Bayes, Random Forests as examples of traditional machine learning methods and DistilBERT, PubMedBERT, T5 and SciFive-based models as examples of modern deep learning transformers) methods for scalable relationship extraction from biomedical literature, and for the integration into the knowledge graphs. We examine how resilient these various methods are to unbalanced and fairly small datasets. Our experiments show that transformer-based models handle well both small (due to pre-training on a large dataset) and unbalanced datasets. The best performing model was the PubMedBERT-based model fine-tuned on balanced data, with a reported F1-score of 0.92. The DistilBERT-based model followed with an F1-score of 0.89, performing faster and with lower resource requirements. BERT-based models performed better than T5-based generative models.
Submitted 7 August, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
-
MASK: A flexible framework to facilitate de-identification of clinical texts
Authors:
Nikola Milosevic,
Gangamma Kalappa,
Hesam Dadafarin,
Mahmoud Azimaee,
Goran Nenadic
Abstract:
Medical health records and clinical summaries contain a vast amount of important information in textual form that can help advance research on treatments, drugs and public health. However, the majority of this information is not shared because it contains private information about patients, their families, or medical staff treating them. Regulations such as HIPAA in the US, PHIPA in Canada and GDPR regulate the protection, processing and distribution of this information. If this information is de-identified and personal information is replaced or redacted, it could be distributed to the research community. In this paper, we present MASK, a software package that is designed to perform the de-identification task. The software is able to perform named entity recognition using some of the state-of-the-art techniques and then mask or redact recognized entities. The user is able to select the named entity recognition algorithm (currently implemented are two versions of CRF-based techniques and a BiLSTM-based neural network with pre-trained GloVe and ELMo embeddings) and the masking algorithm (e.g. shift dates, replace names/locations, totally redact entity).
Submitted 9 October, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
-
Deep learning guided Android malware and anomaly detection
Authors:
Nikola Milosevic,
Junfan Huang
Abstract:
In the past decade, cyber-crime related to mobile devices has increased. Mobile devices, especially those running the Android operating system, are particularly interesting to malware creators, as users often keep the largest amount of personal information on their mobile devices, such as their contacts, social media profiles, emails, and bank accounts. Both dynamic and static malware analysis are necessary to prevent and detect malware, as both techniques have their benefits and shortcomings. In this paper, we propose a deep learning technique that relies on LSTM and encoder-decoder neural network architectures for dynamic malware analysis based on CPU, memory and battery usage. The proposed system is able to detect and notify users about anomalies in the system that are likely a consequence of malware behaviour. The method was implemented as a part of OWASP Seraphimdroid's anti-malware mechanism and notifies users about anomalies on their devices. The method achieved an F1-score of 79.2%.
Submitted 23 October, 2019;
originally announced October 2019.
-
GNTeam at 2018 n2c2: Feature-augmented BiLSTM-CRF for drug-related entity recognition in hospital discharge summaries
Authors:
Maksim Belousov,
Nikola Milosevic,
Ghada Alfattni,
Haifa Alrdahi,
Goran Nenadic
Abstract:
Monitoring the administration of drugs and adverse drug reactions are key parts of pharmacovigilance. In this paper, we explore the extraction of drug mentions and drug-related information (reason for taking a drug, route, frequency, dosage, strength, form, duration, and adverse events) from hospital discharge summaries through deep learning that relies on various representations for clinical named entity recognition. This work was officially part of the 2018 n2c2 shared task, and we use the data supplied as part of the task. We developed two deep learning architectures based on recurrent neural networks and pre-trained language models. We also explore the effect of augmenting word representations with semantic features for clinical named entity recognition. Our feature-augmented BiLSTM-CRF model achieved an F1-score of 92.67% and ranked 4th in the entity extraction sub-task among the systems submitted to the n2c2 challenge. The recurrent neural networks that use the pre-trained domain-specific word embeddings and a CRF layer for label optimization perform drug, adverse event and related entities extraction with a micro-averaged F1-score of over 91%. The augmentation of word vectors with semantic features extracted using available clinical NLP toolkits can further improve the performance. Word embeddings that are pre-trained on a large unannotated corpus of relevant documents and further fine-tuned to the task perform rather well. However, the augmentation of word embeddings with semantic features can help improve the performance (primarily by boosting precision) of drug-related named entity recognition from electronic health records.
Submitted 23 September, 2019;
originally announced September 2019.
-
Extracting adverse drug reactions and their context using sequence labelling ensembles in TAC2017
Authors:
Maksim Belousov,
Nikola Milosevic,
William Dixon,
Goran Nenadic
Abstract:
Adverse drug reactions (ADRs) are unwanted or harmful effects experienced after the administration of a certain drug or a combination of drugs, presenting a challenge for drug development and drug administration. In this paper, we present a set of taggers for extracting adverse drug reactions and related entities, including factors, severity, negations, drug class and animal. The systems used a mix of rule-based, machine learning (CRF) and deep learning (BLSTM with word2vec embeddings) methodologies in order to annotate the data. The systems were submitted to the adverse drug reaction shared task, organised during the Text Analysis Conference in 2017 by the National Institute of Standards and Technology, achieving F1-scores of 76.00 and 75.61 respectively.
Submitted 28 May, 2019;
originally announced May 2019.
-
From web crawled text to project descriptions: automatic summarizing of social innovation projects
Authors:
Nikola Milosevic,
Dimitar Marinov,
Abdullah Gok,
Goran Nenadic
Abstract:
In the past decade, social innovation projects have gained the attention of policy makers, as they address important social issues in an innovative manner. A database of social innovation is an important source of information that can expand collaboration between social innovators, drive policy and serve as an important resource for research. Such a database needs to have projects described and summarized. In this paper, we propose and compare several methods (e.g. SVM-based, recurrent neural network based, ensemble) for describing projects based on the text that is available on project websites. We also address and propose a new metric for automated evaluation of summaries based on topic modelling.
Submitted 22 May, 2019;
originally announced May 2019.
-
A framework for information extraction from tables in biomedical literature
Authors:
Nikola Milosevic,
Cassie Gregson,
Robert Hernandez,
Goran Nenadic
Abstract:
The scientific literature is growing exponentially, and professionals are no longer able to cope with the current amount of publications. Text mining has in the past provided methods to retrieve and extract information from text; however, most of these approaches ignored tables and figures. The research done in mining table data still does not have an integrated approach for mining that would consider all complexities and challenges of a table. Our research examines methods for extracting numerical (number of patients, age, gender distribution) and textual (adverse reactions) information from tables in the clinical literature. We present a requirement analysis template and an integral methodology for information extraction from tables in the clinical domain that contains 7 steps: (1) table detection, (2) functional processing, (3) structural processing, (4) semantic tagging, (5) pragmatic processing, (6) cell selection and (7) syntactic processing and extraction. Our approach achieved F-measures ranging between 82% and 92%, depending on the variable, task and its complexity.
Submitted 26 February, 2019;
originally announced February 2019.
-
Creating a contemporary corpus of similes in Serbian by using natural language processing
Authors:
Nikola Milosevic,
Goran Nenadic
Abstract:
A simile is a figure of speech that compares two things through the use of connection words, but where the comparison is not intended to be taken literally. Similes are often used in everyday communication, but they are also a part of linguistic cultural heritage. In this paper we present a methodology for semi-automated collection of similes from the World Wide Web using text mining and machine learning techniques. We expanded an existing corpus by collecting 442 similes from the internet and adding them to the existing corpus collected by Vuk Stefanovic Karadzic that contained 333 similes. We also introduce crowdsourcing to the collection of figures of speech, which helped us to build a corpus containing 787 unique similes.
Submitted 22 November, 2018;
originally announced November 2018.
-
As Cool as a Cucumber: Towards a Corpus of Contemporary Similes in Serbian
Authors:
Nikola Milosevic,
Goran Nenadic
Abstract:
Similes are natural language expressions used to compare unlikely things, where the comparison is not taken literally. They are often used in everyday communication and are an important part of cultural heritage. Having an up-to-date corpus of similes is challenging, as they are constantly coined and/or adapted to contemporary times. In this paper we present a methodology for semi-automated collection of similes from the World Wide Web using text mining techniques. We expanded an existing corpus of traditional similes (containing 333 similes) by collecting 446 additional expressions. We also explore how crowdsourcing can be used to extract and curate new similes.
Submitted 20 May, 2016;
originally announced May 2016.
-
Equity forecast: Predicting long term stock price movement using machine learning
Authors:
Nikola Milosevic
Abstract:
Long-term investment is one of the major investment strategies. However, calculating the intrinsic value of a company and evaluating shares for long-term investment is not easy, since analysts have to consider a large number of financial indicators and evaluate them in the right manner. So far, machines have provided little help in predicting the direction of a company's value over a longer period of time. In this paper we present a machine learning aided approach to evaluating an equity's future price over the long term. Our method is able to correctly predict whether a company's value will be 10% higher or not over the period of one year in 76.5% of cases.
Submitted 22 November, 2018; v1 submitted 2 March, 2016;
originally announced March 2016.
-
Marvin: Semantic annotation using multiple knowledge sources
Authors:
Nikola Milosevic
Abstract:
People are producing more written material than at any time in history. The increase is so high that professionals from various fields are no longer able to cope with this amount of publications. Text mining can offer tools to help them, and one of the tools that can aid information retrieval and information extraction is semantic text annotation. In this report we present Marvin, a text annotator written in Java, which can be used as a command line tool and as a Java library. Marvin is able to annotate text using multiple sources, including WordNet, MetaMap, DBPedia and thesauri represented as SKOS.
Submitted 2 February, 2016; v1 submitted 1 February, 2016;
originally announced February 2016.
-
History of malware
Authors:
Nikola Milošević
Abstract:
In the past three decades almost everything has changed in the field of malware and malware analysis: from malware created as proof of some security concept and malware created for financial gain to malware created to sabotage infrastructure. In this work we focus on the history and evolution of malware and describe the most important malware.
Submitted 16 January, 2014; v1 submitted 21 February, 2013;
originally announced February 2013.
-
Stemmer for Serbian language
Authors:
Nikola Milošević
Abstract:
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form; generally a written word form. This work presents a suffix-stripping stemmer for the Serbian language, one of the highly inflectional languages.
Submitted 20 September, 2012;
originally announced September 2012.