-
ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification
Authors:
Taja Kuzman,
Igor Mozetič,
Nikola Ljubešić
Abstract:
ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets, manually annota…
▽ More
ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets, manually annotated with genres. The models are compared on test sets in two languages: English and Slovenian. Results show that ChatGPT outperforms the fine-tuned model when applied to the dataset which was not seen before by either of the models. Even when applied on Slovenian language as an under-resourced language, ChatGPT's performance is no worse than when applied to English. However, if the model is fully prompted in Slovenian, the performance drops significantly, showing the current limitations of ChatGPT usage on smaller languages. The presented results lead us to questioning whether this is the beginning of an end of laborious manual annotation campaigns even for smaller languages, such as Slovenian.
△ Less
Submitted 8 March, 2023; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Retweet communities reveal the main sources of hate speech
Authors:
Bojan Evkoski,
Andraz Pelicon,
Igor Mozetic,
Nikola Ljubesic,
Petra Kralj Novak
Abstract:
We address a challenging problem of identifying main sources of hate speech on Twitter. On one hand, we carefully annotate a large set of tweets for hate speech, and deploy advanced deep learning to produce high quality hate speech classification models. On the other hand, we create retweet networks, detect communities and monitor their evolution through time. This combined approach is applied to…
▽ More
We address a challenging problem of identifying main sources of hate speech on Twitter. On one hand, we carefully annotate a large set of tweets for hate speech, and deploy advanced deep learning to produce high quality hate speech classification models. On the other hand, we create retweet networks, detect communities and monitor their evolution through time. This combined approach is applied to three years of Slovenian Twitter data. We report a number of interesting results. Hate speech is dominated by offensive tweets, related to political and ideological issues. The share of unacceptable tweets is moderately increasing with time, from the initial 20% to 30% by the end of 2020. Unacceptable tweets are retweeted significantly more often than acceptable tweets. About 60% of unacceptable tweets are produced by a single right-wing community of only moderate size. Institutional Twitter accounts and media accounts post significantly less unacceptable tweets than individual accounts. In fact, the main sources of unacceptable tweets are anonymous accounts, and accounts that were suspended or closed during the years 2018-2020.
△ Less
Submitted 17 March, 2022; v1 submitted 31 May, 2021;
originally announced May 2021.
-
Online Hate: Behavioural Dynamics and Relationship with Misinformation
Authors:
Matteo Cinelli,
Andraž Pelicon,
Igor Mozetič,
Walter Quattrociocchi,
Petra Kralj Novak,
Fabiana Zollo
Abstract:
Online debates are often characterised by extreme polarisation and heated discussions among users. The presence of hate speech online is becoming increasingly problematic, making necessary the development of appropriate countermeasures. In this work, we perform hate speech detection on a corpus of more than one million comments on YouTube videos through a machine learning model fine-tuned on a lar…
▽ More
Online debates are often characterised by extreme polarisation and heated discussions among users. The presence of hate speech online is becoming increasingly problematic, making necessary the development of appropriate countermeasures. In this work, we perform hate speech detection on a corpus of more than one million comments on YouTube videos through a machine learning model fine-tuned on a large set of hand-annotated data. Our analysis shows that there is no evidence of the presence of "serial haters", intended as active users posting exclusively hateful comments. Moreover, coherently with the echo chamber hypothesis, we find that users skewed towards one of the two categories of video channels (questionable, reliable) are more prone to use inappropriate, violent, or hateful language within their opponents community. Interestingly, users loyal to reliable sources use on average a more toxic language than their counterpart. Finally, we find that the overall toxicity of the discussion increases with its length, measured both in terms of number of comments and time. Our results show that, coherently with Godwin's law, online debates tend to degenerate towards increasingly toxic exchanges of views.
△ Less
Submitted 28 May, 2021;
originally announced May 2021.
-
Community evolution in retweet networks
Authors:
Bojan Evkoski,
Igor Mozetic,
Nikola Ljubesic,
Petra Kralj Novak
Abstract:
Communities in social networks often reflect close social ties between their members and their evolution through time. We propose an approach that tracks two aspects of community evolution in retweet networks: flow of the members in, out and between the communities, and their influence. We start with high resolution time windows, and then select several timepoints which exhibit large differences b…
▽ More
Communities in social networks often reflect close social ties between their members and their evolution through time. We propose an approach that tracks two aspects of community evolution in retweet networks: flow of the members in, out and between the communities, and their influence. We start with high resolution time windows, and then select several timepoints which exhibit large differences between the communities. For community detection, we propose a two-stage approach. In the first stage, we apply an enhanced Louvain algorithm, called Ensemble Louvain, to find stable communities. In the second stage, we form influence links between these communities, and identify linked super-communities. For the detected communities, we compute internal and external influence, and for individual users, the retweet h-index influence. We apply the proposed approach to three years of Twitter data of all Slovenian tweets. The analysis shows that the Slovenian tweetosphere is dominated by politics, that the left-leaning communities are larger, but that the right-leaning communities and users exhibit significantly higher impact. An interesting observation is that retweet networks change relatively gradually, despite such events as the emergence of the Covid-19 pandemic or the change of government.
△ Less
Submitted 2 September, 2021; v1 submitted 13 May, 2021;
originally announced May 2021.
-
Cross-lingual Transfer of Sentiment Classifiers
Authors:
Marko Robnik-Sikonja,
Kristjan Reba,
Igor Mozetic
Abstract:
Word embeddings represent words in a numeric space so that semantic relations between words are represented as distances and directions in the vector space. Cross-lingual word embeddings transform vector spaces of different languages so that similar words are aligned. This is done by constructing a map** between vector spaces of two languages or learning a joint vector space for multiple languag…
▽ More
Word embeddings represent words in a numeric space so that semantic relations between words are represented as distances and directions in the vector space. Cross-lingual word embeddings transform vector spaces of different languages so that similar words are aligned. This is done by constructing a map** between vector spaces of two languages or learning a joint vector space for multiple languages. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We use cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between 13 languages. We focus on two transfer mechanisms that recently show superior transfer performance. The first mechanism uses the trained models whose input is the joint numerical space for many languages as implemented in the LASER library. The second mechanism uses large pretrained multilingual BERT language models. Our experiments show that the transfer of models between similar languages is sensible, even with no target language data. The performance of cross-lingual models obtained with the multilingual BERT and LASER library is comparable, and the differences are language-dependent. The transfer with CroSloEngual BERT, pretrained on only three languages, is superior on these and some closely related languages.
△ Less
Submitted 24 March, 2021; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Evaluating time series forecasting models: An empirical study on performance estimation methods
Authors:
Vitor Cerqueira,
Luis Torgo,
Igor Mozetic
Abstract:
Performance estimation aims at estimating the loss that a predictive model will incur on unseen data. These procedures are part of the pipeline in every machine learning project and are used for assessing the overall generalisation ability of predictive models. In this paper we address the application of these methods to time series forecasting tasks. For independent and identically distributed da…
▽ More
Performance estimation aims at estimating the loss that a predictive model will incur on unseen data. These procedures are part of the pipeline in every machine learning project and are used for assessing the overall generalisation ability of predictive models. In this paper we address the application of these methods to time series forecasting tasks. For independent and identically distributed data the most common approach is cross-validation. However, the dependency among observations in time series raises some caveats about the most appropriate way to estimate performance in this type of data and currently there is no settled way to do so. We compare different variants of cross-validation and of out-of-sample approaches using two case studies: One with 62 real-world time series and another with three synthetic time series. Results show noticeable differences in the performance estimation methods in the two scenarios. In particular, empirical experiments suggest that cross-validation approaches can be applied to stationary time series. However, in real-world scenarios, when different sources of non-stationary variation are at play, the most accurate estimates are produced by out-of-sample methods that preserve the temporal order of observations.
△ Less
Submitted 28 May, 2019;
originally announced May 2019.
-
Forex trading and Twitter: Spam, bots, and reputation manipulation
Authors:
Igor Mozetič,
Peter Gabrovšek,
Petra Kralj Novak
Abstract:
Currency trading (Forex) is the largest world market in terms of volume. We analyze trading and tweeting about the EUR-USD currency pair over a period of three years. First, a large number of tweets were manually labeled, and a Twitter stance classification model is constructed. The model then classifies all the tweets by the trading stance signal: buy, hold, or sell (EUR vs. USD). The Twitter sta…
▽ More
Currency trading (Forex) is the largest world market in terms of volume. We analyze trading and tweeting about the EUR-USD currency pair over a period of three years. First, a large number of tweets were manually labeled, and a Twitter stance classification model is constructed. The model then classifies all the tweets by the trading stance signal: buy, hold, or sell (EUR vs. USD). The Twitter stance is compared to the actual currency rates by applying the event study methodology, well-known in financial economics. It turns out that there are large differences in Twitter stance distribution and potential trading returns between the four groups of Twitter users: trading robots, spammers, trading companies, and individual traders. Additionally, we observe attempts of reputation manipulation by post festum removal of tweets with poor predictions, and deleting/reposting of identical tweets to increase the visibility without tainting one's Twitter timeline.
△ Less
Submitted 16 April, 2018; v1 submitted 6 April, 2018;
originally announced April 2018.
-
How to evaluate sentiment classifiers for Twitter time-ordered data?
Authors:
Igor Mozetič,
Luis Torgo,
Vitor Cerqueira,
Jasmina Smailović
Abstract:
Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, stock market, etc. In this paper we focus on sentiment classification of Twitter data. Construction of sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them as there is no settled way to do so.…
▽ More
Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, stock market, etc. In this paper we focus on sentiment classification of Twitter data. Construction of sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluations. The corresponding 138 in-sample datasets are used to empirically compare six different estimation procedures: three variants of cross-validation, and three variants of sequential validation (where test set always follows the training set). We find no significant difference between the best cross-validation and sequential validation. However, we observe that all cross-validation variants tend to overestimate the performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.
△ Less
Submitted 14 March, 2018;
originally announced March 2018.
-
Twitter Sentiment around the Earnings Announcement Events
Authors:
Peter Gabrovsek,
Darko Aleksovski,
Igor Mozetic,
Miha Grcar
Abstract:
We investigate the relationship between social media, Twitter in particular, and stock market. We provide an in-depth analysis of the Twitter volume and sentiment about the 30 companies in the Dow Jones Industrial Average index, over a period of three years. We focus on Earnings Announcements and show that there is a considerable difference with respect to when the announcements are made: before t…
▽ More
We investigate the relationship between social media, Twitter in particular, and stock market. We provide an in-depth analysis of the Twitter volume and sentiment about the 30 companies in the Dow Jones Industrial Average index, over a period of three years. We focus on Earnings Announcements and show that there is a considerable difference with respect to when the announcements are made: before the market opens or after the market closes. The two different timings of the Earnings Announcements were already investigated in the financial literature, but not yet in the social media. We analyze the differences in terms of the Twitter volumes, cumulative abnormal returns, trade returns, and earnings surprises.
We report mixed results. On the one hand, we show that the Twitter sentiment (the collective opinion of the users) on the day of the announcement very well reflects the stock moves on the same day. We demonstrate this by applying the event study methodology, where the polarity of the Earnings Announcements is computed from the Twitter sentiment. Cumulative abnormal returns are high (2--4\%) and statistically significant. On the other hand, we find only weak predictive power of the Twitter sentiment one day in advance. It turns out that it is important how to account for the announcements made after the market closes. These after-hours announcements draw high Twitter activity immediately, but volume and price changes in trading are observed only on the next day. On the day before the announcements, the Twitter volume is low, and the sentiment has very weak predictive power. A useful lesson learned is the importance of the proper alignment between the announcements, trading and Twitter data.
△ Less
Submitted 9 January, 2017; v1 submitted 7 November, 2016;
originally announced November 2016.
-
Cohesion and Coalition Formation in the European Parliament: Roll-Call Votes and Twitter Activities
Authors:
Darko Cherepnalkoski,
Andreas Karpf,
Igor Mozetic,
Miha Grcar
Abstract:
We study the cohesion within and the coalitions between political groups in the Eighth European Parliament (2014--2019) by analyzing two entirely different aspects of the behavior of the Members of the European Parliament (MEPs) in the policy-making processes. On one hand, we analyze their co-voting patterns and, on the other, their retweeting behavior. We make use of two diverse datasets in the a…
▽ More
We study the cohesion within and the coalitions between political groups in the Eighth European Parliament (2014--2019) by analyzing two entirely different aspects of the behavior of the Members of the European Parliament (MEPs) in the policy-making processes. On one hand, we analyze their co-voting patterns and, on the other, their retweeting behavior. We make use of two diverse datasets in the analysis. The first one is the roll-call vote dataset, where cohesion is regarded as the tendency to co-vote within a group, and a coalition is formed when the members of several groups exhibit a high degree of co-voting agreement on a subject. The second dataset comes from Twitter; it captures the retweeting (i.e., endorsing) behavior of the MEPs and implies cohesion (retweets within the same group) and coalitions (retweets between groups) from a completely different perspective.
We employ two different methodologies to analyze the cohesion and coalitions. The first one is based on Krippendorff's Alpha reliability, used to measure the agreement between raters in data-analysis scenarios, and the second one is based on Exponential Random Graph Models, often used in social-network analysis. We give general insights into the cohesion of political groups in the European Parliament, explore whether coalitions are formed in the same way for different policy areas, and examine to what degree the retweeting behavior of MEPs corresponds to their co-voting patterns. A novel and interesting aspect of our work is the relationship between the co-voting and retweeting patterns.
△ Less
Submitted 14 October, 2016; v1 submitted 17 August, 2016;
originally announced August 2016.
-
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
Authors:
Igor Mozetic,
Miha Grcar,
Jasmina Smailovic
Abstract:
What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate th…
▽ More
What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered.
△ Less
Submitted 5 May, 2016; v1 submitted 24 February, 2016;
originally announced February 2016.
-
Sentiment of Emojis
Authors:
Petra Kralj Novak,
Jasmina Smailović,
Borut Sluban,
Igor Mozetič
Abstract:
There is a new generation of emoticons, called emojis, that is increasingly being used in mobile communications and social media. In the past two years, over ten billion emojis were used on Twitter. Emojis are Unicode graphic symbols, used as a shorthand to express concepts and ideas. In contrast to the small number of well-known emoticons that carry clear emotional contents, there are hundreds of…
▽ More
There is a new generation of emoticons, called emojis, that is increasingly being used in mobile communications and social media. In the past two years, over ten billion emojis were used on Twitter. Emojis are Unicode graphic symbols, used as a shorthand to express concepts and ideas. In contrast to the small number of well-known emoticons that carry clear emotional contents, there are hundreds of emojis. But what are their emotional contents? We provide the first emoji sentiment lexicon, called the Emoji Sentiment Ranking, and draw a sentiment map of the 751 most frequently used emojis. The sentiment of the emojis is computed from the sentiment of the tweets in which they occur. We engaged 83 human annotators to label over 1.6 million tweets in 13 European languages by the sentiment polarity (negative, neutral, or positive). About 4% of the annotated tweets contain emojis. The sentiment analysis of the emojis allows us to draw several interesting conclusions. It turns out that most of the emojis are positive, especially the most popular ones. The sentiment distribution of the tweets with and without emojis is significantly different. The inter-annotator agreement on the tweets with emojis is higher. Emojis tend to occur at the end of the tweets, and their sentiment polarity increases with the distance. We observe no significant differences in the emoji rankings between the 13 languages and the Emoji Sentiment Ranking. Consequently, we propose our Emoji Sentiment Ranking as a European language-independent resource for automated sentiment analysis. Finally, the paper provides a formalization of sentiment and a novel visualization in the form of a sentiment bar.
△ Less
Submitted 8 December, 2015; v1 submitted 25 September, 2015;
originally announced September 2015.
-
Analysis of Financial News with NewsStream
Authors:
Petra Kralj Novak,
Miha Grcar,
Borut Sluban,
Igor Mozetic
Abstract:
Unstructured data, such as news and blogs, can provide valuable insights into the financial world. We present the NewsStream portal, an intuitive and easy-to-use tool for news analytics, which supports interactive querying and visualizations of the documents at different levels of detail. It relies on a scalable architecture for real-time processing of a continuous stream of textual data, which in…
▽ More
Unstructured data, such as news and blogs, can provide valuable insights into the financial world. We present the NewsStream portal, an intuitive and easy-to-use tool for news analytics, which supports interactive querying and visualizations of the documents at different levels of detail. It relies on a scalable architecture for real-time processing of a continuous stream of textual data, which incorporates data acquisition, cleaning, natural-language preprocessing and semantic annotation components. It has been running for over two years and collected over 18 million news articles and blog posts. The NewsStream portal can be used to answer the questions when, how often, in what context, and with what sentiment was a financial entity or term mentioned in a continuous stream of news and blogs, and therefore providing a complement to news aggregators. We illustrate some features of our system in three use cases: relations between the rating agencies and the PIIGS countries, reflection of financial news on credit default swap (CDS) prices, the emergence of the Bitcoin digital currency, and visualizing how the world is connected through news.
△ Less
Submitted 7 November, 2015; v1 submitted 31 July, 2015;
originally announced August 2015.
-
The Effects of Twitter Sentiment on Stock Price Returns
Authors:
Gabriele Ranco,
Darko Aleksovski,
Guido Caldarelli,
Miha Grčar,
Igor Mozetič
Abstract:
Social media are increasingly reflecting and influencing behavior of other complex systems. In this paper we investigate the relations between a well-know micro-blogging platform Twitter and financial markets. In particular, we consider, in a period of 15 months, the Twitter volume and sentiment about the 30 stock companies that form the Dow Jones Industrial Average (DJIA) index. We find a relativ…
▽ More
Social media are increasingly reflecting and influencing behavior of other complex systems. In this paper we investigate the relations between a well-know micro-blogging platform Twitter and financial markets. In particular, we consider, in a period of 15 months, the Twitter volume and sentiment about the 30 stock companies that form the Dow Jones Industrial Average (DJIA) index. We find a relatively low Pearson correlation and Granger causality between the corresponding time series over the entire time period. However, we find a significant dependence between the Twitter sentiment and abnormal returns during the peaks of Twitter volume. This is valid not only for the expected Twitter volume peaks (e.g., quarterly announcements), but also for peaks corresponding to less obvious events. We formalize the procedure by adapting the well-known "event study" from economics and finance to the analysis of Twitter data. The procedure allows to automatically identify events as Twitter volume peaks, to compute the prevailing sentiment (positive or negative) expressed in tweets at these peaks, and finally to apply the "event study" methodology to relate them to stock returns. We show that sentiment polarity of Twitter peaks implies the direction of cumulative abnormal returns. The amount of cumulative abnormal returns is relatively low (about 1-2%), but the dependence is statistically significant for several days after the events.
△ Less
Submitted 11 August, 2015; v1 submitted 8 June, 2015;
originally announced June 2015.
-
Emotional Dynamics in the Age of Misinformation
Authors:
Fabiana Zollo,
Petra Kralj Novak,
Michela Del Vicario,
Alessandro Bessi,
Igor Mozetic,
Antonio Scala,
Guido Caldarelli,
Walter Quattrociocchi
Abstract:
According to the World Economic Forum, the diffusion of unsubstantiated rumors on online social media is one of the main threats for our society.
The disintermediated paradigm of content production and consumption on online social media might foster the formation of homophile communities (echo-chambers) around specific worldviews. Such a scenario has been shown to be a vivid environment for the…
▽ More
According to the World Economic Forum, the diffusion of unsubstantiated rumors on online social media is one of the main threats for our society.
The disintermediated paradigm of content production and consumption on online social media might foster the formation of homophile communities (echo-chambers) around specific worldviews. Such a scenario has been shown to be a vivid environment for the diffusion of false claims, in particular with respect to conspiracy theories. Not rarely, viral phenomena trigger naive (and funny) social responses -- e.g., the recent case of Jade Helm 15 where a simple military exercise turned out to be perceived as the beginning of the civil war in the US. In this work, we address the emotional dynamics of collective debates around distinct kind of news -- i.e., science and conspiracy news -- and inside and across their respective polarized communities (science and conspiracy news).
Our findings show that comments on conspiracy posts tend to be more negative than on science posts. However, the more the engagement of users, the more they tend to negative commenting (both on science and conspiracy). Finally, zooming in at the interaction among polarized communities, we find a general negative pattern. As the number of comments increases -- i.e., the discussion becomes longer -- the sentiment of the post is more and more negative.
△ Less
Submitted 29 May, 2015;
originally announced May 2015.
-
Twitter-based analysis of the dynamics of collective attention to political parties
Authors:
Young-Ho Eom,
Michelangelo Puliga,
Jasmina Smailović,
Igor Mozetič,
Guido Caldarelli
Abstract:
Large-scale data from social media have a significant potential to describe complex phenomena in real world and to anticipate collective behaviors such as information spreading and social trends. One specific case of study is represented by the collective attention to the action of political parties. Not surprisingly, researchers and stakeholders tried to correlate parties' presence on social medi…
▽ More
Large-scale data from social media have a significant potential to describe complex phenomena in real world and to anticipate collective behaviors such as information spreading and social trends. One specific case of study is represented by the collective attention to the action of political parties. Not surprisingly, researchers and stakeholders tried to correlate parties' presence on social media with their performances in elections. Despite the many efforts, results are still inconclusive since this kind of data is often very noisy and significant signals could be covered by (largely unknown) statistical fluctuations. In this paper we consider the number of tweets (tweet volume) of a party as a proxy of collective attention to the party, identify the dynamics of the volume, and show that this quantity has some information on the elections outcome. We find that the distribution of the tweet volume for each party follows a log-normal distribution with a positive autocorrelation of the volume over short terms, which indicates the volume has large fluctuations of the log-normal distribution yet with a short-term tendency. Furthermore, by measuring the ratio of two consecutive daily tweet volumes, we find that the evolution of the daily volume of a party can be described by means of a geometric Brownian motion (i.e., the logarithm of the volume moves randomly with a trend). Finally, we determine the optimal period of averaging tweet volume for reducing fluctuations and extracting short-term tendencies. We conclude that the tweet volume is a good indicator of parties' success in the elections when considered over an optimal time window. Our study identifies the statistical nature of collective attention to political issues and sheds light on how to model the dynamics of collective attention in social media.
△ Less
Submitted 14 July, 2015; v1 submitted 26 April, 2015;
originally announced April 2015.
-
Extraction of Temporal Networks from Term Co-occurrences in Online Textual Sources
Authors:
Marko Popović,
Hrvoje Štefančić,
Borut Sluban,
Petra Kralj Novak,
Miha Grčar,
Igor Mozetič,
Michelangelo Puliga,
Vinko Zlatić
Abstract:
A stream of unstructured news can be a valuable source of hidden relations between different entities, such as financial institutions, countries, or persons. We present an approach to continuously collect online news, recognize relevant entities in them, and extract time-varying networks. The nodes of the network are the entities, and the links are their co-occurrences. We present a method to esti…
▽ More
A stream of unstructured news can be a valuable source of hidden relations between different entities, such as financial institutions, countries, or persons. We present an approach to continuously collect online news, recognize relevant entities in them, and extract time-varying networks. The nodes of the network are the entities, and the links are their co-occurrences. We present a method to estimate the significance of co-occurrences, and a benchmark model against which their robustness is evaluated. The approach is applied to a large set of financial news, collected over a period of two years. The entities we consider are 50 countries which issue sovereign bonds, and which are insured by Credit Default Swaps (CDS) in turn. We compare the country co-occurrence networks to the CDS networks constructed from the correlations between the CDS. The results show relatively small, but significant overlap between the networks extracted from the news and those from the CDS correlations.
△ Less
Submitted 20 June, 2014;
originally announced June 2014.
-
News Cohesiveness: an Indicator of Systemic Risk in Financial Markets
Authors:
Matija Piškorec,
Nino Antulov-Fantulin,
Petra Kralj Novak,
Igor Mozetič,
Miha Grčar,
Irena Vodenska,
Tomislav Šmuc
Abstract:
Motivated by recent financial crises significant research efforts have been put into studying contagion effects and herding behaviour in financial markets. Much less has been said about influence of financial news on financial markets. We propose a novel measure of collective behaviour in financial news on the Web, News Cohesiveness Index (NCI), and show that it can be used as a systemic risk indi…
▽ More
Motivated by recent financial crises significant research efforts have been put into studying contagion effects and herding behaviour in financial markets. Much less has been said about influence of financial news on financial markets. We propose a novel measure of collective behaviour in financial news on the Web, News Cohesiveness Index (NCI), and show that it can be used as a systemic risk indicator. We evaluate the NCI on financial documents from large Web news sources on a daily basis from October 2011 to July 2013 and analyse the interplay between financial markets and financially related news. We hypothesized that strong cohesion in financial news reflects movements in the financial markets. Cohesiveness is more general and robust measure of systemic risk expressed in news, than measures based on simple occurrences of specific terms. Our results indicate that cohesiveness in the financial news is highly correlated with and driven by volatility on the financial markets.
△ Less
Submitted 14 February, 2014;
originally announced February 2014.