Boosting classification reliability of NLP transformer models in the long run

Authors: Zoltán Kmetty, Bence Kollányi, Krisztián Boros

Abstract: Transformer-based machine learning models have become an essential tool for many natural language processing (NLP) tasks since the introduction of the method. A common objective of these projects is to classify text data. Classification models are often extended to a different topic and/or time period. In these situations, deciding how long a classification is suitable for and when it is worth re-training our model is difficult. This paper compares different approaches to fine-tuning a BERT model for a long-running classification task. We use data from different periods to fine-tune our original BERT model, and we also measure how a second round of annotation could boost the classification quality. Our corpus contains over 8 million comments on COVID-19 vaccination in Hungary posted between September 2020 and December 2021. Our results show that the best solution is using all available unlabeled comments to fine-tune a model. It is not advisable to focus only on comments containing words that our model has not encountered before; a more efficient solution is to randomly sample comments from the new period. Fine-tuning does not prevent the model from losing performance but merely slows it down. In a rapidly changing linguistic environment, it is not possible to maintain model performance without regularly annotating new text.

Submitted 20 February, 2023; originally announced February 2023.

Comments: 18 pages, 3 figures

arXiv:2002.12069 [pdf]

Junk News & Information Sharing During the 2019 UK General Election

Authors: Nahema Marchal, Bence Kollanyi, Lisa-Maria Neudert, Hubert Au, Philip N. Howard

Abstract: Today, an estimated 75% of the British public access information about politics and public life online, and 40% do so via social media. With this context in mind, we investigate information sharing patterns over social media in the lead-up to the 2019 UK General Elections, and ask: (1) What type of political news and information were social media users sharing on Twitter ahead of the vote? (2) How much of it is extremist, sensationalist, or conspiratorial junk news? (3) How much public engagement did these sites get on Facebook in the weeks leading up to the vote? And (4) What are the most common narratives and themes relayed by junk news outlets?

Submitted 27 February, 2020; originally announced February 2020.

arXiv:1901.07920 [pdf, other]

The Junk News Aggregator: Examining junk news posted on Facebook, starting with the 2018 US Midterm Elections

Authors: Dimitra Liotsiou, Bence Kollanyi, Philip N. Howard

Abstract: In recent years, the phenomenon of online misinformation and junk news circulating on social media has come to constitute an important and widespread problem affecting public life online across the globe, particularly around important political events such as elections. At the same time, there have been calls for more transparency around misinformation on social media platforms, as many of the most popular social media platforms function as "walled gardens," where it is impossible for researchers and the public to readily examine the scale and nature of misinformation activity as it unfolds on the platforms. In order to help address this, we present the Junk News Aggregator, a publicly available interactive web tool, which allows anyone to examine, in near real-time, all of the public content posted to Facebook by important junk news sources in the US. It allows the public to gain access to and examine the latest articles posted on Facebook (the most popular social media platform in the US and one where content is not readily accessible at scale from the open Web), as well as organise them by time, news publisher, and keywords of interest, and sort them based on all eight engagement metrics available on Facebook. Therefore, the Aggregator allows the public to gain insights on the volume, content, key themes, and types and volumes of engagement received by content posted by junk news publishers, in near real-time, hence opening up and offering transparency in these activities as they unfold, at scale across the top most popular junk news publishers.
In this way, the Aggregator can help increase transparency around the nature, volume, and engagement with junk news on social media, and serve as a media literacy tool for the public.

Submitted 17 April, 2019; v1 submitted 23 January, 2019; originally announced January 2019.

arXiv:1803.01845 [pdf]

Polarization, Partisanship and Junk News Consumption over Social Media in the US

Authors: Vidya Narayanan, Vlad Barash, John Kelly, Bence Kollanyi, Lisa-Maria Neudert, Philip N. Howard

Abstract: What kinds of social media users read junk news? We examine the distribution of the most significant sources of junk news in the three months before President Donald Trump's first State of the Union Address. Drawing on a list of sources that consistently publish political news and information that is extremist, sensationalist, conspiratorial, masked commentary, fake news and other forms of junk news, we find that the distribution of such content is unevenly spread across the ideological spectrum. We demonstrate that (1) on Twitter, a network of Trump supporters shares the widest range of known junk news sources and circulates more junk news than all the other groups put together; (2) on Facebook, extreme hard right pages, distinct from Republican pages, share the widest range of known junk news sources and circulate more junk news than all the other audiences put together; (3) on average, the audiences for junk news on Twitter share a wider range of known junk news sources than audiences on Facebook public pages.

Submitted 4 March, 2018; originally announced March 2018.

Comments: arXiv admin note: text overlap with arXiv:1802.03572

Report number: Data Memo 2018.1

arXiv:1802.03573 [pdf]

Social Media, News and Political Information during the US Election: Was Polarizing Content Concentrated in Swing States?

Authors: Philip N. Howard, Bence Kollanyi, Samantha Bradshaw, Lisa-Maria Neudert

Abstract: US voters shared large volumes of polarizing political news and information in the form of links to content from Russian, WikiLeaks and junk news sources. Was this low-quality political information distributed evenly around the country, or concentrated in swing states and particular parts of the country? In this data memo we apply a tested dictionary of sources about political news and information being shared over Twitter over a ten-day period around the 2016 Presidential Election. Using self-reported location information, we place a third of users by state and create a simple index for the distribution of polarizing content around the country. We find that (1) nationally, Twitter users got more misinformation, polarizing and conspiratorial content than professionally produced news. (2) Users in some states, however, shared more polarizing political news and information than users in other states. (3) Average levels of misinformation were higher in swing states than in uncontested states, even when weighted for the relative size of the user population in each state. We conclude with some observations about the impact of strategically disseminated polarizing information on public life.

Submitted 10 February, 2018; originally announced February 2018.

Comments: Data Memo

arXiv:1606.06356 [pdf]

Bots, #StrongerIn, and #Brexit: Computational Propaganda during the UK-EU Referendum

Authors: Philip N. Howard, Bence Kollanyi

Abstract: Bots are social media accounts that automate interaction with other users, and they are active on the StrongerIn-Brexit conversation happening over Twitter. These automated scripts generate content through these platforms and then interact with people. Political bots are automated accounts that are particularly active on public policy issues, elections, and political crises. In this preliminary study on the use of political bots during the UK referendum on EU membership, we analyze the tweeting patterns for both human users and bots. We find that political bots have a small but strategic role in the referendum conversations: (1) the family of hashtags associated with the argument for leaving the EU dominates, (2) different perspectives on the issue utilize different levels of automation, and (3) less than 1 percent of sampled accounts generate almost a third of all the messages.

Submitted 20 June, 2016; originally announced June 2016.

Comments: 6 pages, 1 figure, 2 tables

Report number: 2016-1

Showing 1–6 of 6 results for author: Kollanyi, B