Search | arXiv e-print repository

arXiv:2405.11897

CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera, Muhammad Imran

Abstract: During times of crisis, social media platforms play a vital role in facilitating communication and coordinating resources. Amidst chaos and uncertainty, communities often rely on these platforms to share urgent pleas for help, extend support, and organize relief efforts. However, the sheer volume of conversations during such periods, which can escalate to unprecedented levels, necessitates the automated identification and matching of requests and offers to streamline relief operations. This study addresses the challenge of efficiently identifying and matching assistance requests and offers on social media platforms during emergencies. We propose CReMa (Crisis Response Matcher), a systematic approach that integrates textual, temporal, and spatial features for multi-lingual request-offer matching. By leveraging CrisisTransformers, a set of pre-trained models specific to crises, and a cross-lingual embedding space, our methodology enhances the identification and matching tasks while outperforming strong baselines such as RoBERTa, MPNet, and BERTweet in classification tasks, and Universal Sentence Encoder and Sentence Transformers in crisis embeddings generation tasks. We introduce a novel multi-lingual dataset that simulates scenarios of help-seeking and offering assistance on social media across the 16 most commonly used languages in Australia. We conduct comprehensive cross-lingual experiments across these 16 languages, while also examining trade-offs between multiple vector search strategies and accuracy. Additionally, we analyze a million-scale geotagged global dataset to comprehend patterns in relation to seeking help and offering assistance on social media. Overall, these contributions advance the field of crisis informatics and provide benchmarks for future research in the area.

Submitted 20 May, 2024; originally announced May 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2403.16614

Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Abstract: Tasks such as semantic search and clustering on crisis-related social media texts enhance our comprehension of crisis discourse, aiding decision-making and targeted interventions. Pre-trained language models have advanced performance in crisis informatics, but their contextual embeddings lack semantic meaningfulness. Although the CrisisTransformers family includes a sentence encoder to address the semanticity issue, it remains monolingual, processing only English texts. Furthermore, employing separate models for different languages leads to embeddings in distinct vector spaces, introducing challenges when comparing semantic similarities between multi-lingual texts. Therefore, we propose multi-lingual sentence encoders (CT-XLMR-SE and CT-mBERT-SE) that embed crisis-related social media texts for over 50 languages, such that texts with similar meanings are in close proximity within the same vector space, irrespective of language diversity. Results in sentence encoding and sentence matching tasks are promising, suggesting these models could serve as robust baselines when embedding multi-lingual crisis-related social media texts. The models are publicly available at: https://huggingface.co/crisistransformers.

Submitted 25 March, 2024; originally announced March 2024.

Comments: Accepted to ISCRAM 2024
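
Illustrative sketch (not from the paper): the abstract above describes placing texts from different languages in one shared vector space so that similar meanings sit close together. The snippet below shows only the comparison step, assuming cosine similarity as the proximity measure. The 3-dimensional vectors, the example texts, and the helper names `cosine` and `best_match` are all hypothetical stand-ins; real sentence encoders produce vectors with hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query_vec, candidates):
    """Return (text, score) for the candidate closest to query_vec."""
    scored = ((text, cosine(query_vec, vec)) for text, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical, hand-made embeddings; no real model produced these numbers.
offers = {
    "en: free bottled water at the town hall": [0.9, 0.1, 0.0],
    "es: ofrecemos alojamiento temporal": [0.1, 0.9, 0.1],
}
# Stand-in for an embedded request written in a third language.
request_vec = [0.8, 0.2, 0.1]

text, score = best_match(request_vec, offers)
print(text, round(score, 3))
```

Because both texts live in the same space, the language prefix plays no role in the comparison; only the vectors do.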

arXiv:2403.09349

From Pro, Anti to Informative and Hesitant: An Infoveillance study of COVID-19 vaccines and vaccination discourse on Twitter

Authors: Pardeep Singh, Rabindra Lamsal, Monika Singh, Satish Chand, Bhawna Shishodia

Abstract: The COVID-19 pandemic has brought unprecedented challenges to the world, and vaccination has been a key strategy to combat the disease. Since Twitter is one of the most widely used public microblogging platforms, researchers have analysed COVID-19 vaccines and vaccination Twitter discourse to explore the conversational dynamics around the topic. While contributing to the crisis informatics literature, we curate a large-scale geotagged Twitter dataset, GeoCovaxTweets Extended, and explore the discourse through multiple spatiotemporal analyses. This dataset covers a longer time span of 38 months, from the announcement of the first vaccine to the availability of booster doses. Results show that 43.4% of the collected tweets, although containing phrases and keywords related to vaccines and vaccinations, were unrelated to the COVID-19 context. In total, 23.1% of the discussions on vaccines and vaccinations were classified as Pro, 16% as Hesitant, 11.4% as Anti, and 6.1% as Informative. The trend shifted towards Pro and Informative tweets globally as vaccination programs progressed, indicating a change in the public's perception of COVID-19 vaccines and vaccination. Furthermore, we explored the discourse based on account attributes, i.e., follower counts and tweet counts. Results show a significant pattern of discourse differences. Our findings highlight the potential of harnessing a large-scale geotagged Twitter dataset to understand global public health communication and to inform targeted interventions aimed at addressing vaccine hesitancy.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2309.05494

doi 10.1016/j.knosys.2024.111916

CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Abstract: Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to the effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. The models are publicly available at: https://huggingface.co/crisistransformers

Submitted 11 April, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

Journal ref: Knowledge-Based Systems, 111916 (2024)

arXiv:2302.11136

doi 10.59297/GQED8281

A Twitter narrative of the COVID-19 pandemic in Australia

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Abstract: Social media platforms contain abundant data that can provide comprehensive knowledge of historical and real-time events. During crisis events, the use of social media peaks, as people discuss what they have seen, heard, or felt. Previous studies confirm the usefulness of such socially generated discussions for the public, first responders, and decision-makers to gain a better understanding of events as they unfold at the ground level. This study performs an extensive analysis of COVID-19-related Twitter discussions generated in Australia between January 2020 and October 2022. We explore the Australian Twitterverse by employing state-of-the-art approaches from both supervised and unsupervised domains to perform network analysis, topic modeling, sentiment analysis, and causality analysis. As the presented results provide a comprehensive understanding of the Australian Twitterverse during the COVID-19 pandemic, this study aims to explore the discussion dynamics to aid the development of future automated information systems for epidemic/pandemic management.

Submitted 23 February, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

Comments: Accepted to ISCRAM 2023

arXiv:2301.11284

doi 10.1016/j.dib.2023.109229

BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration

Authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

Abstract: The COVID-19 pandemic introduced new norms such as social distancing, face masks, quarantine, lockdowns, travel restrictions, work/study from home, and business closures, to name a few. The pandemic's seriousness made people vocal on social media, especially on microblogs such as Twitter. Researchers have been collecting and sharing large-scale datasets of COVID-19 tweets since the early days of the outbreak. Sharing raw Twitter data with third parties is restricted; users need to hydrate tweet identifiers in a public dataset to re-create the dataset locally. Large-scale datasets that include original tweets, retweets, quotes, and replies contain billions of tweets, which take months to hydrate. The existing datasets carry issues related to proportion and redundancy. We report that more than 500 million tweet identifiers point to deleted or protected tweets. In order to address these issues, this paper introduces an enriched global billion-scale English-language COVID-19 tweets dataset, BillionCOV, that contains 1.4 billion tweets originating from 240 countries and territories between October 2019 and April 2022. Importantly, BillionCOV lets researchers filter tweet identifiers for efficient hydration. This paper discusses associated methods to fetch raw Twitter data for a set of tweet identifiers, presents multiple tweets' distributions to provide an overview of BillionCOV, and finally, reviews the dataset's potential use cases.

Submitted 19 March, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

arXiv:2301.07378

GeoCovaxTweets: COVID-19 Vaccines and Vaccination-specific Global Geotagged Twitter Conversations

Authors: Pardeep Singh, Rabindra Lamsal, Monika, Satish Chand, Bhawna Shishodia

Abstract: Social media platforms provide actionable information during crises and pandemic outbreaks. The COVID-19 pandemic has imposed a chronic public health crisis worldwide, with experts considering vaccines as the ultimate prevention to achieve herd immunity against the virus. A proportion of people may turn to social media platforms to oppose vaccines and vaccination, hindering government efforts to eradicate the virus. This paper presents the COVID-19 vaccines and vaccination-specific global geotagged tweets dataset, GeoCovaxTweets, that contains more than 1.8 million tweets, with location information and longer temporal coverage, originating from 233 countries and territories between January 2020 and November 2022. The paper discusses the dataset's curation method and how it can be re-created locally, then explores the dataset through multiple tweets' distributions and briefly discusses its potential use cases. We anticipate that the dataset will assist researchers in the crisis computing domain to explore the conversational dynamics of COVID-19 vaccines and vaccination Twitter discourse through numerous spatial and temporal dimensions concerning trends, shifts in opinions, misinformation, and anti-vaccination campaigns.

Submitted 18 January, 2023; originally announced January 2023.

arXiv:2211.16506

doi 10.1109/BigData55660.2022.10020460

Where did you tweet from? Inferring the origin locations of tweets based on contextual information

Authors: Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read

Abstract: Public conversations on Twitter comprise many pertinent topics including disasters, protests, politics, propaganda, sports, climate change, epidemics/pandemic outbreaks, etc., that can have both regional and global aspects. Spatial discourse analysis relies on geographical data. However, today less than 1% of tweets are geotagged, whether with a point location or with bounding place information. A major issue with tweets is that Twitter users can be at location A and exchange conversations specific to location B, which we call the Location A/B problem. The problem is considered solved if location entities can be classified as either origin locations (Location As) or non-origin locations (Location Bs). In this work, we propose a simple yet effective framework--the True Origin Model--to address the problem that uses machine-level natural language understanding to identify tweets that conceivably contain their origin location information. The model achieves promising accuracy at country (80%), state (67%), city (58%), county (56%) and district (64%) levels with support from a Location Extraction Model as basic as the CoNLL-2003-based RoBERTa. We employ a tweet contextualizer (locBERT), which is one of the core components of the proposed model, to investigate multiple tweets' distributions for understanding Twitter users' tweeting behavior in terms of mentioning origin and non-origin locations. We also highlight a major concern with the currently regarded gold standard test set (ground truth) methodology, introduce a new data set, and identify further research avenues for advancing the area.

Submitted 17 November, 2022; originally announced November 2022.

Comments: To appear in Proceedings of the IEEE Big Data Conference 2022

arXiv:2209.07272

doi 10.1145/3524498

Socially Enhanced Situation Awareness from Microblogs using Artificial Intelligence: A Survey

Authors: Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read

Abstract: The rise of social media platforms provides an unbounded, infinitely rich source of aggregate knowledge of the world around us, both historic and real-time, from a human perspective. The greatest challenge we face is how to process and understand this raw and unstructured data, go beyond individual observations and see the "big picture"--the domain of Situation Awareness. We provide an extensive survey of Artificial Intelligence research, focusing on microblog social media data with applications to Situation Awareness, that gives the seminal work and state-of-the-art approaches across six thematic areas: Crime, Disasters, Finance, Physical Environment, Politics, and Health and Population. We provide a novel, unified methodological perspective, identify key results and challenges, and present ongoing research directions.

Submitted 13 September, 2022; originally announced September 2022.

Comments: Accepted to ACM Computing Surveys (CSUR) 2022

arXiv:2206.10471

doi 10.1016/j.asoc.2022.109603

Twitter conversations predict the daily confirmed COVID-19 cases

Authors: Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read

Abstract: As of writing this paper, COVID-19 (Coronavirus disease 2019) has spread to more than 220 countries and territories. Following the outbreak, the pandemic's seriousness has made people more active on social media, especially on microblogging platforms such as Twitter and Weibo. The pandemic-specific discourse has remained on-trend on these platforms for months now. Previous studies have confirmed the contributions of such socially generated conversations towards situational awareness of crisis events. Early forecasts of cases are essential for authorities to estimate the resources needed to cope with the outgrowths of the virus. Therefore, this study attempts to incorporate the public discourse in the design of forecasting models particularly targeted for the steep-hill region of an ongoing wave. We propose a sentiment-involved topic-based latent variables search methodology for designing forecasting models from publicly available Twitter conversations. As a use case, we implement the proposed methodology on Australian COVID-19 daily cases and Twitter conversations generated within the country. Experimental results: (i) show the presence of latent social media variables that Granger-cause the daily COVID-19 confirmed cases, and (ii) confirm that those variables offer additional prediction capability to forecasting models. Further, the results show that the inclusion of social media variables introduces 48.83--51.38% improvements on RMSE over the baseline models. We also release the large-scale COVID-19 specific geotagged global tweets dataset, MegaGeoCOV, to the public, anticipating that geotagged data of this scale would aid in understanding the conversational dynamics of the pandemic through other spatial and temporal contexts.

Submitted 13 September, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: Accepted to Applied Soft Computing

Journal ref: Applied Soft Computing, 109603 (2022)
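
Illustrative sketch (not from the paper): the abstract above reports improvements on RMSE relative to baseline forecasting models. Assuming the usual definition of relative improvement, (baseline RMSE - new RMSE) / baseline RMSE, the calculation looks like the snippet below. The case counts and forecasts are made-up toy numbers chosen only to exercise the arithmetic, and the function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_improvement_pct(baseline_rmse, new_rmse):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Toy daily case counts and two hypothetical forecasts (not real data).
actual_cases = [100, 120, 150, 200]
baseline_forecast = [90, 110, 130, 170]      # e.g. a model without social media variables
augmented_forecast = [95, 115, 140, 185]     # e.g. the same model with extra variables

baseline_error = rmse(actual_cases, baseline_forecast)
augmented_error = rmse(actual_cases, augmented_forecast)
print(round(rmse_improvement_pct(baseline_error, augmented_error), 2))
```

A positive value means the augmented forecast has lower error than the baseline; a negative value would mean the extra variables made the forecast worse.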

arXiv:1810.01878

doi 10.1007/s42452-020-03582-5

Determining Optimal Number of k-Clusters based on Predefined Level-of-Similarity

Authors: Rabindra Lamsal, Shubham Katiyar

Abstract: This paper proposes a centroid-based clustering algorithm which is capable of clustering data-points with n-features, without having to specify the number of clusters to be formed. The core logic behind the algorithm is a similarity measure, which collectively decides whether to assign an incoming data-point to a pre-existing cluster, or create a new cluster and assign the data-point to it. The proposed clustering algorithm is application-specific and is applicable when the need is to perform clustering analysis of a stream of data-points, where the similarity measure between an incoming data-point and the cluster with which it is to be associated is greater than the predefined Level-of-Similarity.

Submitted 21 July, 2019; v1 submitted 3 October, 2018; originally announced October 2018.

Comments: 2 Figures, 3 Equations
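
Illustrative sketch (not the paper's algorithm): the abstract above describes assigning each incoming data-point to an existing cluster when its similarity exceeds a predefined Level-of-Similarity, and opening a new cluster otherwise. The abstract does not state the similarity measure or the centroid update rule, so the snippet below assumes similarity = 1 / (1 + Euclidean distance) and a running-mean centroid; both choices, and the name `stream_cluster`, are assumptions made for illustration.

```python
import math


def similarity(point, centroid):
    """Assumed similarity in (0, 1]: equals 1 when the point sits on the centroid."""
    return 1.0 / (1.0 + math.dist(point, centroid))


def stream_cluster(points, level_of_similarity):
    """Cluster a stream of points without fixing the number of clusters up front."""
    centroids = []  # one running-mean centroid per cluster
    counts = []     # number of points absorbed by each cluster
    labels = []     # cluster index assigned to each incoming point

    for point in points:
        best_idx, best_sim = -1, float("-inf")
        for idx, centroid in enumerate(centroids):
            sim = similarity(point, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim

        if best_idx >= 0 and best_sim > level_of_similarity:
            # Join the closest cluster and move its centroid to the new mean.
            n = counts[best_idx]
            centroids[best_idx] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_idx], point)
            ]
            counts[best_idx] = n + 1
            labels.append(best_idx)
        else:
            # No cluster is similar enough: start a new one at this point.
            centroids.append(list(point))
            counts.append(1)
            labels.append(len(centroids) - 1)

    return labels, centroids


stream = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels, centroids = stream_cluster(stream, level_of_similarity=0.8)
print(labels, len(centroids))
```

Raising the threshold makes the procedure open more, tighter clusters; lowering it merges more points into fewer clusters. The result also depends on the order in which points arrive, which is typical of single-pass stream clustering.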

arXiv:1809.09813

Predicting Outcome of Indian Premier League (IPL) Matches Using Machine Learning

Authors: Rabindra Lamsal, Ayesha Choudhary

Abstract: Cricket, especially the Twenty20 format, has maximum uncertainty, where a single over can completely change the momentum of the game. With millions of people following the Indian Premier League (IPL), developing a model for predicting the outcome of its matches is a real-world problem. A cricket match depends upon various factors, and in this work, the factors which significantly influence the outcome of a Twenty20 cricket match are identified. Each player's performance in the field is considered to find out the overall weight (relative strength) of the teams. A multivariate regression based solution is proposed to calculate points for each player in the league, and the overall weight of a team is computed based on the past performance of the players who have appeared most for the team. Finally, a dataset is modeled based on the identified seven factors which influence the outcome of an IPL match. Six machine learning models were trained and used for predicting the outcome of each 2018 IPL match, 15 minutes before the gameplay, immediately after the toss. Three of the trained models correctly predicted more than 40 matches, with Multilayer Perceptron outperforming all other models with an accuracy of 71.66%.

Submitted 21 September, 2020; v1 submitted 26 September, 2018; originally announced September 2018.

Showing 1–12 of 12 results for author: Lamsal, R