An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Authors: Ahmet Gunduz, Kamer Ali Yuksel, Kareem Darwish, Golara Javadi, Fabio Minazzi, Nicola Sobieski, Sebastien Bratieres

Abstract: Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly in social media, experiences increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS technologies is highly dependent on the quality of the training data, emphasizing the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models to address this critical need for high-quality data. The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. The proposed application aims to streamline the dataset creation process for TTS models through these features, thereby facilitating advancements in voice-based technologies.

Submitted 26 February, 2024; originally announced February 2024.

Comments: 9 Pages, 6 Figures, 4 Tables, LREC-COLING 2024

arXiv:2206.07373 [pdf, other]

NatiQ: An End-to-end Text-to-Speech System for Arabic

Authors: Ahmed Abdelali, Nadir Durrani, Cenk Demiroglu, Fahim Dalvi, Hamdy Mubarak, Kareem Darwish

Abstract: NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both Tacotron-based models (Tacotron-1 and Tacotron-2) and the faster transformer model for generating mel-spectrograms from characters. We concatenated Tacotron-1 with the WaveRNN vocoder, Tacotron-2 with the WaveGlow vocoder, and the ESPnet transformer with the Parallel WaveGAN vocoder to synthesize waveforms from the spectrograms. We used in-house speech data for two voices to train our models: 1) neutral male "Hamza", narrating general content and news, and 2) expressive female "Amina", narrating children's story books. Our best systems achieve an average Mean Opinion Score (MOS) of 4.21 and 4.40 for Amina and Hamza respectively. The objective evaluation of the systems using word and character error rate (WER and CER), as well as the response time measured by real-time factor, favored the end-to-end architecture ESPnet. The NatiQ demo is available online at https://tts.qcri.org

Submitted 16 November, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

arXiv:2111.09574 [pdf, other]

Automatic Expansion and Retargeting of Arabic Offensive Language Training

Authors: Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Younes Samih

Abstract: Rampant use of offensive language on social media led to recent efforts on automatic identification of such language. Though offensive language has general characteristics, attacks on specific entities may exhibit distinct phenomena such as malicious alterations in the spelling of names. In this paper, we present a method for identifying entity specific offensive language. We employ two key insights, namely that replies on Twitter often imply opposition and some accounts are persistent in their offensiveness towards specific targets. Using our methodology, we are able to collect thousands of targeted offensive tweets. We show the efficacy of the approach on Arabic tweets with 13% and 79% relative F1-measure improvement in entity specific offensive language detection when using deep-learning based and support vector machine based classifiers respectively. Further, expanding the training set with automatically identified offensive tweets directed at multiple entities can improve F1-measure by 48%.

Submitted 18 November, 2021; originally announced November 2021.

arXiv:2109.12844 [pdf, other]

News Consumption in Time of Conflict: 2021 Palestinian-Israel War as an Example

Authors: Kareem Darwish

Abstract: This paper examines news consumption in response to a major polarizing event, and we use the May 2021 Israeli-Palestinian conflict as an example. We conduct a detailed analysis of the news consumption of more than eight thousand Twitter users who are either pro-Palestinian or pro-Israeli and authored more than 29 million tweets between January 1 and August 17, 2021. We identified the stance of users using unsupervised stance detection. We observe that users may consume more topically-related content from foreign and less popular sources, because, unlike popular sources, they may reaffirm their views, offer more extreme, hyper-partisan, or sensational content, or provide more in depth coverage of the event. The sudden popularity of such sources may not translate to longer-term or general popularity on other topics.

Submitted 27 September, 2021; originally announced September 2021.

arXiv:2106.06017 [pdf, other]

Cross-lingual Emotion Detection

Authors: Sabit Hassan, Shaden Shaar, Kareem Darwish

Abstract: Emotion detection can provide us with a window into understanding human behavior. Due to the complex dynamics of human emotions, however, constructing annotated datasets to train automated models can be expensive. Thus, we explore the efficacy of cross-lingual approaches that would use data from a source language to build models for emotion detection in a target language. We compare three approaches, namely: i) using inherently multilingual models; ii) translating training data into the target language; and iii) using an automatically tagged parallel corpus. In our study, we consider English as the source language with Arabic and Spanish as target languages. We study the effectiveness of different classification models such as BERT and SVMs trained with different features. Our BERT-based monolingual models that are trained on target language data surpass state-of-the-art (SOTA) by 4% and 5% absolute Jaccard score for Arabic and Spanish respectively. Next, we show that using cross-lingual approaches with English data alone, we can achieve more than 90% and 80% relative effectiveness of the Arabic and Spanish BERT models respectively. Lastly, we use LIME to analyze the challenges of training cross-lingual models for different language pairs.

Submitted 4 May, 2022; v1 submitted 10 June, 2021; originally announced June 2021.

arXiv:2102.10684 [pdf, other]

Pre-Training BERT on Arabic Tweets: Practical Considerations

Authors: Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish, Younes Samih

Abstract: Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation. They also highlight that more data or more training steps do not necessitate better models. Our new models achieve new state-of-the-art results on several downstream tasks. The resulting models are released to the community under the name QARiB.

Submitted 21 February, 2021; originally announced February 2021.

Comments: 6 pages, 5 figures

arXiv:2101.09345 [pdf]

BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets

Authors: Fouzi Harrag, Maria Debbah, Kareem Darwish, Ahmed Abdelali

Abstract: During the last two decades, we have progressively turned to the Internet and social media to find news, entertain conversations and share opinion. Recently, OpenAI has developed a machine learning system called GPT-2 for Generative Pre-trained Transformer-2, which can produce deepfake texts. It can generate blocks of text based on brief writing prompts that look like they were written by humans, facilitating the spread of false or auto-generated text. In line with this progress, and in order to counteract potential dangers, several methods have been proposed for detecting text written by these language models. In this paper, we propose a transfer learning based model that will be able to detect if an Arabic sentence is written by humans or automatically generated by bots. Our dataset is based on tweets from a previous work, which we have crawled and extended using the Twitter API. We used GPT2-Small-Arabic to generate fake Arabic sentences. For evaluation, we compared different recurrent neural network (RNN) word embeddings based baseline models, namely: LSTM, BI-LSTM, GRU and BI-GRU, with a transformer-based model. Our new transfer-learning model has obtained an accuracy up to 98%. To the best of our knowledge, this work is the first study where ARABERT and GPT2 were combined to detect and classify the Arabic auto-generated texts.

Submitted 22 January, 2021; originally announced January 2021.

Journal ref: Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP @ COLING 2020)

arXiv:2011.12631 [pdf, ps, other]

A Panoramic Survey of Natural Language Processing in the Arab World

Authors: Kareem Darwish, Nizar Habash, Mourad Abbas, Hend Al-Khalifa, Huseein T. Al-Natsheh, Samhaa R. El-Beltagy, Houda Bouamor, Karim Bouzoubaa, Violetta Cavalli-Sforza, Wassim El-Hajj, Mustafa Jarrar, Hamdy Mubarak

Abstract: The term natural language refers to any system of symbolic communication (spoken, signed or written) without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NLP) is the sub-field of artificial intelligence (AI) focused on modeling natural languages to build applications such as speech recognition and synthesis, machine translation, optical character recognition (OCR), sentiment analysis (SA), question answering, dialogue systems, etc. NLP is a highly interdisciplinary field with connections to computer science, linguistics, cognitive science, psychology, mathematics and others. Some of the earliest AI applications were in NLP (e.g., machine translation); and the last decade (2010-2020) in particular has witnessed an incredible increase in quality, matched with a rise in public awareness, use, and expectations of what may have seemed like science fiction in the past. NLP researchers pride themselves on developing language independent models and tools that can be applied to all human languages, e.g. machine translation systems can be built for a variety of languages using the same basic mechanisms and models. However, the reality is that some languages do get more attention (e.g., English and Chinese) than others (e.g., Hindi and Swahili). Arabic, the primary language of the Arab world and the religious language of millions of non-Arab Muslims is somewhere in the middle of this continuum.
Though Arabic NLP has many challenges, it has seen many successes and developments. Next we discuss Arabic's main challenges as a necessary background, and we present a brief history of Arabic NLP. We then survey a number of its research areas, and close with a critical discussion of the future of Arabic NLP.

Submitted 27 September, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

arXiv:2007.09655 [pdf, other]

Political Framing: US COVID19 Blame Game

Authors: Chereen Shurafa, Kareem Darwish, Wajdi Zaghouani

Abstract: Through the use of Twitter, framing has become a prominent presidential campaign tool for politically active users. Framing is used to influence thoughts by evoking a particular perspective on an event. In this paper, we show that the COVID19 pandemic rather than being viewed as a public health issue, political rhetoric surrounding it is mostly shaped through a blame frame (blame Trump, China, or conspiracies) and a support frame (support candidates) backing the agenda of Republican and Democratic users in the lead up to the 2020 presidential campaign. We elucidate the divergences between supporters of both parties on Twitter via the use of frames. Additionally, we show how framing is used to positively or negatively reinforce users' thoughts. We look at how Twitter can efficiently be used to identify frames for topics through a reproducible pipeline.

Submitted 19 July, 2020; originally announced July 2020.

Comments: Social Informatics 2020 (SocInfo2020)

arXiv:2007.07996 [pdf, other]

Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms

Authors: Firoj Alam, Fahim Dalvi, Shaden Shaar, Nadir Durrani, Hamdy Mubarak, Alex Nikolov, Giovanni Da San Martino, Ahmed Abdelali, Hassan Sajjad, Kareem Darwish, Preslav Nakov

Abstract: With the outbreak of the COVID-19 pandemic, people turned to social media to read and to share timely information including statistics, warnings, advice, and inspirational stories. Unfortunately, alongside all this useful information, there was also a new blending of medical and political misinformation and disinformation, which gave rise to the first global infodemic. While fighting this infodemic is typically thought of in terms of factuality, the problem is much broader as malicious content includes not only fake news, rumors, and conspiracy theories, but also promotion of fake cures, panic, racism, xenophobia, and mistrust in the authorities, among others. This is a complex problem that needs a holistic approach combining the perspectives of journalists, fact-checkers, policymakers, government entities, social media platforms, and society as a whole. Taking them into account we define an annotation schema and detailed annotation instructions, which reflect these perspectives. We performed initial annotations using this schema, and our initial experiments demonstrated sizable improvements over the baselines. Now, we issue a call to arms to the research community and beyond to join the fight by supporting our crowdsourcing annotation efforts.

Submitted 9 April, 2021; v1 submitted 15 July, 2020; originally announced July 2020.

Comments: COVID-19, Infodemic, Disinformation, Misinformation, Fake News, Call to Arms, Crowdsourcing Annotations

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2005.09649 [pdf, other]

Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey

Authors: Ammar Rashed, Mucahid Kutlu, Kareem Darwish, Tamer Elsayed, Cansın Bayrak

Abstract: On June 24, 2018, Turkey conducted a highly consequential election in which the Turkish people elected their president and parliament in the first election under a new presidential system. During the election period, the Turkish people extensively shared their political opinions on Twitter. One aspect of polarization among the electorate was support for or opposition to the reelection of Recep Tayyip Erdoğan. In this paper, we present an unsupervised method for target-specific stance detection in a polarized setting, specifically Turkish politics, achieving 90% precision in identifying user stances, while maintaining more than 80% recall. The method involves representing users in an embedding space using Google's Convolutional Neural Network (CNN) based multilingual universal sentence encoder. The representations are then projected onto a lower dimensional space in a manner that reflects similarities and are consequently clustered. We show the effectiveness of our method in properly clustering users of divergent groups across multiple targets that include political figures, different groups, and parties. We perform our analysis on a large dataset of 108M Turkish election-related tweets along with the timeline tweets of 168k Turkish users, who authored 213M tweets. Given the resultant user stances, we are able to observe correlations between topics and compute topic polarization.

Submitted 24 February, 2022; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: arXiv admin note: text overlap with arXiv:1909.10213

Journal ref: ICWSM, vol. 15, no. 1, pp. 537-548, May 2021

arXiv:2005.06557 [pdf, other]

Arabic Dialect Identification in the Wild

Authors: Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Sabit Hassan, Kareem Darwish

Abstract: We present QADI, an automatically collected dataset of tweets belonging to a wide range of country-level Arabic dialects, covering 18 different countries in the Middle East and North Africa region. Our method for building this dataset relies on applying multiple filters to identify users who belong to different countries based on their account descriptions and to eliminate tweets that are either written in Modern Standard Arabic or contain inappropriate language. The resultant dataset contains 540k tweets from 2,525 users who are evenly distributed across 18 Arab countries. Using intrinsic evaluation, we show that the labels of a set of randomly selected tweets are 91.5% accurate. For extrinsic evaluation, we are able to build effective country-level dialect identification on tweets with a macro-averaged F1-score of 60.6% across 18 classes.

Submitted 15 May, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

Comments: 13 pages, 7 figures, 4 tables

arXiv:2005.00033 [pdf, other]

Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society

Authors: Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Sajjad, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Nadir Durrani, Kareem Darwish, Abdulaziz Al-Homaid, Wajdi Zaghouani, Tommaso Caselli, Gijs Danoe, Friso Stolk, Britt Bruntink, Preslav Nakov

Abstract: With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that (i) focuses on COVID-19, (ii) combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and (iii) covers Arabic, Bulgarian, Dutch, and English. Finally, we show strong evaluation results using pretrained Transformers, thus confirming the practical utility of the dataset in monolingual vs. multilingual, and single task vs. multitask settings.

Submitted 22 September, 2021; v1 submitted 30 April, 2020; originally announced May 2020.

Comments: disinformation, misinformation, factuality, fact-checking, fact-checkers, check-worthiness, Social Media Platforms, COVID-19, social media

MSC Class: 68T50 ACM Class: I.2; I.2.7

Journal ref: EMNLP-2021 (Findings)

arXiv:2004.03485 [pdf, other]

A Few Topical Tweets are Enough for Effective User-Level Stance Detection

Authors: Younes Samih, Kareem Darwish

Abstract: Stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can yield very high accuracy (+98%). However, such methods perform poorly or fail completely for less vocal users, who may have authored only a few tweets about a target. In this paper, we tackle stance detection for such users using two approaches. In the first approach, we improve user-level stance detection by representing tweets using contextualized embeddings, which capture latent meanings of words in context. We show that this approach outperforms two strong baselines and achieves 89.6% accuracy and 91.3% macro F-measure on eight controversial topics. In the second approach, we expand the tweets of a given user using their Twitter timeline tweets, and then we perform unsupervised classification of the user, which entails clustering a user with other users in the training set. This approach achieves 95.6% accuracy and 93.1% macro F-measure.

Submitted 7 April, 2020; originally announced April 2020.

arXiv:2004.02192 [pdf, other]

Arabic Offensive Language on Twitter: Analysis and Experiments

Authors: Hamdy Mubarak, Ammar Rashed, Kareem Darwish, Younes Samih, Ahmed Abdelali

Abstract: Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech. We thoroughly analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct many experiments to produce strong results (F1 = 83.2) on the dataset using SOTA techniques.

Submitted 9 March, 2021; v1 submitted 5 April, 2020; originally announced April 2020.

Comments: 10 pages, 6 figures, 3 tables

arXiv:2002.01207 [pdf, other]

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model

Authors: Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Mohamed Eldesouki

Abstract: Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% and 4.3% for MSA and CA respectively. This highlights the effectiveness of feature engineering for such deep neural models.

Submitted 4 February, 2020; originally announced February 2020.

arXiv:2001.02125 [pdf, other]

Quantifying Polarization on Twitter: the Kavanaugh Nomination

Authors: Kareem Darwish

Abstract: This paper addresses polarization quantification, particularly as it pertains to the nomination of Brett Kavanaugh to the US Supreme Court and his subsequent confirmation with the narrowest margin since 1881. Republican (GOP) and Democratic (DNC) senators voted overwhelmingly along party lines. In this paper, we examine political polarization concerning the nomination among Twitter users. To do so, we accurately identify the stance of more than 128 thousand Twitter users towards Kavanaugh's nomination using both semi-supervised and supervised classification. Next, we quantify the polarization between the different groups in terms of who they retweet and which hashtags they use. We modify existing polarization quantification measures to make them more efficient and more effective. We also characterize the polarization between users who supported and opposed the nomination.

Submitted 5 January, 2020; originally announced January 2020.

Comments: 13 pages, 4 figures, 5 tables. International Conference on Social Informatics. Springer, Cham, 2019. arXiv admin note: substantial text overlap with arXiv:1810.06687

ACM Class: J.4

arXiv:1910.02028 [pdf, other]

Tanbih: Get To Know What You Are Reading

Authors: Yifan Zhang, Giovanni Da San Martino, Alberto Barrón-Cedeño, Salvatore Romeo, Jisun An, Haewoon Kwak, Todor Staykovski, Israa Jaradat, Georgi Karadzhov, Ramy Baly, Kareem Darwish, James Glass, Preslav Nakov

Abstract: We introduce Tanbih, a news aggregator with intelligent analysis tools to help readers understand what's behind a news story. Our system displays news grouped into events and generates media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame of reporting, and stance with respect to various claims and topics of a news outlet. In addition, we automatically analyse each article to detect whether it is propagandistic and to determine its stance with respect to a number of controversial topics.

Submitted 4 October, 2019; originally announced October 2019.

MSC Class: 68T50 ACM Class: I.2.7

Journal ref: EMNLP-2019

arXiv:1909.10213 [pdf, ps, other]

Embedding-based Qualitative Analysis of Polarization in Turkey

Authors: Mucahid Kutlu, Kareem Darwish, Cansin Bayrak, Ammar Rashed, Tamer Elsayed

Abstract: On June 24, 2018, Turkey conducted a highly consequential election in which the Turkish people elected their president and parliament in the first election under a new presidential system. During the election period, the Turkish people extensively shared their political opinions on Twitter. One axis of polarization among the electorate was support for or opposition to the reelection of Recep Tayyip Erdogan. In this paper, we explore the polarization between the two groups on their political opinions and lifestyle, and examine whether polarization had increased in the lead up to the election. We conduct our analysis on two collected datasets covering the time periods before and during the election period that we split into pro- and anti-Erdogan groups. For the pro and anti splits of both datasets, we generate separate word embedding models, and then use the four generated models to contrast the neighborhood (in the embedding space) of the political leaders, political issues, and lifestyle choices (e.g., beverages, food, and vacation). Our analysis shows that the two groups agree on some topics, such as terrorism and organizations threatening the country, but disagree on others, such as refugees and lifestyle choices. Polarization towards party leaders is more pronounced, and polarization further increased during the election time.

Submitted 23 September, 2019; originally announced September 2019.

arXiv:1907.01260 [pdf, other]

Predicting the Topical Stance of Media and Popular Twitter Users

Authors: Peter Stefanov, Kareem Darwish, Atanas Atanasov, Preslav Nakov

Abstract: Discovering the stances of media outlets and influential people on current, debatable topics is important for social statisticians and policy makers. Many supervised solutions exist for determining viewpoints, but manually annotating training data is costly. In this paper, we propose a cascaded method that uses unsupervised learning to ascertain the stance of Twitter users with respect to a polarizing topic by leveraging their retweet behavior; then, it uses supervised learning based on user labels to characterize both the general political leaning of online media and of popular Twitter users, as well as their stance with respect to the target polarizing topic. We evaluate the model by comparing its predictions to gold labels from the Media Bias/Fact Check website, achieving 82.6% accuracy.

Submitted 21 May, 2020; v1 submitted 2 July, 2019; originally announced July 2019.

MSC Class: 91D30

arXiv:1904.02000 [pdf, other]

Unsupervised User Stance Detection on Twitter

Authors: Kareem Darwish, Peter Stefanov, Michaël Aupetit, Preslav Nakov

Abstract: We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our framework has three major advantages over pre-existing methods, which are based on supervised or semi-supervised classification. First, we do not require any prior labeling of users: instead, we create clusters, which are much easier to label manually afterwards, e.g., in a matter of seconds or minutes instead of hours. Second, there is no need for domain- or topic-level knowledge either to specify the relevant stances (labels) or to conduct the actual labeling. Third, our framework is robust in the face of data skewness, e.g., when some users or some stances have greater representation in the data. We experiment with different combinations of user similarity features, dataset sizes, dimensionality reduction methods, and clustering algorithms to ascertain the most effective and most computationally efficient combinations across three different datasets (in English and Turkish). We further verified our results on additional tweet sets covering six different controversial topics. Our best combination in terms of effectiveness and efficiency uses retweeted accounts as features, UMAP for dimensionality reduction, and Mean Shift for clustering, and yields a small number of high-quality user clusters, typically just 2-3, with more than 98% purity. The resulting user clusters can be used to train downstream classifiers. Moreover, our framework is robust to variations in the hyper-parameter values and also with respect to random initialization.

Submitted 21 May, 2020; v1 submitted 3 April, 2019; originally announced April 2019.

MSC Class: 62P25; 91D30

arXiv:1810.06687 [pdf, other]

To Kavanaugh or Not to Kavanaugh: That is the Polarizing Question

Authors: Kareem Darwish

Abstract: On October 6, 2018, the US Senate confirmed Brett Kavanaugh with the narrowest margin for a successful confirmation since 1881, with senators voting overwhelmingly along party lines. In this paper, we examine whether the political polarization in the Senate is reflected among the general public. To do so, we analyze the views of more than 128 thousand Twitter users. We show that users supporting or opposing Kavanaugh's nomination were generally using divergent hashtags, retweeting different Twitter accounts, and sharing links from different websites. We also examine characteristics of both groups.

Submitted 15 October, 2018; originally announced October 2018.

arXiv:1810.06619 [pdf, other]

Diacritization of Maghrebi Arabic Sub-Dialects

Authors: Ahmed Abdelali, Mohammed Attia, Younes Samih, Kareem Darwish, Hamdy Mubarak

Abstract: The diacritization process attempts to restore the short vowels in written Arabic text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion's share, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we present our contribution and results on the automatic diacritization of two sub-dialects of Maghrebi Arabic, namely Tunisian and Moroccan, using a character-level deep neural network architecture that stacks two bi-LSTM layers over a CRF output layer. The model achieves a word error rate of 2.7% and 3.6% for Moroccan and Tunisian respectively and is capable of implicitly identifying the sub-dialect of the input.

Submitted 30 May, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

Comments: 6 pages, 3 figures

arXiv:1807.06655 [pdf, other]

Devam vs. Tamam: 2018 Turkish Elections

Authors: Mucahid Kutlu, Kareem Darwish, Tamer Elsayed

Abstract: On June 24, 2018, Turkey held a historic election, transforming its parliamentary system to a presidential one. One of the main questions for Turkish voters was whether or not to start this new political era by reelecting its long-time political leader Recep Tayyip Erdogan. In this paper, we analyzed 108M tweets posted in the two months leading to the election to understand the groups that supported or opposed Erdogan's reelection. We examined the most distinguishing hashtags and retweeted accounts for both groups. Our findings indicate strong polarization between both groups as they differ in terms of ideology, news sources they follow, and preferred TV entertainment.

Submitted 17 July, 2018; originally announced July 2018.

arXiv:1708.05891 [pdf, other]

Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM

Authors: Mohamed Eldesouki, Younes Samih, Ahmed Abdelali, Mohammed Attia, Hamdy Mubarak, Kareem Darwish, Laura Kallmeyer

Abstract: Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the problem as a ranking problem, where an SVM ranker picks the best segmentation, and as a sequence labeling problem, where a bi-LSTM RNN coupled with CRF determines where best to segment words. We are able to achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.

Submitted 19 August, 2017; originally announced August 2017.

arXiv:1707.07276 [pdf, other]

Seminar Users in the Arabic Twitter Sphere

Authors: Kareem Darwish, Dimitar Alexandrov, Preslav Nakov, Yelena Mejova

Abstract: We introduce the notion of "seminar users", who are social media users engaged in propaganda in support of a political entity. We develop a framework that can identify such users with 84.4% precision and 76.1% recall. While our dataset is from the Arab region, omitting language-specific features has only a minor impact on classification performance, and thus, our approach could work for detecting seminar users in other parts of the world and in other languages. We further explored a controversial political topic to observe the prevalence and potential potency of such users. In our case study, we found that 25% of the users engaged in the topic are in fact seminar users and their tweets make up nearly a third of the on-topic tweets. Moreover, they are often successful in affecting mainstream discourse with coordinated hashtag campaigns.

Submitted 23 July, 2017; originally announced July 2017.

Comments: to appear in SocInfo 2017

arXiv:1707.03375 [pdf, other]

Trump vs. Hillary: What went Viral during the 2016 US Presidential Election

Authors: Kareem Darwish, Walid Magdy, Tahar Zanouda

Abstract: In this paper, we present quantitative and qualitative analysis of the top retweeted tweets (viral tweets) pertaining to the US presidential elections from September 1, 2016 to Election Day on November 8, 2016. For every day, we tagged the top 50 most retweeted tweets as supporting or attacking either candidate or as neutral/irrelevant. Then we analyzed the tweets in each class for: general trends and statistics; the most frequently used hashtags, terms, and locations; the most retweeted accounts and tweets; and the most shared news and links. In all, we analyzed the 3,450 most viral tweets that grabbed the most attention during the US election and were retweeted in total 26.3 million times, accounting for over 40% of the total tweet volume pertaining to the US election in the aforementioned period. Our analysis of the tweets highlights some of the differences between the social media strategies of both candidates, the penetration of their messages, and the potential effect of attacks on both

Submitted 11 July, 2017; originally announced July 2017.

Comments: Paper to appear in Springer SocInfo 2017

arXiv:1707.02591 [pdf, other]

Flexible human-robot cooperation models for assisted shop-floor tasks

Authors: Kourosh Darwish, Francesco Wanderlingh, Barbara Bruno, Enrico Simetti, Fulvio Mastrogiovanni, Giuseppe Casalino

Abstract: The Industry 4.0 paradigm emphasizes the crucial benefits that collaborative robots, i.e., robots able to work alongside and together with humans, could bring to the whole production process. In this context, an enabling technology yet unreached is the design of flexible robots able to deal at all levels with humans' intrinsic variability, which is not only a necessary element for a comfortable working experience for the person but also a precious capability for efficiently dealing with unexpected events. In this paper, a sensing, representation, planning and control architecture for flexible human-robot cooperation, referred to as FlexHRC, is proposed. FlexHRC relies on wearable sensors for human action recognition, AND/OR graphs for the representation of and reasoning upon cooperation models, and a Task Priority framework to decouple action planning from robot motion planning and control.

Submitted 9 July, 2017; originally announced July 2017.

Comments: Submitted to Mechatronics (Elsevier)

MSC Class: 68T40

arXiv:1610.01655 [pdf, other]

Trump vs. Hillary Analyzing Viral Tweets during US Presidential Elections 2016

Authors: Walid Magdy, Kareem Darwish

Abstract: In this paper, we provide quantitative and qualitative analyses of the viral tweets related to the US presidential election. In our study, we focus on analyzing the 50 most retweeted tweets for every day during September and October 2016. The resulting set is composed of 3,050 viral tweets, and they were retweeted over 20.5 million times. We manually annotated the tweets as favorable to Trump, Clinton, or neither. Our quantitative study shows that tweets favoring Trump were usually retweeted more than pro-Clinton tweets, with the exception of a few days in September and two days in October, especially the day following the first presidential debate and following the release of the Access Hollywood tape. On two days in October 2016, pro-Trump tweet volume accounted for more than 90% of the total tweet volume.

Submitted 3 November, 2016; v1 submitted 5 October, 2016; originally announced October 2016.

Comments: In version 2, analysis of viral tweets of October 2016 is added to the paper

arXiv:1512.04570 [pdf, other]

Quantifying Public Response towards Islam on Twitter after Paris Attacks

Authors: Walid Magdy, Kareem Darwish, Norah Abokhodair

Abstract: The Paris terrorist attacks that occurred on November 13, 2015 prompted a massive response on social media including Twitter, with millions of posted tweets in the first few hours after the attacks. Most of the tweets were condemning the attacks and showing support to Parisians. One of the trending debates related to the attacks concerned possible association between terrorism and Islam and Muslims in general. This created a global discussion between those attacking and those defending Islam and Muslims. In this paper, we provide quantitative and qualitative analysis of a data collection that we streamed from Twitter starting 7 hours after the Paris attacks and for 50 subsequent hours, covering tweets related to blaming Islam and Muslims and to defending them. We collected a set of 8.36 million tweets in this epoch consisting of tweets in many different languages. We could identify a subset consisting of 900K tweets relating to Islam and Muslims. Using sampling methods and crowd-sourcing annotation, we managed to estimate the public response of these tweets. Our findings show that the majority of the tweets were in fact defending Muslims and absolving them from responsibility for the attacks. However, a considerable number of tweets were blaming Muslims, with most of these tweets coming from western countries such as the Netherlands, France, and the US.

Submitted 14 December, 2015; originally announced December 2015.

Comments: 9 pages, 5 figures

ACM Class: J.4; K.4.2

arXiv:1512.04310 [pdf, other]

Attitudes towards Refugees in Light of the Paris Attacks

Authors: Kareem Darwish, Walid Magdy

Abstract: The Paris attacks prompted a massive response on social media including Twitter. This paper explores the immediate response of English speakers on Twitter towards Middle Eastern refugees in Europe. We show that antagonism towards refugees is mostly coming from the United States and is mostly partisan.

Submitted 15 December, 2015; v1 submitted 14 December, 2015; originally announced December 2015.

Comments: 3 pages, 1 table, and 2 figures

ACM Class: J.4; K.4.2

arXiv:1503.02401 [pdf, other]

#FailedRevolutions: Using Twitter to Study the Antecedents of ISIS Support

Authors: Walid Magdy, Kareem Darwish, Ingmar Weber

Abstract: Within a fairly short amount of time, the Islamic State of Iraq and Syria (ISIS) has managed to put large swaths of land in Syria and Iraq under their control. To many observers, the sheer speed at which this "state" was established was dumbfounding. To better understand the roots of this organization and its supporters we present a study using data from Twitter. We start by collecting large amounts of Arabic tweets referring to ISIS and classify them into pro-ISIS and anti-ISIS. This classification turns out to be easily done simply using the name variants used to refer to the organization: the full name and the description as "state" are associated with support, whereas abbreviations usually indicate opposition. We then "go back in time" by analyzing the historic timelines of both users supporting and opposing and look at their pre-ISIS period to gain insights into the antecedents of support. To achieve this, we build a classifier using pre-ISIS data to "predict", in retrospect, who will support or oppose the group. The key story that emerges is one of frustration with failed Arab Spring revolutions. ISIS supporters largely differ from ISIS opposition in that they refer a lot more to Arab Spring uprisings that failed. We also find temporal patterns in the support and opposition which seem to be linked to major news, such as reported territorial gains, reports on gruesome acts of violence, and reports on airstrikes and foreign intervention.

Submitted 9 March, 2015; originally announced March 2015.

Comments: Submitted to ICWSM 2015

arXiv:1410.3097 [pdf, other]

Content and Network Dynamics Behind Egyptian Political Polarization on Twitter

Authors: Javier Borge-Holthoefer, Walid Magdy, Kareem Darwish, Ingmar Weber

Abstract: There is little doubt about whether social networks play a role in modern protests. This agreement has triggered an entire research avenue, in which social structure and content analysis have been central but are typically exploited separately. Here, we combine these two approaches to shed light on the opinion evolution dynamics in Egypt during the summer of 2013 along two axes (Islamist/Secularist, pro/anti-military intervention). We intend to find traces of opinion changes in Egypt's population, paralleling those in the international community, which oscillated from sympathetic to condemnatory as civil clashes grew. We find little evidence of people "switching" sides, along with clear changes in volume in both pro- and anti-military camps. Our work contributes new insights into the dynamics of large protest movements, especially in the aftermath of the main events, a period that has received little attention previously. It questions the standard narrative concerning a simplistic mapping between Secularist/pro-military and Islamist/anti-military. Finally, our conclusions provide empirical validation to sociological models regarding the behavior of individuals in conflictive contexts.

Submitted 12 October, 2014; originally announced October 2014.

Comments: To appear in the Proceedings of the 18th Conference on Computer-Supported Cooperative Work and Social Computing CSCW (2015)

arXiv:1306.6755 [pdf, ps, other]

Arabizi Detection and Conversion to Arabic

Authors: Kareem Darwish

Abstract: Arabizi is Arabic text that is written using Latin characters. Arabizi is used to present both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in informal settings such as social networking sites and is often mixed with English. In this paper we address the problems of identifying Arabizi in text and converting it to Arabic characters. We used word and sequence-level features to identify Arabizi that is mixed with English. We achieved an identification accuracy of 98.5%. As for conversion, we used transliteration mining with language modeling to generate equivalent Arabic text. We achieved 88.7% conversion accuracy, with roughly a third of errors being spelling and morphological variants of the forms in ground truth.

Submitted 28 June, 2013; originally announced June 2013.

ACM Class: I.2.7
