Search | arXiv e-print repository

doi 10.1609/icwsm.v18i1.31366

Topic Shifts as a Proxy for Assessing Politicization in Social Media

Authors: Marcelo Sartori Locatelli, Pedro Calais, Matheus Prado Miranda, João Pedro Junho, Tomas Lacerda Muniz, Wagner Meira Jr., Virgilio Almeida

Abstract: Politicization is a social phenomenon studied by political science characterized by the extent to which ideas and facts are given a political tone. A range of topics, such as climate change, religion and vaccines, has been subject to increasing politicization in the media and social media platforms. In this work, we propose a computational method for assessing politicization in online conversations based on topic shifts, i.e., the degree to which people switch topics in online conversations. The intuition is that topic shifts from a non-political topic to politics are a direct measure of politicization -- making something political, and that the more people switch conversations to politics, the more they perceive politics as playing a vital role in their daily lives. A fundamental challenge that must be addressed when one studies politicization in social media is that, a priori, any topic may be politicized. Hence, any keyword-based method or even machine learning approaches that rely on topic labels to classify topics are expensive to run and potentially ineffective. Instead, we learn from a seed of political keywords and use Positive-Unlabeled (PU) Learning to detect political comments in reaction to non-political news articles posted on Twitter, YouTube, and TikTok during the 2022 Brazilian presidential elections. Our findings indicate that all platforms show evidence of politicization, as discussion around topics adjacent to politics such as economy, crime and drugs tends to shift to politics. Even for the least politicized topics, the rate at which discussion shifts to politics increased in the lead-up to the elections and after other political events in Brazil -- evidence of politicization.

Submitted 13 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: 12 pages, 6 figures, accepted for the 18th AAAI International Conference on Web and Social Media (ICWSM-2024)

Journal ref: Topic Shifts as a Proxy for Assessing Politicization in Social Media. In: Proceedings of the International AAAI Conference on Web and Social Media. 2024. p. 972-984

arXiv:2212.04551

doi 10.1109/SBAC-PAD55451.2022.00022

Efficient Strategies for Graph Pattern Mining Algorithms on GPUs

Authors: Samuel Ferraz, Vinicius Dias, Carlos H. C. Teixeira, George Teodoro, Wagner Meira Jr

Abstract: Graph Pattern Mining (GPM) is an important, rapidly evolving, and computation demanding area. GPM computation relies on subgraph enumeration, which consists in extracting subgraphs that match a given property from an input graph. Graphics Processing Units (GPUs) have been an effective platform to accelerate applications in many areas. However, the irregularity of subgraph enumeration makes it challenging for efficient execution on GPU due to typical uncoalesced memory access, divergence, and load imbalance. Unfortunately, these aspects have not been fully addressed in previous work. Thus, this work proposes novel strategies to design and implement subgraph enumeration efficiently on GPU. We support a depth-first search style search (DFS-wide) that maximizes memory performance while providing enough parallelism to be exploited by the GPU, along with a warp-centric design that minimizes execution divergence and improves utilization of the computing capabilities. We also propose a low-cost load balancing layer to avoid idleness and redistribute work among thread warps in a GPU. Our strategies have been deployed in a system named DuMato, which provides a simple programming interface to allow efficient implementation of GPM algorithms. Our evaluation has shown that DuMato is often an order of magnitude faster than state-of-the-art GPM systems and can mine larger subgraphs (up to 12 vertices).

Submitted 8 December, 2022; originally announced December 2022.

Comments: Accepted for publication on IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'22)

arXiv:2208.01509

doi 10.1145/3511095.3531283

Characterizing Vaccination Movements on YouTube in the United States and Brazil

Authors: Marcelo Sartori Locatelli, Josemar Caetano, Wagner Meira Jr., Virgilio Almeida

Abstract: In the context of the COVID-19 pandemic, social networks such as Twitter and YouTube stand out as important sources of information. YouTube, as the largest and most engaging online media consumption platform, has a large influence in the spread of information and misinformation, which makes it important to study how it deals with the problems that arise from disinformation, as well as how its users interact with different types of content. Considering that the United States (USA) and Brazil (BR) are two countries with the highest COVID-19 death tolls, we asked the following question: What are the nuances of vaccination campaigns in the two countries? With that in mind, we engage in a comparative analysis of pro and anti-vaccine movements on YouTube. We also investigate the role of YouTube in countering online vaccine misinformation in USA and BR. To this end, we monitored the removal of vaccine related content on the platform and also applied various techniques to analyze the differences in discourse and engagement in pro and anti-vaccine "comment sections". We found that American anti-vaccine content tends to lead to considerably more toxic and negative discussion than its pro-vaccine counterpart while also leading to 18% higher user-user engagement, while Brazilian anti-vaccine content was significantly less engaging. We also found that anti-vaccine and pro-vaccine discourses are considerably different, as the former is associated with conspiracy theories (e.g. ccp), misinformation and alternative medicine (e.g. hydroxychloroquine), while the latter is associated with protective measures. Finally, it was observed that YouTube content removals are still insufficient, with only approximately 16% of the anti-vaccine content being removed by the end of the studied period, with the USA registering the highest percentage of removed anti-vaccine content (34%) and BR registering the lowest (9.8%).

Submitted 2 August, 2022; originally announced August 2022.

Comments: Accepted at ACM HT 2022, 15 pages, 7 figures

Journal ref: Proceedings of the 33rd ACM Conference on Hypertext and Social Media. 2022. p. 80-90

arXiv:2105.07523

Analyzing the "Sleeping Giants" Activism Model in Brazil

Authors: Bárbara Gomes Ribeiro, Manoel Horta Ribeiro, Virgílio Almeida, Wagner Meira Jr

Abstract: In 2020, amidst the COVID pandemic and a polarized political climate, the Sleeping Giants online activist movement gained traction in Brazil. Its rationale was simple: to curb the spread of misinformation by harming the advertising revenue of sources that produce this type of content. Like its international counterparts, Sleeping Giants Brasil (SGB) campaigned against media outlets using Twitter to ask companies to remove ads from the targeted outlets. This work presents a thorough quantitative characterization of this activism model, analyzing the three campaigns carried out by SGB between May and September 2020. To do so, we use digital traces from both Twitter and Google Trends, toxicity and sentiment classifiers trained for the Portuguese language, and an annotated corpus of SGB's tweets. Our key findings were threefold. First, we found that SGB's requests to companies were largely successful (with 83.85% of all 192 targeted companies responding positively) and that user pressure was correlated to the speed of companies' responses. Second, there were no significant changes in the online attention and the user engagement going towards the targeted media outlets in the six months that followed SGB's campaign (as measured by Google Trends and Twitter engagement). Third, we observed that user interactions with companies changed only transiently, even if the companies did not respond to SGB's request. Overall, our results paint a nuanced portrait of internet activism. On the one hand, they suggest that SGB was successful in getting companies to boycott specific media outlets, which may have harmed their advertisement revenue stream. On the other hand, they also suggest that the activist movement did not impact the online attention these media outlets received nor the online image of companies that did not respond positively to their requests.

Submitted 25 February, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

arXiv:2012.03879

Sequential Stratified Regeneration: MCMC for Large State Spaces with an Application to Subgraph Count Estimation

Authors: Carlos H. C. Teixeira, Mayank Kakodkar, Vinícius Dias, Wagner Meira Jr., Bruno Ribeiro

Abstract: This work considers the general task of estimating the sum of a bounded function over the edges of a graph, given neighborhood query access and where access to the entire network is prohibitively expensive. To estimate this sum, prior work proposes Markov chain Monte Carlo (MCMC) methods that use random walks started at some seed vertex and whose equilibrium distribution is the uniform distribution over all edges, eliminating the need to iterate over all edges. Unfortunately, these existing estimators are not scalable to massive real-world graphs. In this paper, we introduce Ripple, an MCMC-based estimator that achieves unprecedented scalability by stratifying the Markov chain state space into ordered strata with a new technique that we denote sequential stratified regenerations. We show that the Ripple estimator is consistent, highly parallelizable, and scales well. We empirically evaluate our method by applying Ripple to the task of estimating connected, induced subgraph counts given some input graph. Therein, we demonstrate that Ripple is accurate and can estimate counts of up to 12-node subgraphs, which is a task at a scale that has been considered unreachable, not only by prior MCMC-based methods but also by other sampling approaches. For instance, in this target application, we present results in which the Markov chain state space is as large as $10^{43}$, for which Ripple computes estimates in less than 4 hours, on average.

Submitted 8 April, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Comments: Markov Chain Monte Carlo, Random Walk, Regenerative Sampling, Motif Analysis, Subgraph Counting, Graph Mining

arXiv:1908.08313

Auditing Radicalization Pathways on YouTube

Authors: Manoel Horta Ribeiro, Raphael Ottoni, Robert West, Virgílio A. F. Almeida, Wagner Meira

Abstract: Non-profits, as well as the media, have hypothesized the existence of a radicalization pipeline on YouTube, claiming that users systematically progress towards more extreme content on the platform. Yet, there is to date no substantial quantitative evidence of this alleged pipeline. To close this gap, we conduct a large-scale audit of user radicalization on YouTube. We analyze 330,925 videos posted on 349 channels, which we broadly classified into four types: Media, the Alt-lite, the Intellectual Dark Web (I.D.W.), and the Alt-right. According to the aforementioned radicalization hypothesis, channels in the I.D.W. and the Alt-lite serve as gateways to fringe far-right ideology, here represented by Alt-right channels. Processing 72M+ comments, we show that the three channel types indeed increasingly share the same user base; that users consistently migrate from milder to more extreme content; and that a large percentage of users who consume Alt-right content now consumed Alt-lite and I.D.W. content in the past. We also probe YouTube's recommendation algorithm, looking at more than 2M video and channel recommendations between May/July 2019. We find that Alt-lite content is easily reachable from I.D.W. channels, while Alt-right videos are reachable only through channel recommendations. Overall, we paint a comprehensive picture of user radicalization on YouTube.

Submitted 21 October, 2021; v1 submitted 22 August, 2019; originally announced August 2019.

Comments: 10 pages plus appendices

arXiv:1904.01949

doi 10.1038/s41467-020-15432-4

Automatic diagnosis of the 12-lead ECG using a deep neural network

Authors: Antônio H. Ribeiro, Manoel Horta Ribeiro, Gabriela M. M. Paixão, Derick M. Oliveira, Paulo R. Gomes, Jéssica A. Canazart, Milton P. S. Ferreira, Carl R. Andersson, Peter W. Macfarlane, Wagner Meira Jr., Thomas B. Schön, Antonio Luiz P. Ribeiro

Abstract: The role of automatic electrocardiogram (ECG) analysis in clinical practice is limited by the accuracy of existing models. Deep Neural Networks (DNNs) are models composed of stacked transformations that learn tasks by examples. This technology has recently achieved striking success in a variety of tasks and there are great expectations on how it might improve clinical practice. Here we present a DNN model trained on a dataset with more than 2 million labeled exams analyzed by the Telehealth Network of Minas Gerais and collected under the scope of the CODE (Clinical Outcomes in Digital Electrocardiology) study. The DNN outperforms cardiology resident medical doctors in recognizing 6 types of abnormalities in 12-lead ECG recordings, with F1 scores above 80% and specificity over 99%. These results indicate ECG analysis based on DNNs, previously studied in a single-lead setup, generalizes well to 12-lead exams, taking the technology closer to the standard clinical practice.

Submitted 14 April, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

Comments: A preliminary version of this work titled: "Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network" was presented in the Machine Learning for Health Workshop at NeurIPS 2018 and was made available under a different identifier: arXiv:1811.12194. The current version subsumes all previous versions

Journal ref: Nature Communications 11, article number: 1760 (2020)

arXiv:1811.12194

Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network

Authors: Antônio H. Ribeiro, Manoel Horta Ribeiro, Gabriela Paixão, Derick Oliveira, Paulo R. Gomes, Jéssica A. Canazart, Milton Pifano, Wagner Meira Jr., Thomas B. Schön, Antonio Luiz Ribeiro

Abstract: We present a model for predicting electrocardiogram (ECG) abnormalities in short-duration 12-lead ECG signals which outperformed medical doctors on the 4th year of their cardiology residency. Such exams can provide a full evaluation of heart activity and have not been studied in previous end-to-end machine learning papers. Using the database of a large telehealth network, we built a novel dataset with more than 2 million ECG tracings, orders of magnitude larger than those used in previous studies. Moreover, our dataset is more realistic, as it consists of 12-lead ECGs recorded during standard in-clinic exams. Using this data, we trained a residual neural network with 9 convolutional layers to map 7 to 10 second ECG signals to 6 classes of ECG abnormalities. Future work should extend these results to cover a large range of ECG abnormalities, which could improve the accessibility of this diagnostic tool and avoid wrong diagnosis from medical doctors.

Submitted 17 February, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

Report number: ML4H/2018/82

arXiv:1809.05241

Graph Pattern Mining and Learning through User-defined Relations (Extended Version)

Authors: Carlos H. C. Teixeira, Leonardo Cotta, Bruno Ribeiro, Wagner Meira Jr

Abstract: In this work we propose R-GPM, a parallel computing framework for graph pattern mining (GPM) through a user-defined subgraph relation. More specifically, we enable the computation of statistics of patterns through their subgraph classes, generalizing traditional GPM methods. R-GPM provides efficient estimators for these statistics by employing an MCMC sampling algorithm combined with several optimizations. We provide both theoretical guarantees and empirical evaluations of our estimators in application scenarios such as stochastic optimization of deep high-order graph neural network models and pattern (motif) counting. We also propose and evaluate optimizations that improve our estimators' accuracy while reducing their computational costs by up to 3 orders of magnitude. Finally, we show that R-GPM is scalable, providing near-linear speedups on 44 cores in all of our tests.

Submitted 10 October, 2020; v1 submitted 13 September, 2018; originally announced September 2018.

Comments: Extended version of the paper published in the ICDM 2018

arXiv:1808.05927

Characterizing the public perception of WhatsApp through the lens of media

Authors: Josemar Alves Caetano, Gabriel Magno, Evandro Cunha, Wagner Meira Jr., Humberto T. Marques-Neto, Virgilio Almeida

Abstract: WhatsApp is, as of 2018, a significant component of the global information and communication infrastructure, especially in developing countries. However, probably due to its strong end-to-end encryption, WhatsApp became an attractive place for the dissemination of misinformation, extremism and other forms of undesirable behavior. In this paper, we investigate the public perception of WhatsApp through the lens of media. We analyze two large datasets of news and show the kind of content that is being associated with WhatsApp in different regions of the world and over time. Our analyses include the examination of named entities, general vocabulary, and topics addressed in news articles that mention WhatsApp, as well as the polarity of these texts. Among other results, we demonstrate that the vocabulary and topics around the term "whatsapp" in the media have been changing over the years and in 2018 concentrate on matters related to misinformation, politics and criminal scams. More generally, our findings are useful to understand the impact that tools like WhatsApp have on contemporary society and how they are seen by the communities themselves.

Submitted 17 August, 2018; originally announced August 2018.

Comments: Accepted as a full paper at the 2nd International Workshop on Rumours and Deception in Social Media (RDSM 2018), co-located with CIKM 2018 in Turin. Please cite the RDSM version

arXiv:1804.04096

doi 10.1145/3201064.3201081

Analyzing Right-wing YouTube Channels: Hate, Violence and Discrimination

Authors: Raphael Ottoni, Evandro Cunha, Gabriel Magno, Pedro Bernadina, Wagner Meira Jr, Virgilio Almeida

Abstract: As of 2018, YouTube, the major online video sharing website, hosts multiple channels promoting right-wing content. In this paper, we observe issues related to hate, violence and discriminatory bias in a dataset containing more than 7,000 videos and 17 million comments. We investigate similarities and differences between users' comments and video content in a selection of right-wing channels and compare it to a baseline set using a three-layered approach, in which we analyze (a) lexicon, (b) topics and (c) implicit biases present in the texts. Among other results, our analyses show that right-wing channels tend to (a) contain a higher degree of words from "negative" semantic fields, (b) raise more topics related to war and terrorism, and (c) demonstrate more discriminatory bias against Muslims (in videos) and towards LGBT people (in comments). Our findings shed light not only into the collective conduct of the YouTube community promoting and consuming right-wing content, but also into the general behavior of YouTube users.

Submitted 11 April, 2018; originally announced April 2018.

Comments: In Proceedings of the 10th ACM Conference on Web Science

arXiv:1804.00397

Analyzing and characterizing political discussions in WhatsApp public groups

Authors: Josemar Alves Caetano, Jaqueline Faria de Oliveira, Helder Seixas Lima, Humberto T. Marques-Neto, Gabriel Magno, Wagner Meira Jr, Virgílio A. F. Almeida

Abstract: We present a thorough characterization of what we believe to be the first significant analysis of the behavior of groups in WhatsApp in the scientific literature. Our characterization of over 270,000 messages and about 7,000 users spanning a 28-day period is done at three different layers. The message layer focuses on individual messages, each of which is the result of specific posts performed by a user. The user layer characterizes the user actions while interacting with a group. The group layer characterizes the aggregate message patterns of all users that participate in a group. We analyze 81 public groups in WhatsApp and classify them into two categories, political and non-political groups according to keywords associated with each group. Our contributions are two-fold. First, we introduce a framework and a number of metrics to characterize the behavior of communication groups in mobile messaging systems such as WhatsApp. Second, our analysis underscores a Zipf-like profile for user messages in political groups. Also, our analysis reveals that WhatsApp messages are multimedia, with a combination of different forms of content. Multimedia content (i.e., audio, image, and video) and emojis are present in 20% and 11.2% of all messages respectively. Political groups use more text messages than non-political groups. Second, we characterize novel features that represent the behavior of a public group, with multiple conversational turns between key members, with the participation of other members of the group.

Submitted 2 April, 2018; originally announced April 2018.

Comments: 10 pages, 12 figures

arXiv:1803.08977

Characterizing and Detecting Hateful Users on Twitter

Authors: Manoel Horta Ribeiro, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, Wagner Meira Jr

Abstract: Most current approaches to characterize and detect hate speech focus on content posted in Online Social Networks. They face shortcomings to collect and annotate hateful speech due to the incompleteness and noisiness of OSN text and the subjectivity of hate speech. These limitations are often aided with constraints that oversimplify the problem, such as considering only tweets containing hate-related words. In this work we partially address these issues by shifting the focus towards users. We develop and employ a robust methodology to collect and annotate hateful users which does not depend directly on lexicon and where the users are annotated given their entire profile. This results in a sample of Twitter's retweet graph containing 100,386 users, out of which 4,972 were annotated. We also collect the users who were banned in the three months that followed the data collection. We show that hateful users differ from normal ones in terms of their activity patterns, word usage, and network structure. We obtain similar results comparing the neighbors of hateful vs. neighbors of normal users and also suspended users vs. active users, increasing the robustness of our analysis. We observe that hateful users are densely connected, and thus formulate the hate speech detection problem as a task of semi-supervised learning over a graph, exploiting the network of connections on Twitter. We find that a node embedding algorithm, which exploits the graph structure, outperforms content-based approaches for the detection of both hateful (95% AUC vs 88% AUC) and suspended users (93% AUC vs 88% AUC). Altogether, we present a user-centric view of hate speech, paving the way for better detection and understanding of this relevant and challenging issue.

Submitted 23 March, 2018; originally announced March 2018.

Comments: This is an extended version of the homonymous short paper to be presented at ICWSM-18. arXiv admin note: text overlap with arXiv:1801.00317

arXiv:1801.00317

"Like Sheep Among Wolves": Characterizing Hateful Users on Twitter

Authors: Manoel Horta Ribeiro, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, Wagner Meira Jr

Abstract: Hateful speech in Online Social Networks (OSNs) is a key challenge for companies and governments, as it impacts users and advertisers, and as several countries have strict legislation against the practice. This has motivated work on detecting and characterizing the phenomenon in tweets, social media posts and comments. However, these approaches face several shortcomings due to the noisiness of OSN data, the sparsity of the phenomenon, and the subjectivity of the definition of hate speech. This work presents a user-centric view of hate speech, paving the way for better detection methods and understanding. We collect a Twitter dataset of 100,386 users along with up to 200 tweets from their timelines with a random-walk-based crawler on the retweet graph, and select a subsample of 4,972 to be manually annotated as hateful or not through crowdsourcing. We examine the difference between user activity patterns, the content disseminated between hateful and normal users, and network centrality measurements in the sampled graph. Our results show that hateful users have more recent account creation dates, and more statuses, and followees per day. Additionally, they favorite more tweets, tweet in shorter intervals and are more central in the retweet network, contradicting the "lone wolf" stereotype often associated with such behavior. Hateful users are more negative, more profane, and use fewer words associated with topics such as hate, terrorism, violence and anger. We also identify similarities between hateful/normal users and their 1-neighborhood, suggesting strong homophily.

Submitted 14 January, 2018; v1 submitted 31 December, 2017; originally announced January 2018.

Comments: 8 pages, 11 figures, to be presented at MIS2 Workshop @ WSDM'18

arXiv:1707.00971 [pdf, other]

Characterizing videos, audience and advertising in Youtube channels for kids

Authors: Camila Souza Araujo, Gabriel Magno, Wagner Meira Jr, Virgilio Almeida, Pedro Hartung, Danilo Doneda

Abstract: Online video services, messaging systems, games and social media services are tremendously popular among young people and children in many countries. Most of the digital services offered on the internet are advertising funded, which makes advertising ubiquitous in children's everyday life. To understand the impact of advertising-based digital services on children, we study the collective behavior of users of YouTube for kids channels and present the demographics of a large number of users. We collected data from 12,848 videos from 17 channels in the US and UK and 24 channels in Brazil. The channels in English have been viewed more than 37 billion times. We also collected more than 14 million comments made by users. Based on a combination of text-analysis and face recognition tools, we show the presence of racial and gender biases in our large sample of users. We also identify children actively using YouTube, although the minimum age for using the service is 13 years in most countries. We provide comparisons of user behavior among the three countries, which represent large user populations in the global North and the global South.

Submitted 4 July, 2017; originally announced July 2017.

arXiv:1706.05924 [pdf, other]

"Everything I Disagree With is #FakeNews": Correlating Political Polarization and Spread of Misinformation

Authors: Manoel Horta Ribeiro, Pedro H. Calais, Virgílio A. F. Almeida, Wagner Meira Jr

Abstract: An important challenge in the process of tracking and detecting the dissemination of misinformation is to understand the political gap between people that engage with so-called "fake news". A possible factor responsible for this gap is opinion polarization, which may prompt the general public to classify content that they disagree with or want to discredit as fake. In this work, we study the relationship between political polarization and content reported by Twitter users as related to "fake news". We investigate how polarization may create distinct narratives on what misinformation actually is. We perform our study based on two datasets collected from Twitter. The first dataset contains tweets about US politics in general, from which we compute the degree of polarization of each user towards the Republican and Democratic Party. In the second dataset, we collect tweets and URLs that co-occurred with "fake news" related keywords and hashtags, such as #FakeNews and #AlternativeFact, as well as reactions towards such tweets and URLs. We then analyze the relationship between polarization and what is perceived as misinformation, and whether users are designating information that they disagree with as fake. Our results show an increase in the polarization of users and URLs associated with fake-news keywords and hashtags, when compared to information not labeled as "fake news". We discuss the impact of our findings on the challenges of tracking "fake news" in the ongoing battle against misinformation.

Submitted 17 July, 2017; v1 submitted 19 June, 2017; originally announced June 2017.

Comments: 8 pages, 10 figures, to be presented at DS+J Workshop @ KDD'17

arXiv:1705.07879 [pdf, other]

Enhancement of Epidemiological Models for Dengue Fever Based on Twitter Data

Authors: Julio Albinati, Wagner Meira Jr., Gisele L. Pappa, Mauro Teixeira, Cecilia Marques-Toledo

Abstract: Epidemiological early warning systems for dengue fever rely on up-to-date epidemiological data to forecast future incidence. However, epidemiological data typically takes time to become available, due to time-consuming laboratory tests. This implies that epidemiological models need to issue predictions further in advance, making their task even more difficult. On the other hand, online platforms, such as Twitter or Google, allow us to obtain samples of users' interaction in near real-time and can be used as sensors to monitor current incidence. In this work, we propose a framework to exploit online data sources to mitigate the lack of up-to-date epidemiological data by obtaining estimates of current incidence, which are then explored by traditional epidemiological models. We show that the proposed framework obtains more accurate predictions than alternative approaches, with statistically better results for delays greater than or equal to 4 weeks.

Submitted 22 May, 2017; originally announced May 2017.

Comments: ACM Digital Health 2017

arXiv:1704.00180 [pdf, other]

doi 10.1109/SIBGRAPI.2016.059

Complexity-Aware Assignment of Latent Values in Discriminative Models for Accurate Gesture Recognition

Authors: Manoel Horta Ribeiro, Bruno Teixeira, Antônio Otávio Fernandes, Wagner Meira Jr., Erickson R. Nascimento

Abstract: Many of the state-of-the-art algorithms for gesture recognition are based on Conditional Random Fields (CRFs). Successful approaches, such as the Latent-Dynamic CRFs, extend the CRF by incorporating latent variables, whose values are mapped to the values of the labels. In this paper we propose a novel methodology to set the latent values according to the gesture complexity. We use a heuristic that iterates through the samples associated with each label value, estimating their complexity. We then use it to assign the latent values to the label values. We evaluate our method on the task of recognizing human gestures from video streams. The experiments were performed on binary datasets, generated by grouping different labels. Our results demonstrate that our approach outperforms the arbitrary one in many cases, increasing the accuracy by up to 10%.

Submitted 1 April, 2017; originally announced April 2017.

Comments: Conference paper published at 2016 29th SIBGRAPI, Conference on Graphics, Patterns and Images (SIBGRAPI). 8 pages, 7 figures

arXiv:1704.00172 [pdf, other]

Portinari: A Data Exploration Tool to Personalize Cervical Cancer Screening

Authors: Sagar Sen, Manoel Horta Ribeiro, Raquel C. de Melo Minardi, Wagner Meira Jr., Mari Nigard

Abstract: Socio-technical systems play an important role in public health screening programs to prevent cancer. Cervical cancer incidence has significantly decreased in countries that developed systems for organized screening engaging medical practitioners, laboratories and patients. The system automatically identifies individuals at risk of developing the disease and invites them for a screening exam or a follow-up exam conducted by medical professionals. A triage algorithm in the system aims to reduce unnecessary screening exams for individuals at low risk while detecting and treating individuals at high risk. Despite the general success of screening, the triage algorithm is a one-size-fits-all approach that is not personalized to a patient. This can easily be observed in historical data from screening exams. Often patients rely on personal factors to determine that they are either at high risk or not at risk at all and take action at their own discretion. Can exploring patient trajectories help hypothesize personal factors leading to their decisions? We present Portinari, a data exploration tool to query and visualize future trajectories of patients who have undergone a specific sequence of screening exams. The web-based tool contains (a) a visual query interface, (b) a backend graph database of events in patients' lives, and (c) trajectory visualization using Sankey diagrams. We use Portinari to explore diverse trajectories of patients following the Norwegian triage algorithm. The trajectories demonstrated variable degrees of adherence to the triage algorithm and allowed epidemiologists to hypothesize about the possible causes.

Submitted 1 April, 2017; originally announced April 2017.

Comments: Conference paper published at ICSE 2017 Buenos Aires, at the Software Engineering in Society Track. 10 pages, 5 figures

arXiv:1703.03895 [pdf, ps, other]

Antagonism also Flows through Retweets: The Impact of Out-of-Context Quotes in Opinion Polarization Analysis

Authors: Pedro Calais Guerra, Roberto C. S. N. P. Souza, Renato M. Assunção, Wagner Meira Jr

Abstract: In this paper, we study the implications of the commonplace assumption that most social media studies make with respect to the nature of message shares (such as retweets) as a predominantly positive interaction. By analyzing two large longitudinal Brazilian Twitter datasets containing 5 years of conversations on two polarizing topics - Politics and Sports - we empirically demonstrate that groups holding antagonistic views can actually retweet each other more often than they retweet other groups. We show that assuming retweets as endorsement interactions can lead to misleading conclusions with respect to the level of antagonism among social communities, and that this apparent paradox is explained in part by the use of retweets to quote the original content creator out of the message's original temporal context, for humor and criticism purposes. As a consequence, messages diffused on online media can have their polarity reversed over time, which poses challenges for social and computer scientists aiming to classify and track opinion groups on online media. On the other hand, we found that the time users take to retweet a message after it has been originally posted can be a useful signal to infer antagonism in social platforms, and that surges of out-of-context retweets correlate with sentiment drifts triggered by real-world events. We also discuss how such evidence can be embedded in sentiment analysis models.

Submitted 10 March, 2017; originally announced March 2017.

Comments: This is an extended version of the short paper published at ICWSM 2017

arXiv:1609.05413 [pdf, other]

Stereotypes in Search Engine Results: Understanding The Role of Local and Global Factors

Authors: Gabriel Magno, Camila Souza Araújo, Wagner Meira Jr., Virgilio Almeida

Abstract: The internet has been blurring the lines between local and global cultures, affecting in different ways the perception of people about themselves and others. In the global context of the internet, search engine platforms are a key mediator between individuals and information. In this paper, we examine the local and global impact of the internet on the formation of female physical attractiveness stereotypes in search engine results. By investigating datasets of images collected from two major search engines in 42 countries, we identify a significant fraction of replicated images. We find that common images are clustered around countries with the same language. We also show that the existence of common images among countries is practically eliminated when the queries are limited to local sites. In summary, we show evidence that results from search engines are biased towards the language used to query the system, which leads to certain attractiveness stereotypes that are often quite different from the majority of the female population of the country.

Submitted 7 November, 2016; v1 submitted 17 September, 2016; originally announced September 2016.

arXiv:1608.02499 [pdf, other]

Identifying Stereotypes in the Online Perception of Physical Attractiveness

Authors: Camila Souza Araújo, Wagner Meira Jr., Virgilio Almeida

Abstract: Stereotyping can be viewed as oversimplified ideas about social groups. They can be positive, neutral or negative. The main goal of this paper is to identify stereotypes for female physical attractiveness in images available on the Web. We look at the search engines as possible sources of stereotypes. We conducted experiments on Google and Bing by querying the search engines for beautiful and ugly women. We then collect images and extract information about faces. We propose a methodology and apply it to analyze photos gathered from search engines to understand how race and age manifest in the observed stereotypes and how they vary according to countries and regions. Our findings demonstrate the existence of stereotypes for female physical attractiveness, in particular negative stereotypes about black women and positive stereotypes about white women in terms of beauty. We also found negative stereotypes associated with older women in terms of physical attractiveness. Finally, we have identified patterns of stereotypes that are common to groups of countries.

Submitted 8 August, 2016; originally announced August 2016.

arXiv:1510.05981 [pdf, other]

A latent shared-component generative model for real-time disease surveillance using Twitter data

Authors: Roberto C. S. N. P. Souza, Denise E. F de Brito, Renato M. Assunção, Wagner Meira Jr

Abstract: Exploiting the large amount of available data for addressing relevant social problems has been one of the key challenges in data mining. Such efforts have been recently named "data science for social good" and attracted the attention of several researchers and institutions. In this paper, we contribute to this objective by considering a difficult public health problem, the timely monitoring of dengue epidemics in small geographical areas. We develop a simple yet effective generative model to connect the fluctuations of disease cases and disease-related Twitter posts. We consider a hidden Markov process driving both the fluctuations in dengue reported cases and the tweets issued in each region. We add a stable but random source of tweets to represent the posts when no disease cases are recorded. The model is learned through a Markov chain Monte Carlo algorithm that produces the posterior distribution of the relevant parameters. Using data from a significant number of large Brazilian towns, we demonstrate empirically that our model is able to predict well the next weeks of the disease counts using the tweets and disease cases jointly.

Submitted 20 October, 2015; originally announced October 2015.

Comments: Appears in 2nd ACM SIGKDD Workshop on Connected Health at Big Data Era (BigCHat)

arXiv:1301.6870 [pdf, other]

Studying User Footprints in Different Online Social Networks

Authors: Anshu Malhotra, Luam Totti, Wagner Meira Jr., Ponnurangam Kumaraguru, Virgilio Almeida

Abstract: With the growing popularity and usage of online social media services, people now have accounts (sometimes several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. Publicly available information can be used to create a digital footprint of any user using these social media services. Generating such digital footprints can be very useful for personalization, profile management, and detecting malicious behavior of users. A very important application of analyzing users' online digital footprints is to protect users from potential privacy and security risks arising from the huge publicly available user information. We extracted information about user identities on different social networks through Social Graph API, FriendFeed, and Profilactic; we collated our own dataset to create the digital footprints of the users. We used username, display name, description, location, profile image, and number of connections to generate the digital footprints of the user. We applied context-specific techniques (e.g. Jaro-Winkler similarity, WordNet-based ontologies) to measure the similarity of the user profiles on different social networks. We specifically focused on Twitter and LinkedIn. In this paper, we present the analysis and results from applying automated classifiers for disambiguating profiles belonging to the same user from different social networks. UserID and Name were found to be the most discriminative features for disambiguating user profiles. Using the most promising set of features and similarity metrics, we achieved accuracy, precision and recall of 98%, 99%, and 96%, respectively.

Submitted 29 January, 2013; originally announced January 2013.

Comments: The paper is already published in ASONAM 2012

arXiv:1209.0410 [pdf, other]

Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms

Authors: George Teodoro, Eduardo Valle, Nathan Mariano, Ricardo Torres, Wagner Meira Jr, Joel H. Saltz

Abstract: Similarity search in high-dimensional spaces is a pivotal operation found in a variety of database applications. Recently, there has been an increased interest in similarity search for online content-based multimedia services. Those services, however, introduce new challenges with respect to the very large volumes of data that have to be indexed/searched, and the need to minimize response times observed by the end-users. Additionally, those users dynamically interact with the systems creating fluctuating query request rates, requiring the search algorithm to adapt in order to better utilize the underlying hardware to reduce response times. In order to address these challenges, we introduce Hypercurves, a flexible framework for answering approximate k-nearest neighbor (kNN) queries for very large multimedia databases, aiming at online content-based multimedia services. Hypercurves executes on hybrid CPU-GPU environments, and is able to employ those devices cooperatively to support massive query request rates. In order to keep the response times optimal as the request rates vary, it employs a novel dynamic scheduler to partition the work between CPU and GPU. Hypercurves was thoroughly evaluated using a large database of multimedia descriptors. Its cooperative CPU-GPU execution achieved performance improvements of up to 30x when compared to the single CPU-core version. The dynamic work partition mechanism reduces the observed query response times by about 50% when compared to the best static CPU-GPU task partition configuration. In addition, Hypercurves achieves superlinear scalability in distributed (multi-node) executions, while keeping a high guarantee of equivalence with its sequential version, thanks to the proof of probabilistic equivalence, which supported its aggressive parallelization design.

Submitted 3 September, 2012; originally announced September 2012.

Comments: 25 pages

arXiv:1201.6568 [pdf, other]

Mining Attribute-structure Correlated Patterns in Large Attributed Graphs

Authors: Arlei Silva, Wagner Meira Jr., Mohammed J. Zaki

Abstract: In this work, we study the correlation between attribute sets and the occurrence of dense subgraphs in large attributed graphs, a task we call structural correlation pattern mining. A structural correlation pattern is a dense subgraph induced by a particular attribute set. Existing methods are not able to extract relevant knowledge regarding how vertex attributes interact with dense subgraphs. Structural correlation pattern mining combines aspects of frequent itemset and quasi-clique mining problems. We propose statistical significance measures that compare the structural correlation of attribute sets against their expected values using null models. Moreover, we evaluate the interestingness of structural correlation patterns in terms of size and density. An efficient algorithm that combines search and pruning strategies in the identification of the most relevant structural correlation patterns is presented. We apply our method for the analysis of three real-world attributed graphs: a collaboration, a music, and a citation network, verifying that it provides valuable knowledge in a feasible time.

Submitted 31 January, 2012; originally announced January 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 5, pp. 466-477 (2012)

arXiv:1111.3270 [pdf, other]

Mining Biclusters of Similar Values with Triadic Concept Analysis

Authors: Mehdi Kaytoue, Sergei O. Kuznetsov, Juraj Macko, Wagner Meira, Amedeo Napoli

Abstract: Biclustering numerical data became a popular data-mining task in the early 2000s, especially for analysing gene expression data. A bicluster reflects a strong association between a subset of objects and a subset of attributes in a numerical object/attribute data-table. So-called biclusters of similar values can be thought of as maximal sub-tables with close values. Only a few methods address a complete, correct and non-redundant enumeration of such patterns, which is a well-known intractable problem, while no formal framework exists. In this paper, we introduce important links between biclustering and formal concept analysis. More specifically, we originally show that Triadic Concept Analysis (TCA) provides a nice mathematical framework for biclustering. Interestingly, existing TCA algorithms, which usually apply to binary data, can be used (directly or with slight modifications) after a preprocessing step for extracting maximal biclusters of similar values.

Submitted 14 November, 2011; originally announced November 2011.

Comments: Concept Lattices and their Applications (CLA) (2011)

Showing 1–27 of 27 results for author: Meira, W