-
Topic Shifts as a Proxy for Assessing Politicization in Social Media
Authors:
Marcelo Sartori Locatelli,
Pedro Calais,
Matheus Prado Miranda,
João Pedro Junho,
Tomas Lacerda Muniz,
Wagner Meira Jr.,
Virgilio Almeida
Abstract:
Politicization is a social phenomenon studied by political science, characterized by the extent to which ideas and facts are given a political tone. A range of topics, such as climate change, religion and vaccines, has been subject to increasing politicization in the media and on social media platforms. In this work, we propose a computational method for assessing politicization in online conversations based on topic shifts, i.e., the degree to which people switch topics in online conversations. The intuition is that topic shifts from a non-political topic to politics are a direct measure of politicization -- making something political -- and that the more people switch conversations to politics, the more they perceive politics as playing a vital role in their daily lives. A fundamental challenge that must be addressed when one studies politicization in social media is that, a priori, any topic may be politicized. Hence, any keyword-based method or even machine learning approaches that rely on topic labels to classify topics are expensive to run and potentially ineffective. Instead, we learn from a seed of political keywords and use Positive-Unlabeled (PU) Learning to detect political comments in reaction to non-political news articles posted on Twitter, YouTube, and TikTok during the 2022 Brazilian presidential elections. Our findings indicate that all platforms show evidence of politicization, as discussions around topics adjacent to politics, such as economy, crime and drugs, tend to shift to politics. Even for the least politicized topics, the rate at which discussions shift to politics increased in the lead-up to the elections and after other political events in Brazil, which is further evidence of politicization.
Submitted 13 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Efficient Strategies for Graph Pattern Mining Algorithms on GPUs
Authors:
Samuel Ferraz,
Vinicius Dias,
Carlos H. C. Teixeira,
George Teodoro,
Wagner Meira Jr
Abstract:
Graph Pattern Mining (GPM) is an important, rapidly evolving, and computation-demanding area. GPM computation relies on subgraph enumeration, which consists of extracting subgraphs that match a given property from an input graph. Graphics Processing Units (GPUs) have been an effective platform to accelerate applications in many areas. However, the irregularity of subgraph enumeration makes efficient execution on GPUs challenging due to typical uncoalesced memory access, divergence, and load imbalance. Unfortunately, these aspects have not been fully addressed in previous work. Thus, this work proposes novel strategies to design and implement subgraph enumeration efficiently on GPUs. We support a depth-first-search-style exploration (DFS-wide) that maximizes memory performance while providing enough parallelism to be exploited by the GPU, along with a warp-centric design that minimizes execution divergence and improves utilization of the computing capabilities. We also propose a low-cost load balancing layer to avoid idleness and redistribute work among thread warps in a GPU. Our strategies have been deployed in a system named DuMato, which provides a simple programming interface to allow efficient implementation of GPM algorithms. Our evaluation has shown that DuMato is often an order of magnitude faster than state-of-the-art GPM systems and can mine larger subgraphs (up to 12 vertices).
Submitted 8 December, 2022;
originally announced December 2022.
-
Characterizing Vaccination Movements on YouTube in the United States and Brazil
Authors:
Marcelo Sartori Locatelli,
Josemar Caetano,
Wagner Meira Jr.,
Virgilio Almeida
Abstract:
In the context of the COVID-19 pandemic, social networks such as Twitter and YouTube stand out as important sources of information. YouTube, as the largest and most engaging online media consumption platform, has a large influence on the spread of information and misinformation, which makes it important to study how it deals with the problems that arise from disinformation, as well as how its users interact with different types of content. Considering that the United States (USA) and Brazil (BR) are two countries with the highest COVID-19 death tolls, we asked the following question: What are the nuances of vaccination campaigns in the two countries? With that in mind, we engage in a comparative analysis of pro- and anti-vaccine movements on YouTube. We also investigate the role of YouTube in countering online vaccine misinformation in the USA and BR. To this end, we monitored the removal of vaccine-related content on the platform and also applied various techniques to analyze the differences in discourse and engagement in pro- and anti-vaccine comment sections. We found that American anti-vaccine content tends to lead to considerably more toxic and negative discussion than its pro-vaccine counterpart while also leading to 18% higher user-user engagement, whereas Brazilian anti-vaccine content was significantly less engaging. We also found that anti-vaccine and pro-vaccine discourses are considerably different, as the former is associated with conspiracy theories (e.g., ccp), misinformation and alternative medicine (e.g., hydroxychloroquine), while the latter is associated with protective measures. Finally, it was observed that YouTube content removals are still insufficient, with only approximately 16% of the anti-vaccine content being removed by the end of the studied period, with the USA registering the highest percentage of removed anti-vaccine content (34%) and BR registering the lowest (9.8%).
Submitted 2 August, 2022;
originally announced August 2022.
-
Analyzing the "Sleeping Giants" Activism Model in Brazil
Authors:
Bárbara Gomes Ribeiro,
Manoel Horta Ribeiro,
Virgílio Almeida,
Wagner Meira Jr
Abstract:
In 2020, amidst the COVID pandemic and a polarized political climate, the Sleeping Giants online activist movement gained traction in Brazil. Its rationale was simple: to curb the spread of misinformation by harming the advertising revenue of sources that produce this type of content. Like its international counterparts, Sleeping Giants Brasil (SGB) campaigned against media outlets using Twitter to ask companies to remove ads from the targeted outlets. This work presents a thorough quantitative characterization of this activism model, analyzing the three campaigns carried out by SGB between May and September 2020. To do so, we use digital traces from both Twitter and Google Trends, toxicity and sentiment classifiers trained for the Portuguese language, and an annotated corpus of SGB's tweets. Our key findings were threefold. First, we found that SGB's requests to companies were largely successful (with 83.85% of all 192 targeted companies responding positively) and that user pressure was correlated to the speed of companies' responses. Second, there were no significant changes in the online attention and the user engagement going towards the targeted media outlets in the six months that followed SGB's campaign (as measured by Google Trends and Twitter engagement). Third, we observed that user interactions with companies changed only transiently, even if the companies did not respond to SGB's request. Overall, our results paint a nuanced portrait of internet activism. On the one hand, they suggest that SGB was successful in getting companies to boycott specific media outlets, which may have harmed their advertisement revenue stream. On the other hand, they also suggest that the activist movement did not impact the online attention these media outlets received nor the online image of companies that did not respond positively to their requests.
Submitted 25 February, 2022; v1 submitted 16 May, 2021;
originally announced May 2021.
-
Sequential Stratified Regeneration: MCMC for Large State Spaces with an Application to Subgraph Count Estimation
Authors:
Carlos H. C. Teixeira,
Mayank Kakodkar,
Vinícius Dias,
Wagner Meira Jr.,
Bruno Ribeiro
Abstract:
This work considers the general task of estimating the sum of a bounded function over the edges of a graph, given neighborhood query access and where access to the entire network is prohibitively expensive. To estimate this sum, prior work proposes Markov chain Monte Carlo (MCMC) methods that use random walks started at some seed vertex and whose equilibrium distribution is the uniform distribution over all edges, eliminating the need to iterate over all edges. Unfortunately, these existing estimators are not scalable to massive real-world graphs. In this paper, we introduce Ripple, an MCMC-based estimator that achieves unprecedented scalability by stratifying the Markov chain state space into ordered strata with a new technique that we denote sequential stratified regenerations. We show that the Ripple estimator is consistent, highly parallelizable, and scales well.
We empirically evaluate our method by applying Ripple to the task of estimating connected, induced subgraph counts given some input graph. Therein, we demonstrate that Ripple is accurate and can estimate counts of up to 12-node subgraphs, which is a task at a scale that has been considered unreachable, not only by prior MCMC-based methods but also by other sampling approaches. For instance, in this target application, we present results in which the Markov chain state space is as large as $10^{43}$, for which Ripple computes estimates in less than 4 hours, on average.
Submitted 8 April, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
Auditing Radicalization Pathways on YouTube
Authors:
Manoel Horta Ribeiro,
Raphael Ottoni,
Robert West,
Virgílio A. F. Almeida,
Wagner Meira
Abstract:
Non-profits, as well as the media, have hypothesized the existence of a radicalization pipeline on YouTube, claiming that users systematically progress towards more extreme content on the platform. Yet, there is to date no substantial quantitative evidence of this alleged pipeline. To close this gap, we conduct a large-scale audit of user radicalization on YouTube. We analyze 330,925 videos posted on 349 channels, which we broadly classified into four types: Media, the Alt-lite, the Intellectual Dark Web (I.D.W.), and the Alt-right. According to the aforementioned radicalization hypothesis, channels in the I.D.W. and the Alt-lite serve as gateways to fringe far-right ideology, here represented by Alt-right channels. Processing 72M+ comments, we show that the three channel types indeed increasingly share the same user base; that users consistently migrate from milder to more extreme content; and that a large percentage of users who now consume Alt-right content consumed Alt-lite and I.D.W. content in the past. We also probe YouTube's recommendation algorithm, looking at more than 2M video and channel recommendations between May and July 2019. We find that Alt-lite content is easily reachable from I.D.W. channels, while Alt-right videos are reachable only through channel recommendations. Overall, we paint a comprehensive picture of user radicalization on YouTube.
Submitted 21 October, 2021; v1 submitted 22 August, 2019;
originally announced August 2019.
-
Automatic diagnosis of the 12-lead ECG using a deep neural network
Authors:
Antônio H. Ribeiro,
Manoel Horta Ribeiro,
Gabriela M. M. Paixão,
Derick M. Oliveira,
Paulo R. Gomes,
Jéssica A. Canazart,
Milton P. S. Ferreira,
Carl R. Andersson,
Peter W. Macfarlane,
Wagner Meira Jr.,
Thomas B. Schön,
Antonio Luiz P. Ribeiro
Abstract:
The role of automatic electrocardiogram (ECG) analysis in clinical practice is limited by the accuracy of existing models. Deep Neural Networks (DNNs) are models composed of stacked transformations that learn tasks by examples. This technology has recently achieved striking success in a variety of tasks and there are great expectations on how it might improve clinical practice. Here we present a DNN model trained on a dataset with more than 2 million labeled exams analyzed by the Telehealth Network of Minas Gerais and collected under the scope of the CODE (Clinical Outcomes in Digital Electrocardiology) study. The DNN outperforms cardiology resident medical doctors in recognizing 6 types of abnormalities in 12-lead ECG recordings, with F1 scores above 80% and specificity over 99%. These results indicate that ECG analysis based on DNNs, previously studied in a single-lead setup, generalizes well to 12-lead exams, taking the technology closer to standard clinical practice.
Submitted 14 April, 2020; v1 submitted 1 April, 2019;
originally announced April 2019.
-
Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network
Authors:
Antônio H. Ribeiro,
Manoel Horta Ribeiro,
Gabriela Paixão,
Derick Oliveira,
Paulo R. Gomes,
Jéssica A. Canazart,
Milton Pifano,
Wagner Meira Jr.,
Thomas B. Schön,
Antonio Luiz Ribeiro
Abstract:
We present a model for predicting electrocardiogram (ECG) abnormalities in short-duration 12-lead ECG signals which outperformed medical doctors in the 4th year of their cardiology residency. Such exams can provide a full evaluation of heart activity and have not been studied in previous end-to-end machine learning papers. Using the database of a large telehealth network, we built a novel dataset with more than 2 million ECG tracings, orders of magnitude larger than those used in previous studies. Moreover, our dataset is more realistic, as it consists of 12-lead ECGs recorded during standard in-clinic exams. Using this data, we trained a residual neural network with 9 convolutional layers to map 7 to 10 second ECG signals to 6 classes of ECG abnormalities. Future work should extend these results to cover a larger range of ECG abnormalities, which could improve the accessibility of this diagnostic tool and avoid wrong diagnoses from medical doctors.
Submitted 17 February, 2019; v1 submitted 28 November, 2018;
originally announced November 2018.
-
Graph Pattern Mining and Learning through User-defined Relations (Extended Version)
Authors:
Carlos H. C. Teixeira,
Leonardo Cotta,
Bruno Ribeiro,
Wagner Meira Jr
Abstract:
In this work we propose R-GPM, a parallel computing framework for graph pattern mining (GPM) through a user-defined subgraph relation. More specifically, we enable the computation of statistics of patterns through their subgraph classes, generalizing traditional GPM methods. R-GPM provides efficient estimators for these statistics by employing an MCMC sampling algorithm combined with several optimizations. We provide both theoretical guarantees and empirical evaluations of our estimators in application scenarios such as stochastic optimization of deep high-order graph neural network models and pattern (motif) counting. We also propose and evaluate optimizations that improve our estimators' accuracy while reducing their computational costs by up to three orders of magnitude. Finally, we show that R-GPM is scalable, providing near-linear speedups on 44 cores in all of our tests.
Submitted 10 October, 2020; v1 submitted 13 September, 2018;
originally announced September 2018.
-
Characterizing the public perception of WhatsApp through the lens of media
Authors:
Josemar Alves Caetano,
Gabriel Magno,
Evandro Cunha,
Wagner Meira Jr.,
Humberto T. Marques-Neto,
Virgilio Almeida
Abstract:
WhatsApp is, as of 2018, a significant component of the global information and communication infrastructure, especially in developing countries. However, probably due to its strong end-to-end encryption, WhatsApp became an attractive place for the dissemination of misinformation, extremism and other forms of undesirable behavior. In this paper, we investigate the public perception of WhatsApp through the lens of media. We analyze two large datasets of news and show the kind of content that is being associated with WhatsApp in different regions of the world and over time. Our analyses include the examination of named entities, general vocabulary, and topics addressed in news articles that mention WhatsApp, as well as the polarity of these texts. Among other results, we demonstrate that the vocabulary and topics around the term "whatsapp" in the media have been changing over the years and in 2018 concentrate on matters related to misinformation, politics and criminal scams. More generally, our findings are useful to understand the impact that tools like WhatsApp have on contemporary society and how they are seen by the communities themselves.
Submitted 17 August, 2018;
originally announced August 2018.
-
Analyzing Right-wing YouTube Channels: Hate, Violence and Discrimination
Authors:
Raphael Ottoni,
Evandro Cunha,
Gabriel Magno,
Pedro Bernadina,
Wagner Meira Jr,
Virgilio Almeida
Abstract:
As of 2018, YouTube, the major online video sharing website, hosts multiple channels promoting right-wing content. In this paper, we observe issues related to hate, violence and discriminatory bias in a dataset containing more than 7,000 videos and 17 million comments. We investigate similarities and differences between users' comments and video content in a selection of right-wing channels and compare them to a baseline set using a three-layered approach, in which we analyze (a) lexicon, (b) topics and (c) implicit biases present in the texts. Among other results, our analyses show that right-wing channels tend to (a) contain a higher degree of words from "negative" semantic fields, (b) raise more topics related to war and terrorism, and (c) demonstrate more discriminatory bias against Muslims (in videos) and towards LGBT people (in comments). Our findings shed light not only on the collective conduct of the YouTube community promoting and consuming right-wing content, but also on the general behavior of YouTube users.
Submitted 11 April, 2018;
originally announced April 2018.
-
Analyzing and characterizing political discussions in WhatsApp public groups
Authors:
Josemar Alves Caetano,
Jaqueline Faria de Oliveira,
Helder Seixas Lima,
Humberto T. Marques-Neto,
Gabriel Magno,
Wagner Meira Jr,
Virgílio A. F. Almeida
Abstract:
We present a thorough characterization of what we believe to be the first significant analysis of the behavior of groups in WhatsApp in the scientific literature. Our characterization of over 270,000 messages and about 7,000 users spanning a 28-day period is done at three different layers. The message layer focuses on individual messages, each of which is the result of specific posts performed by a user. The user layer characterizes the user actions while interacting with a group. The group layer characterizes the aggregate message patterns of all users that participate in a group. We analyze 81 public groups in WhatsApp and classify them into two categories, political and non-political groups, according to keywords associated with each group. Our contributions are two-fold. First, we introduce a framework and a number of metrics to characterize the behavior of communication groups in mobile messaging systems such as WhatsApp. Second, our analysis underscores a Zipf-like profile for user messages in political groups. Also, our analysis reveals that WhatsApp messages are multimedia, with a combination of different forms of content. Multimedia content (i.e., audio, image, and video) and emojis are present in 20% and 11.2% of all messages, respectively. Political groups use more text messages than non-political groups. We also characterize novel features that represent the behavior of a public group, with multiple conversational turns between key members, with the participation of other members of the group.
Submitted 2 April, 2018;
originally announced April 2018.
-
Characterizing and Detecting Hateful Users on Twitter
Authors:
Manoel Horta Ribeiro,
Pedro H. Calais,
Yuri A. Santos,
Virgílio A. F. Almeida,
Wagner Meira Jr
Abstract:
Most current approaches to characterize and detect hate speech focus on content posted in Online Social Networks (OSNs). They face shortcomings in collecting and annotating hateful speech due to the incompleteness and noisiness of OSN text and the subjectivity of hate speech. These limitations are often handled with constraints that oversimplify the problem, such as considering only tweets containing hate-related words. In this work we partially address these issues by shifting the focus towards users. We develop and employ a robust methodology to collect and annotate hateful users which does not depend directly on lexicon and where the users are annotated given their entire profile. This results in a sample of Twitter's retweet graph containing 100,386 users, out of which 4,972 were annotated. We also collect the users who were banned in the three months that followed the data collection. We show that hateful users differ from normal ones in terms of their activity patterns, word usage, and network structure. We obtain similar results comparing the neighbors of hateful vs. neighbors of normal users and also suspended users vs. active users, increasing the robustness of our analysis. We observe that hateful users are densely connected, and thus formulate the hate speech detection problem as a task of semi-supervised learning over a graph, exploiting the network of connections on Twitter. We find that a node embedding algorithm, which exploits the graph structure, outperforms content-based approaches for the detection of both hateful (95% AUC vs 88% AUC) and suspended users (93% AUC vs 88% AUC). Altogether, we present a user-centric view of hate speech, paving the way for better detection and understanding of this relevant and challenging issue.
Submitted 23 March, 2018;
originally announced March 2018.
-
"Like Sheep Among Wolves": Characterizing Hateful Users on Twitter
Authors:
Manoel Horta Ribeiro,
Pedro H. Calais,
Yuri A. Santos,
Virgílio A. F. Almeida,
Wagner Meira Jr
Abstract:
Hateful speech in Online Social Networks (OSNs) is a key challenge for companies and governments, as it impacts users and advertisers, and as several countries have strict legislation against the practice. This has motivated work on detecting and characterizing the phenomenon in tweets, social media posts and comments. However, these approaches face several shortcomings due to the noisiness of OSN data, the sparsity of the phenomenon, and the subjectivity of the definition of hate speech. This work presents a user-centric view of hate speech, paving the way for better detection methods and understanding. We collect a Twitter dataset of 100,386 users along with up to 200 tweets from their timelines with a random-walk-based crawler on the retweet graph, and select a subsample of 4,972 to be manually annotated as hateful or not through crowdsourcing. We examine the difference between user activity patterns, the content disseminated between hateful and normal users, and network centrality measurements in the sampled graph. Our results show that hateful users have more recent account creation dates, and more statuses and followees per day. Additionally, they favorite more tweets, tweet in shorter intervals and are more central in the retweet network, contradicting the "lone wolf" stereotype often associated with such behavior. Hateful users are more negative, more profane, and use fewer words associated with topics such as hate, terrorism, violence and anger. We also identify similarities between hateful/normal users and their 1-neighborhood, suggesting strong homophily.
Submitted 14 January, 2018; v1 submitted 31 December, 2017;
originally announced January 2018.
-
Characterizing videos, audience and advertising in Youtube channels for kids
Authors:
Camila Souza Araujo,
Gabriel Magno,
Wagner Meira Jr,
Virgilio Almeida,
Pedro Hartung,
Danilo Doneda
Abstract:
Online video services, messaging systems, games and social media services are tremendously popular among young people and children in many countries. Most of the digital services offered on the internet are advertising funded, which makes advertising ubiquitous in children's everyday life. To understand the impact of advertising-based digital services on children, we study the collective behavior of users of YouTube channels for kids and present the demographics of a large number of users. We collected data from 12,848 videos from 17 channels in the US and UK and 24 channels in Brazil. The channels in English have been viewed more than 37 billion times. We also collected more than 14 million comments made by users. Based on a combination of text-analysis and face recognition tools, we show the presence of racial and gender biases in our large sample of users. We also identify children actively using YouTube, although the minimum age for using the service is 13 years in most countries. We provide comparisons of user behavior among the three countries, which represent large user populations in the global North and the global South.
Submitted 4 July, 2017;
originally announced July 2017.
-
"Everything I Disagree With is #FakeNews": Correlating Political Polarization and Spread of Misinformation
Authors:
Manoel Horta Ribeiro,
Pedro H. Calais,
Virgílio A. F. Almeida,
Wagner Meira Jr
Abstract:
An important challenge in the process of tracking and detecting the dissemination of misinformation is to understand the political gap between people that engage with the so-called "fake news". A possible factor responsible for this gap is opinion polarization, which may prompt the general public to classify content that they disagree with or want to discredit as fake. In this work, we study the relationship between political polarization and content reported by Twitter users as related to "fake news". We investigate how polarization may create distinct narratives on what misinformation actually is. We perform our study based on two datasets collected from Twitter. The first dataset contains tweets about US politics in general, from which we compute the degree of polarization of each user towards the Republican and Democratic Party. In the second dataset, we collect tweets and URLs that co-occurred with "fake news" related keywords and hashtags, such as #FakeNews and #AlternativeFact, as well as reactions towards such tweets and URLs. We then analyze the relationship between polarization and what is perceived as misinformation, and whether users are designating information that they disagree with as fake. Our results show an increase in the polarization of users and URLs associated with fake-news keywords and hashtags, when compared to information not labeled as "fake news". We discuss the impact of our findings on the challenges of tracking "fake news" in the ongoing battle against misinformation.
Submitted 17 July, 2017; v1 submitted 19 June, 2017;
originally announced June 2017.
-
Enhancement of Epidemiological Models for Dengue Fever Based on Twitter Data
Authors:
Julio Albinati,
Wagner Meira Jr.,
Gisele L. Pappa,
Mauro Teixeira,
Cecilia Marques-Toledo
Abstract:
Epidemiological early warning systems for dengue fever rely on up-to-date epidemiological data to forecast future incidence. However, epidemiological data typically takes time to become available, due to time-consuming laboratory tests. This implies that epidemiological models need to issue predictions further in advance, making their task even more difficult. On the other hand, online platforms, such as Twitter or Google, allow us to obtain samples of users' interaction in near real-time and can be used as sensors to monitor current incidence. In this work, we propose a framework to exploit online data sources to mitigate the lack of up-to-date epidemiological data by obtaining estimates of current incidence, which are then explored by traditional epidemiological models. We show that the proposed framework obtains more accurate predictions than alternative approaches, with statistically better results for delays greater than or equal to 4 weeks.
Submitted 22 May, 2017;
originally announced May 2017.
-
Complexity-Aware Assignment of Latent Values in Discriminative Models for Accurate Gesture Recognition
Authors:
Manoel Horta Ribeiro,
Bruno Teixeira,
Antônio Otávio Fernandes,
Wagner Meira Jr.,
Erickson R. Nascimento
Abstract:
Many of the state-of-the-art algorithms for gesture recognition are based on Conditional Random Fields (CRFs). Successful approaches, such as the Latent-Dynamic CRFs, extend the CRF by incorporating latent variables, whose values are mapped to the values of the labels. In this paper we propose a novel methodology to set the latent values according to the gesture complexity. We use a heuristic that iterates through the samples associated with each label value, estimating their complexity. We then use it to assign the latent values to the label values. We evaluate our method on the task of recognizing human gestures from video streams. The experiments were performed on binary datasets, generated by grouping different labels. Our results demonstrate that our approach outperforms the arbitrary one in many cases, increasing the accuracy by up to 10%.
Submitted 1 April, 2017;
originally announced April 2017.
-
Portinari: A Data Exploration Tool to Personalize Cervical Cancer Screening
Authors:
Sagar Sen,
Manoel Horta Ribeiro,
Raquel C. de Melo Minardi,
Wagner Meira Jr.,
Mari Nigard
Abstract:
Socio-technical systems play an important role in public health screening programs to prevent cancer. Cervical cancer incidence has significantly decreased in countries that developed systems for organized screening engaging medical practitioners, laboratories and patients. The system automatically identifies individuals at risk of developing the disease and invites them for a screening exam or a follow-up exam conducted by medical professionals. A triage algorithm in the system aims to reduce unnecessary screening exams for individuals at low risk while detecting and treating individuals at high risk. Despite the general success of screening, the triage algorithm is a one-size-fits-all approach that is not personalized to a patient. This can easily be observed in historical data from screening exams. Often patients rely on personal factors to determine that they are either at high risk or not at risk at all and take action at their own discretion. Can exploring patient trajectories help hypothesize personal factors leading to their decisions? We present Portinari, a data exploration tool to query and visualize future trajectories of patients who have undergone a specific sequence of screening exams. The web-based tool contains (a) a visual query interface, (b) a backend graph database of events in patients' lives, and (c) trajectory visualization using Sankey diagrams. We use Portinari to explore diverse trajectories of patients following the Norwegian triage algorithm. The trajectories demonstrated variable degrees of adherence to the triage algorithm and allowed epidemiologists to hypothesize about the possible causes.
Submitted 1 April, 2017;
originally announced April 2017.
-
Antagonism also Flows through Retweets: The Impact of Out-of-Context Quotes in Opinion Polarization Analysis
Authors:
Pedro Calais Guerra,
Roberto C. S. N. P. Souza,
Renato M. Assunção,
Wagner Meira Jr
Abstract:
In this paper, we study the implications of the commonplace assumption that most social media studies make with respect to the nature of message shares (such as retweets) as a predominantly positive interaction. By analyzing two large longitudinal Brazilian Twitter datasets containing 5 years of conversations on two polarizing topics - Politics and Sports - we empirically demonstrate that groups holding antagonistic views can actually retweet each other more often than they retweet other groups. We show that assuming retweets as endorsement interactions can lead to misleading conclusions with respect to the level of antagonism among social communities, and that this apparent paradox is explained in part by the use of retweets to quote the original content creator out of the message's original temporal context, for humor and criticism purposes. As a consequence, messages diffused on online media can have their polarity reversed over time, which poses challenges for social and computer scientists aiming to classify and track opinion groups on online media. On the other hand, we found that the time users take to retweet a message after it has been originally posted can be a useful signal to infer antagonism in social platforms, and that surges of out-of-context retweets correlate with sentiment drifts triggered by real-world events. We also discuss how such evidence can be embedded in sentiment analysis models.
Submitted 10 March, 2017;
originally announced March 2017.
-
Stereotypes in Search Engine Results: Understanding The Role of Local and Global Factors
Authors:
Gabriel Magno,
Camila Souza Araújo,
Wagner Meira Jr.,
Virgilio Almeida
Abstract:
The internet has been blurring the lines between local and global cultures, affecting in different ways the perception of people about themselves and others. In the global context of the internet, search engine platforms are a key mediator between individuals and information. In this paper, we examine the local and global impact of the internet on the formation of female physical attractiveness stereotypes in search engine results. By investigating datasets of images collected from two major search engines in 42 countries, we identify a significant fraction of replicated images. We find that common images are clustered around countries with the same language. We also show that existence of common images among countries is practically eliminated when the queries are limited to local sites. In summary, we show evidence that results from search engines are biased towards the language used to query the system, which leads to certain attractiveness stereotypes that are often quite different from the majority of the female population of the country.
Submitted 7 November, 2016; v1 submitted 17 September, 2016;
originally announced September 2016.
-
Identifying Stereotypes in the Online Perception of Physical Attractiveness
Authors:
Camila Souza Araújo,
Wagner Meira Jr.,
Virgilio Almeida
Abstract:
Stereotyping can be viewed as oversimplified ideas about social groups. They can be positive, neutral or negative. The main goal of this paper is to identify stereotypes for female physical attractiveness in images available on the Web. We look at the search engines as possible sources of stereotypes. We conducted experiments on Google and Bing by querying the search engines for beautiful and ugly women. We then collect images and extract information about faces. We propose a methodology and apply it to analyze photos gathered from search engines to understand how race and age manifest in the observed stereotypes and how they vary according to countries and regions. Our findings demonstrate the existence of stereotypes for female physical attractiveness, in particular negative stereotypes about black women and positive stereotypes about white women in terms of beauty. We also found negative stereotypes associated with older women in terms of physical attractiveness. Finally, we have identified patterns of stereotypes that are common to groups of countries.
Submitted 8 August, 2016;
originally announced August 2016.
-
A latent shared-component generative model for real-time disease surveillance using Twitter data
Authors:
Roberto C. S. N. P. Souza,
Denise E. F de Brito,
Renato M. Assunção,
Wagner Meira Jr
Abstract:
Exploiting the large amount of available data for addressing relevant social problems has been one of the key challenges in data mining. Such efforts have been recently named "data science for social good" and attracted the attention of several researchers and institutions. In this paper, we contribute to this objective by considering a difficult public health problem, the timely monitoring of dengue epidemics in small geographical areas. We develop a simple yet effective generative model to connect the fluctuations of disease cases and disease-related Twitter posts. We consider a hidden Markov process driving both the fluctuations in reported dengue cases and the tweets issued in each region. We add a stable but random source of tweets to represent the posts when no disease cases are recorded. The model is learned through a Markov chain Monte Carlo algorithm that produces the posterior distribution of the relevant parameters. Using data from a significant number of large Brazilian towns, we demonstrate empirically that our model is able to predict well the next weeks of the disease counts using the tweets and disease cases jointly.
Submitted 20 October, 2015;
originally announced October 2015.
-
Studying User Footprints in Different Online Social Networks
Authors:
Anshu Malhotra,
Luam Totti,
Wagner Meira Jr.,
Ponnurangam Kumaraguru,
Virgilio Almeida
Abstract:
With the growing popularity and usage of online social media services, people now have accounts (sometimes several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. Publicly available information can be used to create a digital footprint of any user using these social media services. Generating such digital footprints can be very useful for personalization, profile management, and detecting malicious behavior of users. A very important application of analyzing users' online digital footprints is to protect users from potential privacy and security risks arising from the huge amount of publicly available user information. We extracted information about user identities on different social networks through Social Graph API, FriendFeed, and Profilactic; we collated our own dataset to create the digital footprints of the users. We used username, display name, description, location, profile image, and number of connections to generate the digital footprints of the user. We applied context-specific techniques (e.g. Jaro-Winkler similarity, WordNet-based ontologies) to measure the similarity of the user profiles on different social networks. We specifically focused on Twitter and LinkedIn. In this paper, we present the analysis and results from applying automated classifiers for disambiguating profiles belonging to the same user from different social networks. UserID and Name were found to be the most discriminative features for disambiguating user profiles. Using the most promising set of features and similarity metrics, we achieved accuracy, precision and recall of 98%, 99%, and 96%, respectively.
Submitted 29 January, 2013;
originally announced January 2013.
-
Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms
Authors:
George Teodoro,
Eduardo Valle,
Nathan Mariano,
Ricardo Torres,
Wagner Meira Jr,
Joel H. Saltz
Abstract:
Similarity search in high-dimensional spaces is a pivotal operation found in a variety of database applications. Recently, there has been increased interest in similarity search for online content-based multimedia services. Those services, however, introduce new challenges with respect to the very large volumes of data that have to be indexed/searched, and the need to minimize response times observed by the end-users. Additionally, those users dynamically interact with the systems, creating fluctuating query request rates and requiring the search algorithm to adapt in order to better utilize the underlying hardware to reduce response times. In order to address these challenges, we introduce Hypercurves, a flexible framework for answering approximate k-nearest neighbor (kNN) queries for very large multimedia databases, aiming at online content-based multimedia services. Hypercurves executes on hybrid CPU--GPU environments, and is able to employ those devices cooperatively to support massive query request rates. In order to keep the response times optimal as the request rates vary, it employs a novel dynamic scheduler to partition the work between CPU and GPU. Hypercurves was thoroughly evaluated using a large database of multimedia descriptors. Its cooperative CPU--GPU execution achieved performance improvements of up to 30x when compared to the single CPU-core version. The dynamic work partition mechanism reduces the observed query response times by about 50% when compared to the best static CPU--GPU task partition configuration. In addition, Hypercurves achieves superlinear scalability in distributed (multi-node) executions, while keeping a high guarantee of equivalence with its sequential version, thanks to the proof of probabilistic equivalence, which supported its aggressive parallelization design.
Submitted 3 September, 2012;
originally announced September 2012.
-
Mining Attribute-structure Correlated Patterns in Large Attributed Graphs
Authors:
Arlei Silva,
Wagner Meira Jr.,
Mohammed J. Zaki
Abstract:
In this work, we study the correlation between attribute sets and the occurrence of dense subgraphs in large attributed graphs, a task we call structural correlation pattern mining. A structural correlation pattern is a dense subgraph induced by a particular attribute set. Existing methods are not able to extract relevant knowledge regarding how vertex attributes interact with dense subgraphs. Structural correlation pattern mining combines aspects of frequent itemset and quasi-clique mining problems. We propose statistical significance measures that compare the structural correlation of attribute sets against their expected values using null models. Moreover, we evaluate the interestingness of structural correlation patterns in terms of size and density. An efficient algorithm that combines search and pruning strategies in the identification of the most relevant structural correlation patterns is presented. We apply our method for the analysis of three real-world attributed graphs: a collaboration, a music, and a citation network, verifying that it provides valuable knowledge in a feasible time.
Submitted 31 January, 2012;
originally announced January 2012.
-
Mining Biclusters of Similar Values with Triadic Concept Analysis
Authors:
Mehdi Kaytoue,
Sergei O. Kuznetsov,
Juraj Macko,
Wagner Meira,
Amedeo Napoli
Abstract:
Biclustering numerical data became a popular data-mining task in the early 2000s, especially for analysing gene expression data. A bicluster reflects a strong association between a subset of objects and a subset of attributes in a numerical object/attribute data-table. So-called biclusters of similar values can be thought of as maximal sub-tables with close values. Only a few methods address a complete, correct and non-redundant enumeration of such patterns, which is a well-known intractable problem, while no formal framework exists. In this paper, we introduce important links between biclustering and formal concept analysis. More specifically, we originally show that Triadic Concept Analysis (TCA) provides a nice mathematical framework for biclustering. Interestingly, existing algorithms of TCA, which usually apply to binary data, can be used (directly or with slight modifications) after a preprocessing step for extracting maximal biclusters of similar values.
Submitted 14 November, 2011;
originally announced November 2011.