-
adF: A Novel System for Measuring Web Fingerprinting through Ads
Authors:
Miguel A. Bermejo-Agueda,
Patricia Callejo,
Rubén Cuevas,
Ángel Cuevas
Abstract:
This paper introduces adF, a novel system for analyzing the vulnerability of different devices, Operating Systems (OSes), and browsers to web fingerprinting. adF performs its measurements from code inserted in ads. We have used our system in several ad campaigns that delivered 5.40 million ad impressions. The collected data enable us to assess the vulnerability of current desktop and mobile devices to web fingerprinting. Based on our results, we estimate that 64% of desktop devices and 40% of mobile devices can be uniquely fingerprinted with our web fingerprinting system. However, the resilience to web fingerprinting varies significantly across browsers and device types, with Chrome on desktops being the most vulnerable configuration.
Submitted 15 November, 2023;
originally announced November 2023.
-
Analysis and implementation of nanotargeting on LinkedIn based on publicly available non-PII
Authors:
Ángel Merino,
José González-Cabañas,
Ángel Cuevas,
Rubén Cuevas
Abstract:
The literature has shown that combining a few non-Personal Identifiable Information (non-PII) items is enough to make a user unique in a dataset including millions of users. This work demonstrates that a combination of a few non-PII items can be activated to nanotarget users. We demonstrate that the combination of the location and 5 rare (13 random) skills in a LinkedIn profile is enough to become unique in a user base of ~970M users with a probability of 75%. The novelty is that these attributes are publicly accessible to anyone registered on LinkedIn and can be activated through advertising campaigns. We ran an experiment configuring ad campaigns using the location and skills of three of the paper's authors, demonstrating how all the ads using ≥13 skills were delivered exclusively to the targeted user. We reported this vulnerability to LinkedIn, which initially ignored the problem, but fixed it as of November 2023. This nanotargeting may expose LinkedIn users to privacy and security risks such as malvertising or manipulation.
Submitted 16 May, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Time Series Clustering With Random Convolutional Kernels
Authors:
Jorge Marco-Blanco,
Rubén Cuevas
Abstract:
Time series data, spanning applications ranging from climatology to finance to healthcare, presents significant challenges in data mining due to its size and complexity. One open issue lies in time series clustering, which is crucial for processing large volumes of unlabeled time series data and unlocking valuable insights. Traditional and modern analysis methods, however, often struggle with these complexities. To address these limitations, we introduce R-Clustering, a novel method that utilizes convolutional architectures with randomly selected parameters. Through extensive evaluations, R-Clustering demonstrates superior performance over existing methods in terms of clustering accuracy, computational efficiency and scalability. Empirical results obtained using the UCR archive demonstrate the effectiveness of our approach across diverse time series datasets. The findings highlight the significance of R-Clustering in various domains and applications, contributing to the advancement of time series data mining.
Submitted 6 July, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
CarbonTag: A Browser-Based Method for Approximating Energy Consumption of Online Ads
Authors:
José González Cabañas,
Patricia Callejo,
Rubén Cuevas,
Steffen Svatberg,
Tommy Torjesen,
Ángel Cuevas,
Antonio Pastor,
Mikko Kotila
Abstract:
Energy is today the most critical environmental challenge. The amount of carbon emissions contributing to climate change is significantly influenced by both the production and consumption of energy. Measuring and reducing the energy consumption of services is a crucial step toward reducing adverse environmental effects caused by carbon emissions. Millions of websites rely on online advertisements to generate revenue, with most websites earning most or all of their revenues from ads. As a result, hundreds of billions of online ads are delivered daily to internet users to be rendered in their browsers. Both the delivery and rendering of each ad consume energy. This study investigates how much energy online ads use in the rendering process and offers a way to predict it as part of rendering the ad. To the best of the authors' knowledge, this is the first study to calculate the energy usage of single advertisements in the rendering process. Our research further introduces different levels of consumption by which online ads can be classified based on energy efficiency. This classification will allow advertisers to add energy efficiency metrics and optimize campaigns toward consuming as little energy as possible.
Submitted 26 June, 2023; v1 submitted 25 October, 2022;
originally announced November 2022.
-
Polarization dynamics, stability and tunability of a dual-comb polarization-multiplexing ring-cavity fiber laser
Authors:
Alberto Rodriguez Cuevas,
Hani J. Kbashi,
Dmitrii Stoliarov,
Sergey Sergeyev
Abstract:
In this paper, we demonstrate a polarization-multiplexed system capable of generating two stable optical frequency combs with tunable frequency differences and a large extinction ratio. We also experimentally study the polarization dynamics of a dual-frequency comb generated from a single mode-locked Er-doped fiber laser. The obtained results will extend the application of such systems to areas such as polarization spectroscopy and dual-comb-based polarimetry.
Submitted 8 August, 2022;
originally announced August 2022.
-
Unique on Facebook: Formulation and Evidence of (Nano)targeting Individual Users with non-PII Data
Authors:
José González-Cabañas,
Ángel Cuevas,
Rubén Cuevas,
Juan López-Fernández,
David García
Abstract:
The privacy of an individual is bounded by the ability of a third party to reveal their identity. Certain data items such as a passport ID or a mobile phone number may be used to uniquely identify a person. These are referred to as Personal Identifiable Information (PII) items. Previous literature has also reported that, in datasets including millions of users, a combination of several non-PII items (which alone are not enough to identify an individual) can uniquely identify an individual within the dataset. In this paper, we define a data-driven model to quantify the number of interests from a user that make them unique on Facebook. To the best of our knowledge, this represents the first study of individuals' uniqueness at the world population scale. Besides, users' interests are actionable non-PII items that can be used to define ad campaigns and deliver tailored ads to Facebook users. We run an experiment through 21 Facebook ad campaigns that target three of the authors of this paper to prove that, if an advertiser knows enough interests from a user, the Facebook Advertising Platform can be systematically exploited to deliver ads exclusively to a specific user. We refer to this practice as nanotargeting. Finally, we discuss the harmful risks associated with nanotargeting such as psychological persuasion, user manipulation, or blackmailing, and provide easily implementable countermeasures to preclude attacks based on nanotargeting campaigns on Facebook.
Submitted 16 October, 2021; v1 submitted 13 October, 2021;
originally announced October 2021.
-
A deep dive into the accuracy of IP Geolocation Databases and its impact on online advertising
Authors:
Patricia Callejo,
Marco Gramaglia,
Rubén Cuevas,
Ángel Cuevas
Abstract:
The quest for an ever more personalized Internet experience relies on enriched contextual information about each user. Online advertising also follows this approach. Location is one of the pieces of context information that advertising stakeholders leverage. However, when this information is not directly available from the end users, advertising stakeholders infer it using geolocation databases, matching IP addresses to a position on earth. The accuracy of this approach has often been questioned in the past; however, a reality check on an advertising DSP shows that this technique accounts for a large fraction of the served advertisements. In this paper, we revisit the work in the field, most of which dates from almost a decade ago, through the lens of big data. More specifically, we i) benchmark two commercial Internet geolocation databases and evaluate the quality of their information using a ground truth database of user positions containing more than 2 billion samples, ii) analyze the internals of these databases, devising a theoretical upper bound for the quality of the Internet geolocation approach, and iii) run an empirical study that unveils the monetary impact of this technology by considering the costs associated with a real-world ad impressions dataset. We show that when factoring cost in, IP geolocation technology may be, under certain campaign characteristics, a better alternative than GPS from an economic point of view, despite its inferior performance.
Submitted 1 June, 2022; v1 submitted 27 September, 2021;
originally announced September 2021.
-
Scalable Analysis for Covid-19 and Vaccine Data
Authors:
Chris Collins,
Roxana Cuevas,
Edward Hernandez,
Reece Hernandez,
Breanna Le,
Jongwook Woo
Abstract:
This paper explains the scalable methods used for extracting and analyzing Covid-19 vaccine data. Using Big Data tools such as Hadoop and Hive, we collect and analyze massive data sets of confirmed cases, fatalities, and vaccinations for Covid-19. The data size is about 3.2 gigabytes. We show that it is possible to store and process massive data with Big Data tools. The paper performs a spatio-temporal analysis and visualizes the results of the investigation with maps, charts, and pie charts. We illustrate that the more people are vaccinated, the fewer confirmed cases there are.
Submitted 5 August, 2021;
originally announced August 2021.
-
How resilient is the Open Web to the COVID-19 pandemic?
Authors:
José González-Cabañas,
Patricia Callejo,
Pelayo Vallina,
Ángel Cuevas,
Rubén Cuevas,
Antonio Fernández Anta
Abstract:
In this paper we use the term Open Web to refer to the set of services offered freely to Internet users, which represents a pillar of modern societies. Despite its importance for society, it is unknown how the COVID-19 pandemic is affecting the Open Web. In this paper, we address this issue, focusing our analysis on Spain, one of the countries most impacted by the pandemic.
On the one hand, we study the impact of the pandemic in the financial backbone of the Open Web, the online advertising business. To this end, we leverage concepts from Supply-Demand economic theory to perform a careful analysis of the elasticity in the supply of ad-spaces to the financial shortage of the online advertising business and its subsequent reduction in ad spaces' price. On the other hand, we analyze the distribution of the Open Web composition across business categories and its evolution during the COVID-19 pandemic. These analyses are conducted between Jan 1st and Dec 31st, 2020, using a reference dataset comprising information from more than 18 billion ad spaces.
Our results indicate that the Open Web has experienced a moderate shift in its composition across business categories. However, this change is not produced by the financial shortage of the online advertising business, because as our analysis shows, the Open Web's supply of ad spaces is inelastic (i.e., insensitive) to the sustained low-price of ad spaces during the pandemic. Instead, existing evidence suggests that the reported shift in the Open Web composition is likely due to the change in the users' online behavior (e.g., browsing and mobile apps utilization patterns).
Submitted 28 March, 2022; v1 submitted 30 July, 2021;
originally announced July 2021.
-
Science Requirements and Detector Concepts for the Electron-Ion Collider: EIC Yellow Report
Authors:
R. Abdul Khalek,
A. Accardi,
J. Adam,
D. Adamiak,
W. Akers,
M. Albaladejo,
A. Al-bataineh,
M. G. Alexeev,
F. Ameli,
P. Antonioli,
N. Armesto,
W. R. Armstrong,
M. Arratia,
J. Arrington,
A. Asaturyan,
M. Asai,
E. C. Aschenauer,
S. Aune,
H. Avagyan,
C. Ayerbe Gayoso,
B. Azmoun,
A. Bacchetta,
M. D. Baker,
F. Barbosa,
L. Barion
, et al. (390 additional authors not shown)
Abstract:
This report describes the physics case, the resulting detector requirements, and the evolving detector concepts for the experimental program at the Electron-Ion Collider (EIC). The EIC will be a powerful new high-luminosity facility in the United States with the capability to collide high-energy electron beams with high-energy proton and ion beams, providing access to those regions in the nucleon and nuclei where their structure is dominated by gluons. Moreover, polarized beams in the EIC will give unprecedented access to the spatial and spin structure of the proton, neutron, and light ions. The studies leading to this document were commissioned and organized by the EIC User Group with the objective of advancing the state and detail of the physics program and developing detector concepts that meet the emerging requirements in preparation for the realization of the EIC. The effort aims to provide the basis for further development of concepts for experimental equipment best suited for the science needs, including the importance of two complementary detectors and interaction regions.
This report consists of three volumes. Volume I is an executive summary of our findings and developed concepts. In Volume II we describe studies of a wide range of physics measurements and the emerging requirements on detector acceptance and performance. Volume III discusses general-purpose detector concepts and the underlying technologies to meet the physics requirements. These considerations will form the basis for a world-class experimental program that aims to increase our understanding of the fundamental structure of all visible matter.
Submitted 26 October, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
Digital Contact Tracing: Large-scale Geolocation Data as an Alternative to Bluetooth-based Apps' Failure
Authors:
José González-Cabañas,
Ángel Cuevas,
Rubén Cuevas,
Martin Maier
Abstract:
The currently deployed contact-tracing mobile apps have failed as an efficient solution in the context of the COVID-19 pandemic. None of them has managed to attract the number of active users required to operate efficiently. This urges the research community to re-open the debate and explore new avenues that lead to efficient contact-tracing solutions. This paper contributes to this debate with an alternative contact-tracing solution that leverages already available geolocation information owned by BigTech companies with very large penetration rates in most countries adopting contact-tracing mobile apps. Moreover, our solution provides sufficient privacy guarantees to protect the identity of infected users and to preclude Health Authorities from obtaining the contact graph of individuals.
Submitted 28 March, 2022; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Establishing Trust in Online Advertising with Signed Transactions
Authors:
Antonio Pastor,
Rubén Cuevas,
Ángel Cuevas,
Arturo Azcorra
Abstract:
Programmatic advertising operates one of the most sophisticated and efficient service platforms on the Internet. However, the complexity of this ecosystem is a direct cause of one of the most important problems in online advertising, the lack of transparency. This lack of transparency enables subsequent problems such as advertising fraud, which causes billions of dollars in losses. In this paper we propose Ads.chain, a technological solution to the lack-of-transparency problem in programmatic advertising. Ads.chain extends the current effort of the Interactive Advertising Bureau (IAB) in providing traceability in online advertising through the Ads.txt and Ads.cert solutions, addressing the limitations of these techniques. Ads.chain is (to the best of the authors' knowledge) the first solution that provides end-to-end cryptographic traceability at the ad transaction level. It is a communication protocol that can be seamlessly embedded into ad-tags and the OpenRTB protocol, the de-facto standards for communications in online advertising, allowing an incremental adoption by the industry. We have implemented Ads.chain and made the code publicly available. We assess the performance of Ads.chain through a thorough analysis in a lab environment that emulates a real ad delivery process at real-life throughputs. The obtained results show that Ads.chain can be implemented with limited impact on the hardware resources and marginal delay increments at the publishers lower than 0.20 milliseconds per ad space on webpages and 2.6 milliseconds at the programmatic advertising platforms. These results confirm that Ads.chain's impact on the user experience and the overall operation of the programmatic ad delivery process can be considered negligible.
Submitted 7 January, 2021; v1 submitted 13 May, 2020;
originally announced May 2020.
-
Does Facebook Use Sensitive Data for Advertising Purposes? Worldwide Analysis and GDPR Impact
Authors:
Ángel Cuevas,
José González Cabañas,
Aritz Arrate,
Rubén Cuevas
Abstract:
The recent European General Data Protection Regulation (GDPR) and other data protection regulations restrict the processing of some categories of personal data (health, political orientation, sexual preferences, religious beliefs, ethnic origin, etc.) due to the privacy risks associated with such information. The GDPR refers to these categories as sensitive personal data. This paper quantifies the portion of Facebook (FB) users, across 197 countries, who are labeled with advertising interests linked to potentially sensitive personal data. Our study reveals that Facebook labels 67% of users with potentially sensitive interests. This corresponds to 22% of the population in the referred 197 countries. Moreover, our work shows that the GDPR enforcement had a negligible impact in this context since the portion of FB users labeled with sensitive interests in the European Union remains almost the same 5 months before and 9 months after the GDPR was enacted. The paper also illustrates potential risks associated with the use of sensitive interests. For instance, we quantify the portion of FB users labeled with the interest "Homosexuality" in countries where being gay may be punished with the death penalty. The last contribution is the implementation of a web browser extension that allows FB users to easily remove the potentially sensitive interests FB has assigned them.
Submitted 23 July, 2019;
originally announced July 2019.
-
Beyond content analysis: Detecting targeted ads via distributed counting
Authors:
Costas Iordanou,
Nicolas Kourtellis,
Juan Miguel Carrascosa,
Claudio Soriente,
Ruben Cuevas,
Nikolaos Laoutaris
Abstract:
Being able to check whether an online advertisement has been targeted is essential for resolving privacy controversies and implementing in practice data protection regulations like GDPR, CCPA, and COPPA. In this paper we describe the design, implementation, and deployment of an advertisement auditing system called iWnder that uses crowdsourcing to reveal in real time whether a display advertisement has been targeted or not. Crowdsourcing simplifies the detection of targeted advertising, but requires reporting to a central repository the impressions seen by different users, thereby jeopardising their privacy. We break this deadlock with a privacy preserving data sharing protocol that allows iWnder to compute global statistics required to detect targeting, while keeping the advertisements seen by individual users and their browsing history private. We conduct a simulation study to explore the effect of different parameters and a live validation to demonstrate the accuracy of our approach. Unlike previous solutions, iWnder can even detect indirect targeting, i.e., marketing campaigns that promote a product or service whose description bears no semantic overlap with its targeted audience.
Submitted 23 July, 2019; v1 submitted 3 July, 2019;
originally announced July 2019.
-
Large-scale analysis of user exposure to online advertising in Facebook
Authors:
Aritz Arrate,
José González Cabañas,
Ángel Cuevas,
María Calderón,
Rubén Cuevas
Abstract:
Online advertising is the major source of income for a large portion of Internet Services. There exists a body of literature aiming at optimizing ads engagement, understanding the privacy and ethical implications of online advertising, etc. However, to the best of our knowledge, no previous work analyses at large scale the exposure of real users to online advertising. This paper performs a comprehensive analysis of the exposure of users to ads and advertisers using a dataset including more than 7M ads from 140K unique advertisers delivered to more than 5K users that was collected between October 2016 and May 2018. The study focuses on Facebook, which is the second largest advertising platform in terms of revenue, behind only Google, and accounts for more than 2.2B monthly active users. Our analysis reveals that Facebook users are exposed (in median) to 70 ads per week, which come from 12 advertisers. Ads represent between 10% and 15% of all the information received in users' newsfeed. A small increment of 1% in the portion of ads in the newsfeed could roughly represent a revenue increase of 8.17M USD per week for Facebook. Finally, we also reveal that Facebook users are overprofiled since in the best case only 22.76% of the interests Facebook assigns to users for advertising purposes are actually related to the ads those users receive.
Submitted 26 December, 2018; v1 submitted 27 November, 2018;
originally announced November 2018.
-
Facebook Use of Sensitive Data for Advertising in Europe
Authors:
José González Cabañas,
Ángel Cuevas,
Rubén Cuevas
Abstract:
The upcoming European General Data Protection Regulation (GDPR) prohibits the processing and exploitation of some categories of personal data (health, political orientation, sexual preferences, religious beliefs, ethnic origin, etc.) due to the obvious privacy risks that may be derived from a malicious use of such type of information. These categories are referred to as sensitive personal data. Facebook has been recently fined EUR 1.2M in Spain for collecting, storing and processing sensitive personal data for advertising purposes. This paper quantifies the portion of Facebook users in the European Union (EU) who are labeled with interests linked to sensitive personal data. The results of our study reveal that Facebook labels 73% of EU users with sensitive interests. This corresponds to 40% of the overall EU population. We also estimate that a malicious third party could unveil the identity of Facebook users that have been assigned a sensitive interest at a cost as low as EUR 0.015 per user. Finally, we propose and implement a web browser extension to inform Facebook users of the sensitive interests Facebook has assigned them.
Submitted 14 February, 2018;
originally announced February 2018.
-
Analyzing gender inequality through large-scale Facebook advertising data
Authors:
David Garcia,
Yonas Mitike Kassa,
Angel Cuevas,
Manuel Cebrian,
Esteban Moro,
Iyad Rahwan,
Ruben Cuevas
Abstract:
Online social media are information resources that can have a transformative power in society. While the Web was envisioned as an equalizing force that allows everyone to access information, the digital divide prevents large amounts of people from being present online. Online social media in particular are prone to gender inequality, an important issue given the link between social media use and employment. Understanding gender inequality in social media is a challenging task due to the necessity of data sources that can provide large-scale measurements across multiple countries. Here we show how the Facebook Gender Divide (FGD), a metric based on aggregated statistics of more than 1.4 Billion users in 217 countries, explains various aspects of worldwide gender inequality. Our analysis shows that the FGD encodes gender equality indices in education, health, and economic opportunity. We find gender differences in network externalities that suggest that using social media has an added value for women. Furthermore, we find that low values of the FGD are associated with increases in economic gender equality. Our results suggest that online social networks, while suffering evident gender imbalance, may lower the barriers that women have to access informational resources and help to narrow the economic gender gap.
Submitted 24 March, 2019; v1 submitted 10 October, 2017;
originally announced October 2017.
-
Understanding the evolution of multimedia content in the Internet through BitTorrent glasses
Authors:
Reza Farahbakhsh,
Angel Cuevas,
Ruben Cuevas,
Roberto Gonzalez,
Noel Crespi
Abstract:
Today's Internet traffic is mostly dominated by multimedia content, and the prediction is that this trend will intensify in the future. Therefore, the main Internet players, such as ISPs, content delivery platforms (e.g. YouTube, BitTorrent, Netflix, etc.) or CDN operators, need to understand the evolution of multimedia content availability and popularity in order to adapt their infrastructures and resources to satisfy clients' requirements while minimizing their costs. This paper presents a thorough analysis of the evolution of multimedia content available in BitTorrent. Specifically, we analyze the evolution of four relevant metrics across different content categories: content availability, content popularity, content size and user feedback. To this end we leverage a large-scale dataset formed by 4 snapshots collected from the most popular BitTorrent portal, namely The Pirate Bay, between Nov. 2009 and Feb. 2012. Overall, our dataset comprises more than 160k content items that attracted more than 185M download sessions.
Submitted 1 May, 2017;
originally announced May 2017.
-
Understanding the Detection of View Fraud in Video Content Portals
Authors:
Miriam Marciel,
Ruben Cuevas,
Albert Banchs,
Roberto Gonzalez,
Stefano Traverso,
Mohamed Ahmed,
Arturo Azcorra
Abstract:
While substantial effort has been devoted to understanding fraudulent activity in traditional online advertising (search and banner), more recent forms such as video ads have received little attention. Understanding and identifying fraudulent activity (i.e., fake views) in video ads is complicated for advertisers, as they rely exclusively on the detection mechanisms deployed by video hosting portals. In this context, independent tools able to monitor and audit the fidelity of these systems are missing today and are needed by both industry and regulators.
In this paper we present a first set of tools to serve this purpose. Using our tools, we evaluate the performance of the audit systems of five major online video portals. Our results reveal that YouTube's detection system significantly outperforms all the others. Despite this, a systematic evaluation indicates that it may still be susceptible to simple attacks. Furthermore, we find that YouTube penalizes its videos' public and monetized view counters differently, the former being penalized more aggressively. This means that views identified as fake and discounted from the public view counter are still monetized. We speculate that even though YouTube's policy puts considerable effort into compensating users after an attack is discovered, this practice places the burden of the risk on the advertisers, who pay to get their ads displayed.
Submitted 5 February, 2016; v1 submitted 31 July, 2015;
originally announced July 2015.
-
I Always Feel Like Somebody's Watching Me. Measuring Online Behavioural Advertising
Authors:
J. M. Carrascosa,
J. Mikians,
R. Cuevas,
V. Erramilli,
N. Laoutaris
Abstract:
Online Behavioural targeted Advertising (OBA) has risen in prominence as a method to increase the effectiveness of online advertising. OBA operates by associating tags or labels to users based on their online activity and then using these labels to target them. This rise has been accompanied by privacy concerns from researchers, regulators and the press. In this paper, we present a novel methodology for measuring and understanding OBA in the online advertising market. We rely on training artificial online personas representing behavioural traits like 'cooking', 'movies', 'motor sports', etc. and build a measurement system that is automated, scalable and supports testing of multiple configurations. We observe that OBA is a frequent practice and notice that categories valued more by advertisers are more intensely targeted. In addition, we provide evidence showing that the advertising market targets sensitive topics (e.g., religion or health) despite the existence of regulation that bans such practices. We also compare the volume of OBA advertising for our personas in two different geographical locations (US and Spain) and see little geographic bias in terms of intensity of OBA targeting. Finally, we check for targeting with do-not-track (DNT) enabled and discover that DNT is not yet enforced on the web.
Submitted 9 September, 2015; v1 submitted 19 November, 2014;
originally announced November 2014.
-
Google+ or Google-?: Dissecting the Evolution of the New OSN in its First Year
Authors:
Roberto Gonzalez,
Ruben Cuevas,
Reza Motamedi,
Reza Rejaie,
Angel Cuevas
Abstract:
In an era when Facebook and Twitter dominate the market for social media, Google has introduced Google+ (G+) and reported significant growth in its size, while others have called it a ghost town. This raises the question of whether G+ can really attract a significant number of connected and active users despite the dominance of Facebook and Twitter.
This paper tackles the above question by presenting a detailed characterization of G+ based on large-scale measurements. We identify the main components of the G+ structure and characterize the key features of their users and their evolution over time. We then conduct a detailed analysis of the evolution of connectivity and activity among users in the largest connected component (LCC) of the G+ structure, and compare their characteristics with other major OSNs. We show that despite the dramatic growth in the size of G+, the relative size of the LCC has been decreasing and its connectivity has become less clustered. While the aggregate user activity has gradually increased, only a very small fraction of users exhibit any type of activity. To our knowledge, our study offers the most comprehensive characterization of G+ based on the largest collected data sets.
Submitted 26 March, 2013; v1 submitted 25 May, 2012;
originally announced May 2012.
-
Where are my followers? Understanding the Locality Effect in Twitter
Authors:
Roberto Gonzalez,
Ruben Cuevas,
Angel Cuevas,
Carmen Guerrero
Abstract:
Twitter is one of the most used applications in the current Internet, with more than 200M accounts created so far. Like other large-scale systems, Twitter can benefit from exploiting the Locality effect existing among its users. In this paper we perform the first comprehensive study of the Locality effect of Twitter. For this purpose we have collected the geographical location of around 1M Twitter users and 16M of their followers. Our results demonstrate that language and cultural characteristics determine the level of Locality expected for different countries. Countries whose language is not English, such as Brazil, typically show a high intra-country Locality, whereas those where English is an official or co-official language suffer from an external Locality effect. That is, their users have a larger number of followers in the US than within their own country. There are two reasons for this: first, the US is the dominant country in Twitter, accounting for around half of the users, and second, these countries share a common language and cultural characteristics with the US.
Submitted 18 May, 2011;
originally announced May 2011.
-
TorrentGuard: stopping scam and malware distribution in the BitTorrent ecosystem
Authors:
Michal Kryczka,
Ruben Cuevas,
Roberto Gonzalez,
Angel Cuevas,
Arturo Azcorra
Abstract:
In this paper we conduct a large-scale measurement study in order to analyse the fake content publishing phenomenon in the BitTorrent ecosystem. Our results reveal that fake content represents an important portion (35%) of the files shared in BitTorrent and that just a few tens of users are responsible for 90% of this content. Furthermore, more than 99% of the analysed fake files are linked to either malware or scam websites. This creates a serious threat for the BitTorrent ecosystem. To address this issue, we present a new detection tool named TorrentGuard for the early detection of fake content. Based on our evaluation, this tool may prevent the download of more than 35 million fake files per year. This could help to reduce the number of computer infections and scams suffered by BitTorrent users. TorrentGuard is already available and can be accessed through either a webpage or a Vuze plugin.
Submitted 19 April, 2012; v1 submitted 18 May, 2011;
originally announced May 2011.
-
Is Content Publishing in BitTorrent Altruistic or Profit-Driven
Authors:
Ruben Cuevas,
Michal Kryczka,
Angel Cuevas,
Sebastian Kaune,
Carmen Guerrero,
Reza Rejaie
Abstract:
BitTorrent is the most popular P2P content delivery application, where individual users share various types of content with tens of thousands of other users. The growing popularity of BitTorrent is primarily due to the availability of valuable content without any cost for the consumers. However, apart from the required resources, publishing (sharing) valuable (and often copyrighted) content has serious legal implications for the users who publish the material (or publishers). This raises the question of whether (at least major) content publishers behave in an altruistic fashion or have other incentives, such as financial ones. In this study, we identify the content publishers of more than 55k torrents in 2 major BitTorrent portals and examine their behavior. We demonstrate that a small fraction of publishers are responsible for 66% of published content and 75% of the downloads. Our investigations reveal that these major publishers fit two different profiles. On one hand, antipiracy agencies and malicious publishers publish a large amount of fake files to protect copyrighted content and spread malware, respectively. On the other hand, content publishing in BitTorrent is largely driven by companies with financial incentives. Therefore, if these companies lose their interest or are unable to publish content, BitTorrent traffic/portals may disappear or at least their associated traffic will be significantly reduced.
Submitted 22 July, 2010; v1 submitted 14 July, 2010;
originally announced July 2010.
-
Deep Diving into BitTorrent Locality
Authors:
Ruben Cuevas,
Nikolaos Laoutaris,
Xiaoyuan Yang,
Georgos Siganos,
Pablo Rodriguez
Abstract:
A substantial amount of work has recently gone into localizing BitTorrent traffic within an ISP in order to avoid excessive and oftentimes unnecessary transit costs. Several architectures and systems have been proposed and the initial results from specific ISPs and a few torrents have been encouraging. In this work we attempt to deepen and scale our understanding of locality and its potential. First, looking at specific ISPs, we consider tens of thousands of concurrent torrents, and thus capture ISP-wide implications that cannot be appreciated by looking at only a handful of torrents. Secondly, we go beyond individual case studies and present results for the top 100 ISPs in terms of number of users represented in our dataset of up to 40K torrents involving more than 3.9M concurrent peers and more than 20M in the course of a day, spread across 11K ASes. We develop scalable methodologies that permit us to process this huge dataset and answer questions such as: "what is the minimum and the maximum transit traffic reduction across hundreds of ISPs?", "what are the win-win boundaries for ISPs and their users?", "what is the maximum amount of transit traffic that can be localized without requiring fine-grained control of inter-AS overlay connections?", "what is the impact on transit traffic from upgrades of residential broadband speeds?".
Submitted 1 February, 2011; v1 submitted 22 July, 2009;
originally announced July 2009.