-
Harm Mitigation in Recommender Systems under User Preference Dynamics
Authors:
Jerry Chee,
Shankar Kalyanaraman,
Sindhu Kiranmai Ernala,
Udi Weinsberg,
Sarah Dean,
Stratis Ioannidis
Abstract:
We consider a recommender system that takes into account the interplay between recommendations, the evolution of user interests, and harmful content. We model the impact of recommendations on user behavior, particularly the tendency to consume harmful content. We seek recommendation policies that establish a tradeoff between maximizing click-through rate (CTR) and mitigating harm. We establish conditions under which the user profile dynamics have a stationary point, and propose algorithms for finding an optimal recommendation policy at stationarity. We experiment on a semi-synthetic movie recommendation setting initialized with real data and observe that our policies outperform baselines at simultaneously maximizing CTR and mitigating harm.
Submitted 14 June, 2024;
originally announced June 2024.
-
Efficient Online Crowdsourcing with Complex Annotations
Authors:
Reshef Meir,
Viet-An Nguyen,
Xu Chen,
Jagdish Ramakrishnan,
Udi Weinsberg
Abstract:
Crowdsourcing platforms use various truth discovery algorithms to aggregate annotations from multiple labelers. In an online setting, however, the main challenge is to decide whether to ask for more annotations for each item to efficiently trade off cost (i.e., the number of annotations) for quality of the aggregated annotations. In this paper, we propose a novel approach for general complex annotations (such as bounding boxes and taxonomy paths) that works in an online crowdsourcing setting. We prove that the expected average similarity of a labeler is linear in their accuracy conditional on the reported label. This enables us to infer reported label accuracy in a broad range of scenarios. We conduct extensive evaluations on real-world crowdsourcing data from Meta and show the effectiveness of our proposed online algorithms in improving the cost-quality trade-off.
Submitted 25 January, 2024;
originally announced January 2024.
-
Friend or Faux: Graph-Based Early Detection of Fake Accounts on Social Networks
Authors:
Adam Breuer,
Roee Eilat,
Udi Weinsberg
Abstract:
In this paper, we study the problem of early detection of fake user accounts on social networks based solely on their network connectivity with other users. Removing such accounts is a core task for maintaining the integrity of social networks, and early detection helps to reduce the harm that such accounts inflict. However, new fake accounts are notoriously difficult to detect via graph-based algorithms, as their small number of connections is unlikely to reflect a significant structural difference from those of new real accounts. We present the SybilEdge algorithm, which determines whether a new user is a fake account ('sybil') by aggregating over (I) her choices of friend request targets and (II) these targets' respective responses. SybilEdge performs this aggregation giving more weight to a user's choices of targets to the extent that these targets are preferred by other fakes versus real users, and also to the extent that these targets respond differently to fakes versus real users. We show that SybilEdge rapidly detects new fake users at scale on the Facebook network and outperforms state-of-the-art algorithms. We also show that SybilEdge is robust to label noise in the training data, to different prevalences of fake accounts in the network, and to several different ways fakes can select targets for their friend requests. To our knowledge, this is the first time a graph-based algorithm has been shown to achieve high performance (AUC>0.9) on new users who have only sent a small number of friend requests.
Submitted 9 April, 2020;
originally announced April 2020.
-
PNP: Fast Path Ensemble Method for Movie Design
Authors:
Danai Koutra,
Abhilash Dighe,
Smriti Bhagat,
Udi Weinsberg,
Stratis Ioannidis,
Christos Faloutsos,
Jean Bolot
Abstract:
How can we design a product or movie that will attract, for example, the interest of Pennsylvania adolescents or liberal newspaper critics? What should be the genre of that movie and who should be in the cast? In this work, we seek to identify how we can design new movies with features tailored to a specific user population. We formulate the movie design as an optimization problem over the inference of user-feature scores and selection of the features that maximize the number of attracted users. Our approach, PNP, is based on a heterogeneous, tripartite graph of users, movies and features (e.g., actors, directors, genres), where users rate movies and features contribute to movies. We learn the preferences by leveraging user similarities defined through different types of relations, and show that our method outperforms state-of-the-art approaches, including matrix factorization and other heterogeneous graph-based analysis. We evaluate PNP on publicly available real-world data and show that it is highly scalable and effectively provides movie designs oriented towards different groups of users, including men, women, and adolescents.
Submitted 7 November, 2016;
originally announced November 2016.
-
The Shapley Value in Knapsack Budgeted Games
Authors:
Smriti Bhagat,
Anthony Kim,
S. Muthukrishnan,
Udi Weinsberg
Abstract:
We propose the study of computing the Shapley value for a new class of cooperative games that we call budgeted games, and investigate in particular knapsack budgeted games, a version modeled after the classical knapsack problem. In these games, the "value" of a set $S$ of agents is determined only by a critical subset $T\subseteq S$ of the agents and not the entirety of $S$ due to a budget constraint that limits how large $T$ can be. We show that the Shapley value can be computed in time faster than by the naïve exponential time algorithm when there are sufficiently many agents, and also provide an algorithm that approximates the Shapley value within an additive error. For a related budgeted game associated with a greedy heuristic, we show that the Shapley value can be computed in pseudo-polynomial time. Furthermore, we generalize our proof techniques and propose what we term an algorithmic representation framework that captures a broad class of cooperative games with the property of efficient computation of the Shapley value. The main idea is that the problem of determining the efficient computation can be reduced to that of finding an alternative representation of the games and an associated algorithm for computing the underlying value function with small time and space complexities in the representation size.
Submitted 18 September, 2014;
originally announced September 2014.
-
Privacy Tradeoffs in Predictive Analytics
Authors:
Stratis Ioannidis,
Andrea Montanari,
Udi Weinsberg,
Smriti Bhagat,
Nadia Fawaz,
Nina Taft
Abstract:
Online services routinely mine user data to predict user preferences, make recommendations, and place targeted ads. Recent research has demonstrated that several private user attributes (such as political affiliation, sexual orientation, and gender) can be inferred from such data. Can a privacy-conscious user benefit from personalization while simultaneously protecting her private attributes? We study this question in the context of a rating prediction service based on matrix factorization. We construct a protocol of interactions between the service and users that has remarkable optimality properties: it is privacy-preserving, in that no inference algorithm can succeed in inferring a user's private attribute with a probability better than random guessing; it has maximal accuracy, in that no other privacy-preserving protocol improves rating prediction; and, finally, it involves a minimal disclosure, as the prediction accuracy strictly decreases when the service reveals less information. We extensively evaluate our protocol using several rating datasets, demonstrating that it successfully blocks the inference of gender, age and political affiliation, while incurring less than 5% decrease in the accuracy of rating prediction.
Submitted 31 March, 2014;
originally announced March 2014.
-
Recommending with an Agenda: Active Learning of Private Attributes using Matrix Factorization
Authors:
Smriti Bhagat,
Udi Weinsberg,
Stratis Ioannidis,
Nina Taft
Abstract:
Recommender systems leverage user demographic information, such as age, gender, etc., to personalize recommendations and better place their targeted ads. Oftentimes, users do not volunteer this information due to privacy concerns, or due to a lack of initiative in filling out their online profiles. We illustrate a new threat in which a recommender learns private attributes of users who do not voluntarily disclose them. We design both passive and active attacks that solicit ratings for strategically selected items, and could thus be used by a recommender system to pursue this hidden agenda. Our methods are based on a novel usage of Bayesian matrix factorization in an active learning setting. Evaluations on multiple datasets illustrate that such attacks are indeed feasible and use significantly fewer rated items than static inference methods. Importantly, they succeed without sacrificing the quality of recommendations to users.
Submitted 30 July, 2014; v1 submitted 26 November, 2013;
originally announced November 2013.
-
CARE: Content Aware Redundancy Elimination for Disaster Communications on Damaged Networks
Authors:
Udi Weinsberg,
Athula Balachandran,
Nina Taft,
Gianluca Iannaccone,
Vyas Sekar,
Srinivasan Seshan
Abstract:
During a disaster scenario, situational awareness information, such as location, physical status and images of the surrounding area, is essential for minimizing loss of life, injury, and property damage. Today's handhelds make it easy for people to gather data from within the disaster area in many formats, including text, images and video. Studies show that the extreme anxiety induced by disasters causes humans to create a substantial amount of repetitive and redundant content. Transporting this content outside the disaster zone can be problematic when the network infrastructure is disrupted by the disaster.
This paper presents the design of a novel architecture called CARE (Content-Aware Redundancy Elimination) for better utilizing network resources in disaster-affected regions. Motivated by measurement-driven insights on redundancy patterns found in real-world disaster area photos, we demonstrate that CARE can detect the semantic similarity between photos in the networking layer, thus reducing redundant transfers and improving buffer utilization. Using DTN simulations, we explore the boundaries of the usefulness of deploying CARE on a damaged network, and show that CARE can reduce packet delivery times and drops, and enables 20-40% more unique information to reach the rescue teams outside the disaster area than when CARE is not deployed.
Submitted 8 June, 2012;
originally announced June 2012.
-
Topological Trends of Internet Content Providers
Authors:
Yuval Shavitt,
Udi Weinsberg
Abstract:
The Internet is constantly changing, and its hierarchy was recently shown to become flatter. Recent studies of inter-domain traffic showed that large content providers drive this change by bypassing tier-1 networks and reaching closer to their users, enabling them to save transit costs and reduce reliance on transit networks as new services are being deployed, and traffic shaping is becoming increasingly popular.
In this paper we take a first look at the evolving connectivity of large content provider networks, from a topological point of view of the autonomous systems (AS) graph. We perform a 5-year longitudinal study of the topological trends of large content providers, by analyzing several large content providers and comparing these trends to those observed for large tier-1 networks. We study trends in the connectivity of the networks, neighbor diversity and geographical spread, their hierarchy, the adoption of IXPs as a convenient method for peering, and their centrality. Our observations indicate that content providers gradually increase and diversify their connectivity, enabling them to improve their centrality in the graph, and as a result, tier-1 networks lose dominance over time.
Submitted 4 January, 2012;
originally announced January 2012.
-
On the Dynamics of IP Address Allocation and Availability of End-Hosts
Authors:
Oded Argon,
Anat Bremler-Barr,
Osnat Mokryn,
Dvir Schirman,
Yuval Shavitt,
Udi Weinsberg
Abstract:
The availability of end-hosts and their assigned routable IP addresses has an impact on the ability to fight spammers and attackers, and on peer-to-peer application performance. Previous works study the availability of hosts mostly by using either active pinging or by studying access to a mail service; both approaches suffer from inherent inaccuracies. We take a different approach by measuring the IP addresses periodically reported by a uniquely identified group of the hosts running the DIMES agent. This fresh approach provides a chance to measure the true availability of end-hosts and the dynamics of their assigned routable IP addresses. Using a two-month study of 1804 hosts, we find that over 60% of the hosts have a fixed IP address and 90% median availability, while some of the remaining hosts have more than 30 different IPs. For those that have periodically changing IP addresses, we find that the median average period per AS is roughly 24 hours, with a strong relation between the offline time and the probability of altering IP address.
Submitted 10 November, 2010;
originally announced November 2010.
-
Near-Deterministic Inference of AS Relationships
Authors:
Yuval Shavitt,
Eran Shir,
Udi Weinsberg
Abstract:
The discovery of Autonomous Systems (ASes) interconnections and the inference of their commercial Type-of-Relationships (ToR) have been extensively studied during the last few years. The main motivation is to accurately calculate AS-level paths and to provide a better topological view of the Internet. An inherent problem in current algorithms is their extensive use of heuristics. Such heuristics incur unbounded errors which are spread over all inferred relationships. We propose a near-deterministic algorithm for solving the ToR inference problem. Our algorithm uses as input the Internet core, which is a dense sub-graph of top-level ASes. We test several methods for creating such a core and demonstrate the robustness of the algorithm to the core's size and density, the inference period, and errors in the core.
We evaluate our algorithm using AS-level paths collected from RouteViews BGP paths and DIMES traceroute measurements. Our proposed algorithm deterministically infers over 95% of the approximately 58,000 AS topology links. The inference becomes stable when using a week's worth of data and as few as 20 ASes in the core. The algorithm infers 2-3 times more peer-to-peer relationships in edges discovered only by DIMES than in RouteViews edges, validating the DIMES promise to discover periphery AS edges.
Submitted 28 November, 2007;
originally announced November 2007.