-
Unbundle-Rewrite-Rebundle: Runtime Detection and Rewriting of Privacy-Harming Code in JavaScript Bundles
Authors:
Mir Masood Ali,
Peter Snyder,
Chris Kanich,
Hamed Haddadi
Abstract:
This work presents Unbundle-Rewrite-Rebundle (URR), a system for detecting privacy-harming portions of bundled JavaScript code, and rewriting that code at runtime to remove the privacy harming behavior without breaking the surrounding code or overall application. URR is a novel solution to the problem of JavaScript bundles, where websites pre-compile multiple code units into a single file, making…
▽ More
This work presents Unbundle-Rewrite-Rebundle (URR), a system for detecting privacy-harming portions of bundled JavaScript code, and rewriting that code at runtime to remove the privacy harming behavior without breaking the surrounding code or overall application. URR is a novel solution to the problem of JavaScript bundles, where websites pre-compile multiple code units into a single file, making it impossible for content filters and ad-blockers to differentiate between desired and unwanted resources. Where traditional content filtering tools rely on URLs, URR analyzes the code at the AST level, and replaces harmful AST sub-trees with privacy-and-functionality maintaining alternatives.
We present an open-sourced implementation of URR as a Firefox extension, and evaluate it against JavaScript bundles generated by the most popular bundling system (Webpack) deployed on the Tranco 10k. We measure the performance, measured by precision (1.00), recall (0.95), and speed (0.43s per-script) when detecting and rewriting three representative privacy harming libraries often included in JavaScript bundles, and find URR to be an effective approach to a large-and-growing blind spot unaddressed by current privacy tools.
△ Less
Submitted 7 May, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
Internet Sanctions on Russian Media: Actions and Effects
Authors:
John Kristoff,
Moritz Müller,
Arturo Filastò,
Max Resing,
Chris Kanich,
Niels ten Oever
Abstract:
As a response to the Russian aggression against Ukraine, the European Union (EU), through the notion of "digital sovereignty", imposed sanctions on organizations and individuals affiliated with the Russian Federation that prohibit broadcasting content, including online distribution. In this paper, we interrogate the implementation of these sanctions and interpret them as a means to translate the u…
▽ More
As a response to the Russian aggression against Ukraine, the European Union (EU), through the notion of "digital sovereignty", imposed sanctions on organizations and individuals affiliated with the Russian Federation that prohibit broadcasting content, including online distribution. In this paper, we interrogate the implementation of these sanctions and interpret them as a means to translate the union of states' governmental edicts into effective technical countermeasures. Through longitudinal traffic analysis, we construct an understanding of how ISPs in different EU countries attempted to enforce these sanctions, and compare these implementations to similar measures in other western countries. We find a wide variation of blocking coverage, both internationally and within individual member states. We draw the conclusion that digital sovereignty through sanctions in the EU has a concrete but distinctly limited impact on information flows.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
Honesty is the Best Policy: On the Accuracy of Apple Privacy Labels Compared to Apps' Privacy Policies
Authors:
Mir Masood Ali,
David G. Balash,
Monica Kodwani,
Chris Kanich,
Adam J. Aviv
Abstract:
Apple introduced privacy labels in Dec. 2020 as a way for developers to report the privacy behaviors of their apps. While Apple does not validate labels, they also require developers to provide a privacy policy, which offers an important comparison point. In this paper, we fine-tuned BERT-based language models to extract privacy policy features for 474,669 apps on the iOS App Store, comparing the…
▽ More
Apple introduced privacy labels in Dec. 2020 as a way for developers to report the privacy behaviors of their apps. While Apple does not validate labels, they also require developers to provide a privacy policy, which offers an important comparison point. In this paper, we fine-tuned BERT-based language models to extract privacy policy features for 474,669 apps on the iOS App Store, comparing the output to the privacy labels. We identify discrepancies between the policies and the labels, particularly as they relate to data collected linked to users. We find that 228K apps' privacy policies may indicate data collection linked to users than what is reported in the privacy labels. More alarming, a large number (97%) of the apps with a Data Not Collected privacy label have a privacy policy indicating otherwise. We provide insights into potential sources for discrepancies, including the use of templates and confusion around Apple's definitions and requirements. These results suggest that significant work is still needed to help developers more accurately label their apps. Our system can be incorporated as a first-order check to inform developers when privacy labels are possibly misapplied.
△ Less
Submitted 16 June, 2024; v1 submitted 29 June, 2023;
originally announced June 2023.
-
SoK: A Data-driven View on Methods to Detect Reflective Amplification DDoS Attacks Using Honeypots
Authors:
Marcin Nawrocki,
John Kristoff,
Raphael Hiesgen,
Chris Kanich,
Thomas C. Schmidt,
Matthias Wählisch
Abstract:
In this paper, we revisit the use of honeypots for detecting reflective amplification attacks. These measurement tools require careful design of both data collection and data analysis including cautious threshold inference. We survey common amplification honeypot platforms as well as the underlying methods to infer attack detection thresholds and to extract knowledge from the data. By systematical…
▽ More
In this paper, we revisit the use of honeypots for detecting reflective amplification attacks. These measurement tools require careful design of both data collection and data analysis including cautious threshold inference. We survey common amplification honeypot platforms as well as the underlying methods to infer attack detection thresholds and to extract knowledge from the data. By systematically exploring the threshold space, we find most honeypot platforms produce comparable results despite their different configurations. Moreover, by applying data from a large-scale honeypot deployment, network telescopes, and a real-world baseline obtained from a leading DDoS mitigation provider, we question the fundamental assumption of honeypot research that convergence of observations can imply their completeness. Conclusively we derive guidance on precise, reproducible honeypot research, and present open challenges.
△ Less
Submitted 24 April, 2023; v1 submitted 9 February, 2023;
originally announced February 2023.
-
Longitudinal Analysis of Privacy Labels in the Apple App Store
Authors:
David G. Balash,
Mir Masood Ali,
Xiaoyuan Wu,
Chris Kanich,
Adam J. Aviv
Abstract:
In December of 2020, Apple started to require app developers to self-report privacy label annotations on their apps indicating what data is collected and how it is used.To understand the adoption and shifts in privacy labels in the App Store, we collected nearly weekly snapshots of over 1.6 million apps for over a year (July 15, 2021 -- October 25, 2022) to understand the dynamics of privacy label…
▽ More
In December of 2020, Apple started to require app developers to self-report privacy label annotations on their apps indicating what data is collected and how it is used.To understand the adoption and shifts in privacy labels in the App Store, we collected nearly weekly snapshots of over 1.6 million apps for over a year (July 15, 2021 -- October 25, 2022) to understand the dynamics of privacy label ecosystem. Nearly two years after privacy labels launched, only 70.1% of apps have privacy labels, but we observed an increase of 28% during the measurement period. Privacy label adoption rates are mostly driven by new apps rather than older apps coming into compliance. Of apps with labels, 18.1% collect data used to track users, 38.1% collect data that is linked to a user identity, and 42.0% collect data that is not linked. A surprisingly large share (41.8%) of apps with labels indicate that they do not collect any data, and while we do not perform direct analysis of the apps to verify this claim, we observe that it is likely that many of these apps are choosing a Does Not Collect label due to being forced to select a label, rather than this being the true behavior of the app. Moreover, for apps that have assigned labels during the measurement period nearly all do not change their labels, and when they do, the new labels indicate more data collection than less. This suggests that privacy labels may be a ``set once'' mechanism for developers that may not actually provide users with the clarity needed to make informed privacy decisions.
△ Less
Submitted 29 March, 2023; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Fair Decision-Making for Food Inspections
Authors:
Shubham Singh,
Bhuvni Shah,
Chris Kanich,
Ian A. Kash
Abstract:
Data and algorithms are essential and complementary parts of a large-scale decision-making process. However, their injudicious use can lead to unforeseen consequences, as has been observed by researchers and activists alike in the recent past. In this paper, we revisit the application of predictive models by the Chicago Department of Public Health to schedule restaurant inspections and prioritize…
▽ More
Data and algorithms are essential and complementary parts of a large-scale decision-making process. However, their injudicious use can lead to unforeseen consequences, as has been observed by researchers and activists alike in the recent past. In this paper, we revisit the application of predictive models by the Chicago Department of Public Health to schedule restaurant inspections and prioritize the detection of critical food code violations. We perform the first analysis of the model's fairness to the population served by the restaurants in terms of average time to find a critical violation. We find that the model treats inspections unequally based on the sanitarian who conducted the inspection and that, in turn, there are geographic disparities in the benefits of the model. We examine four alternate methods of model training and two alternative ways of scheduling using the model and find that the latter generate more desirable results. The challenges from this application point to important directions for future work around fairness with collective entities rather than individuals, the use of critical violations as a proxy, and the disconnect between fair classification and fairness in the dynamic scheduling system.
△ Less
Submitted 9 March, 2022; v1 submitted 12 August, 2021;
originally announced August 2021.
-
Most Websites Don't Need to Vibrate: A Cost-Benefit Approach to Improving Browser Security
Authors:
Peter Snyder,
Cynthia Taylor,
Chris Kanich
Abstract:
Modern web browsers have accrued an incredibly broad set of features since being invented for hypermedia dissemination in 1990. Many of these features benefit users by enabling new types of web applications. However, some features also bring risk to users' privacy and security, whether through implementation error, unexpected composition, or unintended use. Currently there is no general methodolog…
▽ More
Modern web browsers have accrued an incredibly broad set of features since being invented for hypermedia dissemination in 1990. Many of these features benefit users by enabling new types of web applications. However, some features also bring risk to users' privacy and security, whether through implementation error, unexpected composition, or unintended use. Currently there is no general methodology for weighing these costs and benefits. Restricting access to only the features which are necessary for delivering desired functionality on a given website would allow users to enforce the principle of lease privilege on use of the myriad APIs present in the modern web browser.
However, security benefits gained by increasing restrictions must be balanced against the risk of breaking existing websites. This work addresses this problem with a methodology for weighing the costs and benefits of giving websites default access to each browser feature. We model the benefit as the number of websites that require the feature for some user-visible benefit, and the cost as the number of CVEs, lines of code, and academic attacks related to the functionality. We then apply this methodology to 74 Web API standards implemented in modern browsers. We find that allowing websites default access to large parts of the Web API poses significant security and privacy risks, with little corresponding benefit.
We also introduce a configurable browser extension that allows users to selectively restrict access to low-benefit, high-risk features on a per site basis. We evaluated our extension with two hardened browser configurations, and found that blocking 15 of the 74 standards avoids 52.0% of code paths related to previous CVEs, and 50.0% of implementation code identified by our metric, without affecting the functionality of 94.7% of measured websites.
△ Less
Submitted 4 September, 2017; v1 submitted 28 August, 2017;
originally announced August 2017.
-
Network Model Selection for Task-Focused Attributed Network Inference
Authors:
Ivan Brugere,
Chris Kanich,
Tanya Y. Berger-Wolf
Abstract:
Networks are models representing relationships between entities. Often these relationships are explicitly given, or we must learn a representation which generalizes and predicts observed behavior in underlying individual data (e.g. attributes or labels). Whether given or inferred, choosing the best representation affects subsequent tasks and questions on the network. This work focuses on model sel…
▽ More
Networks are models representing relationships between entities. Often these relationships are explicitly given, or we must learn a representation which generalizes and predicts observed behavior in underlying individual data (e.g. attributes or labels). Whether given or inferred, choosing the best representation affects subsequent tasks and questions on the network. This work focuses on model selection to evaluate network representations from data, focusing on fundamental predictive tasks on networks. We present a modular methodology using general, interpretable network models, task neighborhood functions found across domains, and several criteria for robust model selection. We demonstrate our methodology on three online user activity datasets and show that network model selection for the appropriate network task vs. an alternate task increases performance by an order of magnitude in our experiments.
△ Less
Submitted 16 September, 2017; v1 submitted 21 August, 2017;
originally announced August 2017.
-
Evaluating Social Networks Using Task-Focused Network Inference
Authors:
Ivan Brugere,
Chris Kanich,
Tanya Y. Berger-Wolf
Abstract:
Networks are representations of complex underlying social processes. However, the same given network may be more suitable to model one behavior of individuals than another. In many cases, aggregate population models may be more effective than modeling on the network. We present a general framework for evaluating the suitability of given networks for a set of predictive tasks of interest, compared…
▽ More
Networks are representations of complex underlying social processes. However, the same given network may be more suitable to model one behavior of individuals than another. In many cases, aggregate population models may be more effective than modeling on the network. We present a general framework for evaluating the suitability of given networks for a set of predictive tasks of interest, compared against alternative, networks inferred from data. We present several interpretable network models and measures for our comparison. We apply this general framework to the case study on collective classification of music preferences in a newly available dataset of the Last.fm social network.
△ Less
Submitted 7 July, 2017;
originally announced July 2017.
-
A General Framework For Task-Oriented Network Inference
Authors:
Ivan Brugere,
Chris Kanich,
Tanya Y. Berger-Wolf
Abstract:
We present a brief introduction to a flexible, general network inference framework which models data as a network space, sampled to optimize network structure to a particular task. We introduce a formal problem statement related to influence maximization in networks, where the network structure is not given as input, but learned jointly with an influence maximization solution.
We present a brief introduction to a flexible, general network inference framework which models data as a network space, sampled to optimize network structure to a particular task. We introduce a formal problem statement related to influence maximization in networks, where the network structure is not given as input, but learned jointly with an influence maximization solution.
△ Less
Submitted 1 May, 2017;
originally announced May 2017.
-
Browser Feature Usage on the Modern Web
Authors:
Peter Snyder,
Lara Ansari,
Cynthia Taylor,
Chris Kanich
Abstract:
Modern web browsers are incredibly complex, with millions of lines of code and over one thousand JavaScript functions and properties available to website authors. This work investigates how these browser features are used on the modern, open web. We find that JavaScript features differ wildly in popularity, with over 50% of provided features never used in the Alexa 10k.
We also look at how popul…
▽ More
Modern web browsers are incredibly complex, with millions of lines of code and over one thousand JavaScript functions and properties available to website authors. This work investigates how these browser features are used on the modern, open web. We find that JavaScript features differ wildly in popularity, with over 50% of provided features never used in the Alexa 10k.
We also look at how popular ad and tracking blockers change the distribution of features used by sites, and identify a set of approximately 10% of features that are disproportionately blocked (prevented from executing by these extensions at least 90% of the time they are used). We additionally find that in the presence of these blockers, over 83% of available features are executed on less than 1% of the most popular 10,000 websites.
We additionally measure a variety of aspects of browser feature usage on the web, including how complex sites have become in terms of feature usage, how the length of time a browser feature has been in the browser relates to its usage on the web, and how many security vulnerabilities have been associated with related browser features.
△ Less
Submitted 20 May, 2016;
originally announced May 2016.