Search | arXiv e-print repository

ProxyGPT: Enabling Anonymous Queries in AI Chatbots with (Un)Trustworthy Browser Proxies

Authors: Dzung Pham, Jade Sheffey, Chau Minh Pham, Amir Houmansadr

Abstract: AI-powered chatbots (ChatGPT, Claude, etc.) require users to create an account using their email and phone number, thereby linking their personally identifiable information to their conversational data and usage patterns. As these chatbots are increasingly being used for tasks involving sensitive information, privacy concerns have been raised about how chatbot providers handle user data. To address these concerns, we present ProxyGPT, a privacy-enhancing system that enables anonymous queries in popular chatbot platforms. ProxyGPT leverages volunteer proxies to submit user queries on their behalf, thus providing network-level anonymity for chatbot users. The system is designed to support key security properties such as content integrity via TLS-backed data provenance, end-to-end encryption, and anonymous payment, while also ensuring usability and sustainability. We provide a thorough analysis of the privacy, security, and integrity of our system and identify various future research directions, particularly in the area of private chatbot query synthesis. Our human evaluation shows that ProxyGPT offers users a greater sense of privacy compared to traditional AI chatbots, especially in scenarios where users are hesitant to share their identity with chatbot providers. Although our proof-of-concept has higher latency than popular chatbots, our human interview participants consider this to be an acceptable trade-off for anonymity. To the best of our knowledge, ProxyGPT is the first comprehensive proxy-based solution for privacy-preserving AI chatbots.
Our codebase is available at https://github.com/dzungvpham/proxygpt.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2406.15213

Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors

Authors: Ali Naseh, Jaechul Roh, Eugene Bagdasaryan, Amir Houmansadr

Abstract: Recent advances in large text-conditional image generative models such as Stable Diffusion, Midjourney, and DALL-E 3 have revolutionized the field of image generation, allowing users to produce high-quality, realistic images from textual prompts. While these developments have enhanced artistic creation and visual communication, they also present an underexplored attack opportunity: the possibility of inducing biases by an adversary into the generated images for malicious intentions, e.g., to influence society and spread propaganda. In this paper, we demonstrate the possibility of such a bias injection threat by an adversary who backdoors such models with a small number of malicious data samples; the implemented backdoor is activated when special triggers exist in the input prompt of the backdoored models. On the other hand, the model's utility is preserved in the absence of the triggers, making the attack highly undetectable. We present a novel framework that enables efficient generation of poisoning samples with composite (multi-word) triggers for such an attack. Our extensive experiments using over 1 million generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack. We illustrate how these biases can bypass conventional detection mechanisms, highlighting the challenges in proving the existence of biases within operational constraints.
Our cost analysis confirms the low financial barrier to executing such attacks, underscoring the need for robust defensive strategies against such vulnerabilities in text-to-image generation models.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14517

PostMark: A Robust Blackbox Watermark for Large Language Models

Authors: Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Wieting, Mohit Iyyer

Abstract: The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.

Submitted 20 June, 2024; originally announced June 2024.

Comments: preprint; 18 pages, 5 figures

arXiv:2406.05927

MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification

Authors: Sajjad Amini, Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr

Abstract: We present a simple yet effective method to improve the robustness of Convolutional Neural Networks (CNNs) against adversarial examples by post-processing an adversarially trained model. Our technique, MeanSparse, cascades the activation functions of a trained model with novel operators that sparsify mean-centered feature vectors. This is equivalent to reducing feature variations around the mean, and we show that such reduced variations barely affect the model's utility, yet they strongly attenuate the adversarial perturbations and decrease the attacker's success rate. Our experiments show that, when applied to the top models in the RobustBench leaderboard, it achieves a new robustness record of 72.08% (from 71.07%) and 59.64% (from 59.56%) on CIFAR-10 and ImageNet, respectively, in terms of AutoAttack accuracy. Code is available at https://github.com/SPIN-UMass/MeanSparse

Submitted 9 June, 2024; originally announced June 2024.
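
The idea of sparsifying mean-centered features in the MeanSparse abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the thresholding rule (snap a value to its mean when it lies within `alpha` standard deviations), the per-feature statistics, and all names below are assumptions made for illustration only.

```python
def mean_sparse(features, mean, std, alpha):
    """Hypothetical operator: suppress small variations around the mean.

    A feature whose deviation from its mean is at most alpha * std is
    replaced by the mean, so the mean-centered vector becomes sparse.
    Larger deviations pass through unchanged.
    """
    out = []
    for x, mu, sigma in zip(features, mean, std):
        if abs(x - mu) <= alpha * sigma:
            out.append(mu)   # small variation: blocked
        else:
            out.append(x)    # large variation: kept as-is
    return out


# Toy values (assumed, not from the paper).
features = [0.9, 2.5, -0.05]
mean = [1.0, 1.0, 0.0]
std = [0.5, 0.5, 0.2]

sparsified = mean_sparse(features, mean, std, alpha=0.5)
centered = [x - mu for x, mu in zip(sparsified, mean)]
print(sparsified)  # values after the operator
print(centered)    # mean-centered view; blocked entries are zero
```

In this toy run only the second feature deviates enough from its mean to survive, which is the sense in which small perturbations around the mean would be attenuated.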

arXiv:2405.16978

OSLO: One-Shot Label-Only Membership Inference Attacks

Authors: Yuefeng Peng, Jaechul Roh, Subhransu Maji, Amir Houmansadr

Abstract: We introduce One-Shot Label-Only (OSLO) membership inference attacks (MIAs), which accurately infer a given sample's membership in a target model's training set with high precision using just \emph{a single query}, where the target model only returns the predicted hard label. This is in contrast to state-of-the-art label-only attacks which require $\sim6000$ queries, yet get attack precisions lower than OSLO's. OSLO leverages transfer-based black-box adversarial attacks. The core idea is that a member sample exhibits more resistance to adversarial perturbations than a non-member. We compare OSLO against state-of-the-art label-only attacks and demonstrate that, despite requiring only one query, our method significantly outperforms previous attacks in terms of precision and true positive rate (TPR) under the same false positive rates (FPR). For example, compared to previous label-only MIAs, OSLO achieves a TPR that is 7$\times$ to 28$\times$ stronger under a 0.1\% FPR on CIFAR10 for a ResNet model. We evaluated multiple defense mechanisms against OSLO.

Submitted 27 May, 2024; originally announced May 2024.
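
The core decision rule stated in the OSLO abstract above (a member resists an adversarial perturbation, and only one hard-label query is made) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's attack: the one-dimensional target model, the fixed perturbation budget, and the function names are all hypothetical, and the real method crafts perturbations with transfer-based attacks on surrogate models.

```python
def target_model(v):
    """Hypothetical black-box target: returns only a hard label."""
    return 1 if v > 0.0 else 0


def craft_on_surrogate(v, budget=1.0):
    """Stand-in for a transfer-based adversarial example.

    Assumed to be computed offline, without querying the target.
    Here it just shifts the sample toward the other class by `budget`.
    """
    return v - budget


def infer_membership(query_label, x_adv, true_label):
    """One hard-label query; True means 'predicted member'.

    If the perturbed sample keeps its label, the sample resisted the
    perturbation and is guessed to be a training member.
    """
    return query_label(x_adv) == true_label


# Two toy samples with true label 1: one far from the decision
# boundary (member-like) and one close to it (non-member-like).
samples = [2.0, 0.5]
guesses = [
    infer_membership(target_model, craft_on_surrogate(x), true_label=1)
    for x in samples
]
print(guesses)
```

Each sample costs exactly one call to `target_model`, which mirrors the single-query setting described in the abstract.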

arXiv:2404.13784

Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images

Authors: Ali Naseh, Katherine Thai, Mohit Iyyer, Amir Houmansadr

Abstract: With the digital imagery landscape rapidly evolving, image stocks and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside innovative platforms that trade in prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3 and Midjourney. This paper studies the possibility of employing multi-modal models with enhanced visual understanding to mimic the outputs of these platforms, introducing an original attack strategy. Our method leverages fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those available in marketplaces and from premium stock image providers, yet at a markedly lower expense. In presenting this strategy, we aim to spotlight a new class of economic and security considerations within the realm of digital imagery. Our findings, supported by both automated metrics and human assessment, reveal that comparable visual content can be produced for a fraction of the prevailing market prices ($0.23 - $0.27 per image), emphasizing the need for awareness and strategic discussions about the integrity of digital media in an increasingly AI-integrated landscape. Our work also contributes to the field by assembling a dataset consisting of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly.

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2403.06319

Fake or Compromised? Making Sense of Malicious Clients in Federated Learning

Authors: Hamid Mozaffari, Sunav Choudhary, Amir Houmansadr

Abstract: Federated learning (FL) is a distributed machine learning paradigm that enables training models on decentralized data. The field of FL security against poisoning attacks is plagued with confusion due to the proliferation of research that makes different assumptions about the capabilities of adversaries and the adversary models they operate under. Our work aims to clarify this confusion by presenting a comprehensive analysis of the various poisoning attacks and defensive aggregation rules (AGRs) proposed in the literature, and connecting them under a common framework. To connect existing adversary models, we present a hybrid adversary model, which lies in the middle of the spectrum of adversaries, where the adversary compromises a few clients, trains a generative (e.g., DDPM) model with their compromised samples, and generates new synthetic data to solve an optimization for a stronger (e.g., cheaper, more practical) attack against different robust aggregation rules. By presenting the spectrum of FL adversaries, we aim to provide practitioners and researchers with a clear understanding of the different types of threats they need to consider when designing FL systems, and identify areas where further research is needed.

Submitted 10 March, 2024; originally announced March 2024.
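
For readers unfamiliar with the term, a defensive aggregation rule (AGR) as mentioned in the abstract above can be sketched with a small example. The abstract does not name specific rules; coordinate-wise median is used here only as a well-known robust rule, and the update values and function names are assumptions for illustration.

```python
from statistics import mean, median


def aggregate(updates, rule):
    """Combine client updates coordinate by coordinate.

    `updates` is a list of equal-length lists (one per client) and
    `rule` maps the values of one coordinate to a single number.
    """
    return [rule(coords) for coords in zip(*updates)]


# Three benign client updates plus one poisoned update (toy values).
benign = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.21]]
poisoned = benign + [[10.0, 10.0]]

avg = aggregate(poisoned, mean)    # plain averaging
med = aggregate(poisoned, median)  # coordinate-wise median

print("mean:  ", avg)
print("median:", med)
```

With one malicious client out of four, the averaged update is pulled far from the benign values while the median stays close to them, which is the kind of robustness that poisoning attacks against AGRs try to defeat.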

arXiv:2403.02437

SoK: Challenges and Opportunities in Federated Unlearning

Authors: Hyejun Jeong, Shiqing Ma, Amir Houmansadr

Abstract: Federated learning (FL), introduced in 2017, facilitates collaborative learning between non-trusting parties with no need for the parties to explicitly share their data among themselves. This allows training models on user data while respecting privacy regulations such as GDPR and CPRA. However, emerging privacy requirements may mandate model owners to be able to \emph{forget} some learned data, e.g., when requested by data owners or law enforcement. This has given birth to an active field of research called \emph{machine unlearning}. In the context of FL, many techniques developed for unlearning in centralized settings are not trivially applicable! This is due to the unique differences between centralized and distributed learning, in particular, interactivity, stochasticity, heterogeneity, and limited accessibility in FL. In response, a recent line of work has focused on developing unlearning mechanisms tailored to FL. This SoK paper aims to take a deep look at the \emph{federated unlearning} literature, with the goal of identifying research trends and challenges in this emerging field. By carefully categorizing papers published on FL unlearning (since 2020), we aim to pinpoint the unique complexities of federated unlearning, highlighting limitations on directly applying centralized unlearning methods. We compare existing federated unlearning methods regarding influence removal and performance recovery, compare their threat models and assumptions, and discuss their implications and limitations.
For instance, we analyze the experimental setup of FL unlearning studies from various perspectives, including data heterogeneity and its simulation, the datasets used for demonstration, and evaluation metrics. Our work aims to offer insights and suggestions for future research on federated unlearning.

Submitted 5 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2312.07550

Understanding (Un)Intended Memorization in Text-to-Image Generative Models

Authors: Ali Naseh, Jaechul Roh, Amir Houmansadr

Abstract: Multimodal machine learning, especially text-to-image models like Stable Diffusion and DALL-E 3, has gained significance for transforming text into detailed images. Despite their growing use and remarkable generative capabilities, there is a pressing need for a detailed examination of these models' behavior, particularly with respect to memorization. Historically, memorization in machine learning has been context-dependent, with diverse definitions emerging from classification tasks to complex models like Large Language Models (LLMs) and Diffusion models. Yet, a definitive concept of memorization that aligns with the intricacies of text-to-image synthesis remains elusive. This understanding is vital as memorization poses privacy risks yet is essential for meeting user expectations, especially when generating representations of underrepresented entities. In this paper, we introduce a specialized definition of memorization tailored to text-to-image models, categorizing it into three distinct types according to user expectations. We closely examine the subtle distinctions between intended and unintended memorization, emphasizing the importance of balancing user privacy with the generative quality of the model outputs. Using the Stable Diffusion model, we offer examples to validate our memorization definitions and clarify their application.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2312.04692

Diffence: Fencing Membership Privacy With Diffusion Models

Authors: Yuefeng Peng, Ali Naseh, Amir Houmansadr

Abstract: Deep learning models, while achieving remarkable performance across various tasks, are vulnerable to membership inference attacks, wherein adversaries identify if a specific data point was part of a model's training set. This susceptibility raises substantial privacy concerns, especially when models are trained on sensitive datasets. Current defense methods often struggle to provide robust protection without hurting model utility, and they often require retraining the model or using extra data. In this work, we introduce a novel defense framework against membership attacks by leveraging generative models. The key intuition of our defense is to remove the differences between member and non-member inputs which can be used to perform membership attacks, by re-generating input samples before feeding them to the target model. Therefore, our defense works \emph{pre-inference}, which is unlike prior defenses that are either training-time (modify the model) or post-inference time (modify the model's output). A unique feature of our defense is that it works on input samples only, without modifying the training or inference phase of the target model. Therefore, it can be cascaded with other defense mechanisms as we demonstrate through experiments. Through extensive experimentation, we show that our approach can serve as a robust plug-n-play defense mechanism, enhancing membership privacy without compromising model utility in both baseline and defended settings.
For example, our method enhanced the effectiveness of recent state-of-the-art defenses, reducing attack accuracy by an average of 5.7\% to 12.4\% across three datasets, without any impact on the model's accuracy. By integrating our method with prior defenses, we achieve new state-of-the-art performance in the privacy-utility trade-off.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.03692

Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication

Authors: Ali Naseh, Jaechul Roh, Amir Houmansadr

Abstract: Diffusion-based models, such as the Stable Diffusion model, have revolutionized text-to-image synthesis with their ability to produce high-quality, high-resolution images. These advancements have prompted significant progress in image generation and editing tasks. However, these models also raise concerns due to their tendency to memorize and potentially replicate exact training samples, posing privacy risks and enabling adversarial attacks. Duplication in training datasets is recognized as a major factor contributing to memorization, and various forms of memorization have been studied so far. This paper focuses on two distinct and underexplored types of duplication that lead to replication during inference in diffusion-based models, particularly in the Stable Diffusion model. We delve into these lesser-studied duplication phenomena and their implications through two case studies, aiming to contribute to the safer and more responsible use of generative models in various applications.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2310.19163

RAIFLE: Reconstruction Attacks on Interaction-based Federated Learning with Adversarial Data Manipulation

Authors: Dzung Pham, Shreyas Kulkarni, Amir Houmansadr

Abstract: Federated learning has emerged as a promising privacy-preserving solution for machine learning domains that rely on user interactions, particularly recommender systems and online learning to rank. While there has been substantial research on the privacy of traditional federated learning, little attention has been paid to the privacy properties of these interaction-based settings. In this work, we show that users face an elevated risk of having their private interactions reconstructed by the central server when the server can control the training features of the items that users interact with. We introduce RAIFLE, a novel optimization-based attack framework where the server actively manipulates the features of the items presented to users to increase the success rate of reconstruction. Our experiments with federated recommendation and online learning-to-rank scenarios demonstrate that RAIFLE is significantly more powerful than existing reconstruction attacks like gradient inversion, achieving high performance consistently in most settings. We discuss the pros and cons of several possible countermeasures to defend against RAIFLE in the context of interaction-based federated learning. Our code is open-sourced at https://github.com/dzungvpham/raifle.

Submitted 11 July, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

arXiv:2309.10147

Realistic Website Fingerprinting By Augmenting Network Trace

Authors: Alireza Bahramali, Ardavan Bozorgi, Amir Houmansadr

Abstract: Website Fingerprinting (WF) is considered a major threat to the anonymity of Tor users (and other anonymity systems). While state-of-the-art WF techniques have claimed high attack accuracies, e.g., by leveraging Deep Neural Networks (DNN), several recent works have questioned the practicality of such WF attacks in the real world due to the assumptions made in the design and evaluation of these attacks. In this work, we argue that such impracticality issues are mainly due to the attacker's inability to collect training data in comprehensive network conditions, e.g., a WF classifier may be trained only on samples collected on specific high-bandwidth network links but deployed on connections with different network conditions. We show that augmenting network traces can enhance the performance of WF classifiers in unobserved network conditions. Specifically, we introduce NetAugment, an augmentation technique tailored to the specifications of Tor traces. We instantiate NetAugment through semi-supervised and self-supervised learning techniques. Our extensive open-world and closed-world experiments demonstrate that under practical evaluation settings, our WF attacks provide superior performance compared to the state-of-the-art; this is due to their use of augmented network traces for training, which allows them to learn the features of target traffic in unobserved settings.
For instance, with 5-shot learning in a closed-world scenario, our self-supervised WF attack (named NetCLR) reaches up to 80% accuracy when the traces for evaluation are collected in a setting unobserved by the WF adversary. This is compared to an accuracy of 64.4% achieved by the state-of-the-art Triplet Fingerprinting [35]. We believe that the promising results of our work can encourage the use of network trace augmentation in other types of network traffic analysis.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2303.04729

doi 10.1145/3576915.3616652

Stealing the Decoding Algorithms of Language Models

Authors: Ali Naseh, Kalpesh Krishna, Mohit Iyyer, Amir Houmansadr

Abstract: A key component of generating text from modern language models (LM) is the selection and tuning of decoding algorithms. These algorithms determine how to generate text from the internal probability distribution generated by the LM. The process of choosing a decoding algorithm and tuning its hyperparameters takes significant time, manual effort, and computation, and it also requires extensive human evaluation. Therefore, the identity and hyperparameters of such decoding algorithms are considered to be extremely valuable to their owners. In this work, we show, for the first time, that an adversary with typical API access to an LM can steal the type and hyperparameters of its decoding algorithms at very low monetary costs. Our attack is effective against popular LMs used in text generation APIs, including GPT-2, GPT-3 and GPT-Neo. We demonstrate the feasibility of stealing such information with only a few dollars, e.g., $\$0.8$, $\$1$, $\$4$, and $\$40$ for the four versions of GPT-3.

Submitted 1 December, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Journal ref: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
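
To make concrete what a decoding algorithm and its hyperparameters are in the abstract above, here is a minimal sketch of one common choice, top-k sampling with temperature. It is shown only as background: the abstract does not say which algorithms were targeted, and the toy scores, function name, and parameter values are assumptions.

```python
import math
import random


def top_k_temperature_sample(scores, k, temperature, rng):
    """Sample one token from the k highest-scoring candidates.

    `scores` maps tokens to unnormalized log-scores. The pair
    (k, temperature) is the kind of hyperparameter setting that a
    provider tunes for a deployed model.
    """
    # Keep only the k best candidates.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    # Temperature rescales the scores before the softmax.
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token scores (assumed values).
scores = {"cat": 2.0, "dog": 1.5, "car": 0.2, "sky": -1.0}
rng = random.Random(0)

greedy_like = top_k_temperature_sample(scores, k=1, temperature=1.0, rng=rng)
sampled = [
    top_k_temperature_sample(scores, k=2, temperature=0.7, rng=rng)
    for _ in range(20)
]
print(greedy_like)
print(sorted(set(sampled)))
```

With `k=1` the procedure always returns the highest-scoring token, and with `k=2` only the two best candidates can ever appear, so the observable output distribution depends on the hidden hyperparameters.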

arXiv:2212.01716

Security Analysis of SplitFed Learning

Authors: Momin Ahmad Khan, Virat Shejwalkar, Amir Houmansadr, Fatima Muhammad Anwar

Abstract: Split Learning (SL) and Federated Learning (FL) are two prominent distributed collaborative learning techniques that maintain data privacy by allowing clients to never share their private data with other clients and servers, and find extensive IoT applications in smart healthcare, smart cities, and smart industry. Prior work has extensively explored the security vulnerabilities of FL in the form of poisoning attacks. To mitigate the effect of these attacks, several defenses have also been proposed. Recently, a hybrid of both learning techniques has emerged (commonly known as SplitFed) that capitalizes on their advantages (fast training) and eliminates their intrinsic disadvantages (centralized model updates). In this paper, we perform the first ever empirical analysis of SplitFed's robustness to strong model poisoning attacks. We observe that the model updates in SplitFed have significantly smaller dimensionality as compared to FL, which is known to suffer from the curse of dimensionality. We show that large models that have higher dimensionality are more susceptible to privacy and security attacks, whereas the clients in SplitFed do not have the complete model and have lower dimensionality, making them more robust to existing model poisoning attacks. Our results show that the accuracy reduction due to the model poisoning attack is 5x lower for SplitFed compared to FL.

Submitted 3 December, 2022; originally announced December 2022.

arXiv:2211.00453

The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning

Authors: Virat Shejwalkar, Lingjuan Lyu, Amir Houmansadr

Abstract: Semi-supervised machine learning (SSL) is gaining popularity as it reduces the cost of training ML models. It does so by using very small amounts of (expensive, well-inspected) labeled data and large amounts of (cheap, non-inspected) unlabeled data. SSL has shown comparable or even superior performance compared to conventional fully-supervised ML techniques. In this paper, we show that the key feature of SSL, that it can learn from (non-inspected) unlabeled data, exposes SSL to strong poisoning attacks. In fact, we argue that, due to its reliance on non-inspected unlabeled data, poisoning is a much more severe problem in SSL than in conventional fully-supervised ML. Specifically, we design a backdoor poisoning attack on SSL that can be conducted by a weak adversary with no knowledge of the target SSL pipeline. This is unlike prior poisoning attacks in fully-supervised settings that assume strong adversaries with practically-unrealistic capabilities. We show that by poisoning only 0.2% of the unlabeled training data, our attack can cause misclassification of more than 80% of test inputs (when they contain the adversary's backdoor trigger). Our attacks remain effective across twenty combinations of benchmark datasets and SSL algorithms, and even circumvent the state-of-the-art defenses against backdoor attacks. Our work raises significant concerns about the practical utility of existing SSL algorithms.

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2205.10454 [pdf, other]

E2FL: Equal and Equitable Federated Learning

Authors: Hamid Mozaffari, Amir Houmansadr

Abstract: Federated Learning (FL) enables data owners to train a shared global model without sharing their private data. Unfortunately, FL is susceptible to an intrinsic fairness issue: due to heterogeneity in clients' data distributions, the final trained model can give disproportionate advantages across the participating clients. In this work, we present Equal and Equitable Federated Learning (E2FL) to produce fair federated learning models by preserving two main fairness properties, equity and equality, concurrently. We validate the efficiency and fairness of E2FL in different real-world FL applications, and show that E2FL outperforms existing baselines in terms of the resulting efficiency, fairness of different groups, and fairness among all individual clients.

Submitted 16 August, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

arXiv:2110.08324 [pdf, other]

Mitigating Membership Inference Attacks by Self-Distillation Through a Novel Ensemble Architecture

Authors: Xinyu Tang, Saeed Mahloujifar, Liwei Song, Virat Shejwalkar, Milad Nasr, Amir Houmansadr, Prateek Mittal

Abstract: Membership inference attacks are a key measure to evaluate privacy leakage in machine learning (ML) models. These attacks aim to distinguish training members from non-members by exploiting differential behavior of the models on member and non-member inputs. The goal of this work is to train ML models that have high membership privacy while largely preserving their utility; we therefore aim for an empirical membership privacy guarantee as opposed to the provable privacy guarantees provided by techniques like differential privacy, as such techniques are shown to deteriorate model utility. Specifically, we propose a new framework to train privacy-preserving models that induces similar behavior on member and non-member inputs to mitigate membership inference attacks. Our framework, called SELENA, has two major components. The first component and the core of our defense is a novel ensemble architecture for training. This architecture, which we call Split-AI, splits the training data into random subsets, and trains a model on each subset of the data. We use an adaptive inference strategy at test time: our ensemble architecture aggregates the outputs of only those models that did not contain the input sample in their training data. We prove that our Split-AI architecture defends against a large family of membership inference attacks; however, it is susceptible to new adaptive attacks. Therefore, we use a second component in our framework called Self-Distillation to protect against such stronger attacks. The Self-Distillation component (self-)distills the training dataset through our Split-AI ensemble, without using any external public datasets. Through extensive experiments on major benchmark datasets we show that SELENA presents a superior trade-off between membership privacy and utility compared to the state of the art.

Submitted 15 October, 2021; originally announced October 2021.
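
The adaptive inference strategy described in the SELENA abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the representation of models as plain callables, and the choice to average a random same-sized subset of models for inputs outside the training set are all assumptions made here for illustration.

```python
import random

def assign_subsets(n_samples, k_models, l_excluded, seed=0):
    """For each training sample, pick the l_excluded models that will NOT train on it.

    Hypothetical helper: returns one frozenset of model ids per sample index.
    """
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(k_models), l_excluded))
            for _ in range(n_samples)]

def training_indices(excluded, model_id):
    """Sample indices that model `model_id` is allowed to train on."""
    return [i for i, ex in enumerate(excluded) if model_id not in ex]

def adaptive_predict(models, excluded, sample_index, x, seed=0):
    """Average the outputs of models that never saw the input.

    `models` is a list of callables mapping an input to a list of floats.
    `sample_index` is the index of x in the training set, or None if x is
    not a training sample (assumption: then a random subset of the same
    size is used).
    """
    if sample_index is not None:
        chosen = sorted(excluded[sample_index])
    else:
        size = len(excluded[0])
        chosen = random.Random(seed).sample(range(len(models)), size)
    outputs = [models[m](x) for m in chosen]
    # Element-wise mean over the chosen models' output vectors.
    return [sum(vals) / len(outputs) for vals in zip(*outputs)]
```

With this layout, a training sample is only ever scored by models whose training subset did not include it, which is the property the abstract attributes to Split-AI.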

arXiv:2110.04350 [pdf, other]

FRL: Federated Rank Learning

Authors: Hamid Mozaffari, Virat Shejwalkar, Amir Houmansadr

Abstract: Federated learning (FL) allows mutually untrusted clients to collaboratively train a common machine learning model without sharing their private/proprietary training data among each other. FL is unfortunately susceptible to poisoning by malicious clients who aim to hamper the accuracy of the commonly trained model through sending malicious model updates during FL's training process. We argue that the key factor to the success of poisoning attacks against existing FL systems is the large space of model updates available to the clients, allowing malicious clients to search for the most poisonous model updates, e.g., by solving an optimization problem. To address this, we propose Federated Rank Learning (FRL). FRL reduces the space of client updates from model parameter updates (a continuous space of float numbers) in standard FL to the space of parameter rankings (a discrete space of integer values). To be able to train the global model using parameter ranks (instead of parameter weights), FRL leverages ideas from recent supermasks training mechanisms. Specifically, FRL clients rank the parameters of a randomly initialized neural network (provided by the server) based on their local training data. The FRL server uses a voting mechanism to aggregate the parameter rankings submitted by clients in each training epoch to generate the global ranking of the next training epoch. Intuitively, our voting-based aggregation mechanism prevents poisoning clients from making significant adversarial modifications to the global model, as each client will have a single vote! We demonstrate the robustness of FRL to poisoning through analytical proofs and experimentation. We also show FRL's high communication efficiency. Our experiments demonstrate the superiority of FRL in real-world FL settings.

Submitted 16 August, 2022; v1 submitted 8 October, 2021; originally announced October 2021.
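
The voting-based aggregation of parameter rankings described in the FRL abstract above can be sketched as follows. The abstract does not specify the voting rule, so this sketch assumes a simple sum-of-positions (Borda-style) vote; the function name and the ranking encoding are hypothetical.

```python
def aggregate_rankings(client_rankings):
    """Combine per-client parameter rankings into one global ranking.

    Each ranking is a list of parameter indices ordered from least to most
    important (assumed encoding). Every client contributes exactly one
    ranking, so each client has a single, bounded vote.
    """
    n_params = len(client_rankings[0])
    scores = [0] * n_params
    for ranking in client_rankings:
        for position, param in enumerate(ranking):
            scores[param] += position  # higher position = more important
    # Sort parameters by total score; ties are broken by parameter index.
    return sorted(range(n_params), key=lambda p: (scores[p], p))

# Two honest clients agree; one malicious client submits the reverse order.
honest = [0, 1, 2, 3]
malicious = [3, 2, 1, 0]
global_ranking = aggregate_rankings([honest, honest, malicious])
```

In this toy run the total scores are 3, 4, 5, and 6 for parameters 0 through 3, so the global ranking stays at the honest order: the reversed ranking shifts the totals but cannot outweigh two agreeing clients.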

arXiv:2108.12336 [pdf, other]

Superstring-Based Sequence Obfuscation to Thwart Pattern Matching Attacks

Authors: Bo Guan, Nazanin Takbiri, Dennis Goeckel, Amir Houmansadr, Hossein Pishro-Nik

Abstract: User privacy can be compromised by matching user data traces to records of their previous behavior. The matching of the statistical characteristics of traces to prior user behavior has been widely studied. However, an adversary can also identify a user deterministically by searching data traces for a pattern that is unique to that user. Our goal is to thwart such an adversary by applying small artificial distortions to data traces such that each potentially identifying pattern is shared by a large number of users. Importantly, in contrast to statistical approaches, we develop data-independent algorithms that require no assumptions on the model by which the traces are generated. By relating the problem to a set of combinatorial questions on sequence construction, we are able to provide provable guarantees for our proposed constructions. We also introduce data-dependent approaches for the same problem. The algorithms are evaluated on synthetic data traces and on the Reality Mining Dataset to demonstrate their utility.

Submitted 27 August, 2021; originally announced August 2021.

arXiv:2108.10241 [pdf, other]

Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Production Federated Learning

Authors: Virat Shejwalkar, Amir Houmansadr, Peter Kairouz, Daniel Ramage

Abstract: While recent works have indicated that federated learning (FL) may be vulnerable to poisoning attacks by compromised clients, their real impact on production FL systems is not fully understood. In this work, we aim to develop a comprehensive systemization for poisoning attacks on FL by enumerating all possible threat models, variations of poisoning, and adversary capabilities. We specifically put our focus on untargeted poisoning attacks, as we argue that they are significantly relevant to production FL deployments. We present a critical analysis of untargeted poisoning attacks under practical, production FL environments by carefully characterizing the set of realistic threat models and adversarial capabilities. Our findings are rather surprising: contrary to the established belief, we show that FL is highly robust in practice even when using simple, low-cost defenses. We go even further and propose novel, state-of-the-art data and model poisoning attacks, and show via an extensive set of experiments across three benchmark datasets how (in)effective poisoning attacks are in the presence of simple defense mechanisms. We aim to correct previous misconceptions and offer concrete guidelines to conduct more accurate (and more realistic) research on this topic.

Submitted 13 December, 2021; v1 submitted 23 August, 2021; originally announced August 2021.

Comments: To appear in the IEEE Symposium on Security & Privacy (Oakland), 2022

arXiv:2102.00918 [pdf, other]

Robust Adversarial Attacks Against DNN-Based Wireless Communication Systems

Authors: Alireza Bahramali, Milad Nasr, Amir Houmansadr, Dennis Goeckel, Don Towsley

Abstract: Deep Neural Networks (DNNs) have become prevalent in wireless communication systems due to their promising performance. However, similar to other DNN-based applications, they are vulnerable to adversarial examples. In this work, we propose an input-agnostic, undetectable, and robust adversarial attack against DNN-based wireless communication systems in both white-box and black-box scenarios. We design tailored Universal Adversarial Perturbations (UAPs) to perform the attack. We also use a Generative Adversarial Network (GAN) to enforce an undetectability constraint for our attack. Furthermore, we investigate the robustness of our attack against countermeasures. We show that in the presence of defense mechanisms deployed by the communicating parties, our attack performs significantly better compared to existing attacks against DNN-based wireless systems. In particular, the results demonstrate that even when employing well-considered defenses, DNN-based wireless communications are vulnerable to adversarial attacks.

Submitted 1 February, 2021; originally announced February 2021.

arXiv:2007.11524 [pdf, ps, other]

Improving Deep Learning with Differential Privacy using Gradient Encoding and Denoising

Authors: Milad Nasr, Reza Shokri, Amir Houmansadr

Abstract: Deep learning models leak significant amounts of information about their training datasets. Previous work has investigated training models with differential privacy (DP) guarantees through adding DP noise to the gradients. However, such solutions (specifically, DPSGD), result in large degradations in the accuracy of the trained models. In this paper, we aim at training deep learning models with DP guarantees while preserving model accuracy much better than previous works. Our key technique is to encode gradients to map them to a smaller vector space, therefore enabling us to obtain DP guarantees for different noise distributions. This allows us to investigate and choose noise distributions that best preserve model accuracy for a target privacy budget. We also take advantage of the post-processing property of differential privacy by introducing the idea of denoising, which further improves the utility of the trained models without degrading their DP guarantees. We show that our mechanism outperforms the state-of-the-art DPSGD; for instance, for the same model accuracy of $96.1\%$ on MNIST, our technique results in a privacy bound of $ε=3.2$ compared to $ε=6$ of DPSGD, which is a significant improvement.

Submitted 22 July, 2020; originally announced July 2020.

arXiv:2007.06119 [pdf, other]

Asymptotic Privacy Loss due to Time Series Matching of Dependent Users

Authors: Nazanin Takbiri, Minting Chen, Dennis L. Goeckel, Amir Houmansadr, Hossein Pishro-Nik

Abstract: The Internet of Things (IoT) promises to improve user utility by tuning applications to user behavior, but revealing the characteristics of a user's behavior presents a significant privacy risk. Our previous work has established the challenging requirements for anonymization to protect users' privacy in a Bayesian setting in which we assume a powerful adversary who has perfect knowledge of the prior distribution for each user's behavior. However, even sophisticated adversaries do not often have such perfect knowledge; hence, in this paper, we turn our attention to an adversary who must learn user behavior from past data traces of limited length. We also assume there exists dependency between data traces of different users, and the data points of each user are drawn from a normal distribution. Results on the lengths of training sequences and data sequences that result in a loss of user privacy are presented.

Submitted 12 July, 2020; originally announced July 2020.

arXiv:2005.00508 [pdf, other]

doi 10.14722/ndss.2020.24347

Practical Traffic Analysis Attacks on Secure Messaging Applications

Authors: Alireza Bahramali, Ramin Soltani, Amir Houmansadr, Dennis Goeckel, Don Towsley

Abstract: Instant Messaging (IM) applications like Telegram, Signal, and WhatsApp have become extremely popular in recent years. Unfortunately, such IM services have been targets of continuous governmental surveillance and censorship, as these services are home to public and private communication channels on socially and politically sensitive topics. To protect their clients, popular IM services deploy state-of-the-art encryption mechanisms. In this paper, we show that despite the use of advanced encryption, popular IM applications leak sensitive information about their clients to adversaries who merely monitor their encrypted IM traffic, with no need for leveraging any software vulnerabilities of IM applications. Specifically, we devise traffic analysis attacks that enable an adversary to identify administrators as well as members of target IM channels (e.g., forums) with high accuracies. We believe that our study demonstrates a significant, real-world threat to the users of such services given the increasing attempts by oppressive governments at cracking down controversial IM channels. We demonstrate the practicality of our traffic analysis attacks through extensive experiments on real-world IM communications. We show that standard countermeasure techniques such as adding cover traffic can degrade the effectiveness of the attacks we introduce in this paper. We hope that our study will encourage IM providers to integrate effective traffic obfuscation countermeasures into their software. In the meantime, we have designed and deployed an open-source, publicly available countermeasure system, called IMProxy, that can be used by IM clients with no need for any support from IM providers. We have demonstrated the effectiveness of IMProxy through experiments.

Submitted 1 May, 2020; originally announced May 2020.

Journal ref: Network and Distributed Systems Security (NDSS) Symposium 2020

arXiv:2002.06495 [pdf, other]

Blind Adversarial Network Perturbations

Authors: Milad Nasr, Alireza Bahramali, Amir Houmansadr

Abstract: Deep Neural Networks (DNNs) are commonly used for various traffic analysis problems, such as website fingerprinting and flow correlation, as they outperform traditional (e.g., statistical) techniques by large margins. However, deep neural networks are known to be vulnerable to adversarial examples: adversarial inputs to the model that get labeled incorrectly by the model due to small adversarial perturbations. In this paper, for the first time, we show that an adversary can defeat DNN-based traffic analysis techniques by applying adversarial perturbations on the patterns of live network traffic.

Submitted 15 February, 2020; originally announced February 2020.

arXiv:1912.11279 [pdf, ps, other]

Cronus: Robust and Heterogeneous Collaborative Learning with Black-Box Knowledge Transfer

Authors: Hongyan Chang, Virat Shejwalkar, Reza Shokri, Amir Houmansadr

Abstract: Collaborative (federated) learning enables multiple parties to train a model without sharing their private data, but through repeated sharing of the parameters of their local models. Despite its advantages, this approach has many known privacy and security weaknesses and performance overhead, in addition to being limited only to models with homogeneous architectures. Shared parameters leak a significant amount of information about the local (and supposedly private) datasets. Besides, federated learning is severely vulnerable to poisoning attacks, where some participants can adversarially influence the aggregate parameters. Large models, with high dimensional parameter vectors, are in particular highly susceptible to privacy and security attacks: curse of dimensionality in federated learning. We argue that sharing parameters is the most naive way of information exchange in collaborative learning, as they open all the internal state of the model to inference attacks, and maximize the model's malleability by stealthy poisoning attacks. We propose Cronus, a robust collaborative machine learning framework. The simple yet effective idea behind designing Cronus is to control, unify, and significantly reduce the dimensions of the exchanged information between parties, through robust knowledge transfer between their black-box local models. We evaluate all existing federated learning algorithms against poisoning attacks, and we show that Cronus is the only secure method, due to its tight robustness guarantee. Treating local models as black boxes reduces the information leakage through models and enables us to use existing privacy-preserving algorithms that mitigate the risk of information leakage through the model's output (predictions). Cronus also has a significantly lower sample complexity, compared to federated learning, which does not bind its security to the number of participants.

Submitted 24 December, 2019; originally announced December 2019.

arXiv:1912.02209 [pdf, other]

Leveraging Prior Knowledge Asymmetries in the Design of Location Privacy-Preserving Mechanisms

Authors: Nazanin Takbiri, Virat Shejwalkar, Amir Houmansadr, Dennis L. Goeckel, Hossein Pishro-Nik

Abstract: The prevalence of mobile devices and Location-Based Services (LBS) necessitate the study of Location Privacy-Preserving Mechanisms (LPPM). However, LPPMs reduce the utility of LBS due to the noise they add to users' locations. Here, we consider the remapping technique, which presumes the adversary has a perfect statistical model for the user location. We consider this assumption and show that under practical assumptions on the adversary's knowledge, the remapping technique leaks privacy not only about the true location data, but also about the statistical model. Finally, we introduce a novel solution called "Randomized Remapping" as a countermeasure.

Submitted 4 December, 2019; originally announced December 2019.

Comments: Submitted to IEEE Wireless Communications Letters

arXiv:1906.06589 [pdf, other]

Membership Privacy for Machine Learning Models Through Knowledge Transfer

Authors: Virat Shejwalkar, Amir Houmansadr

Abstract: Large capacity machine learning (ML) models are prone to membership inference attacks (MIAs), which aim to infer whether the target sample is a member of the target model's training dataset. The serious privacy concerns due to the membership inference have motivated multiple defenses against MIAs, e.g., differential privacy and adversarial regularization. Unfortunately, these defenses produce ML models with unacceptably low classification performances. Our work proposes a new defense, called distillation for membership privacy (DMP), against MIAs that preserves the utility of the resulting models significantly better than prior defenses. DMP leverages knowledge distillation to train ML models with membership privacy. We provide a novel criterion to tune the data used for knowledge transfer in order to amplify the membership privacy of DMP. Our extensive evaluation shows that DMP provides significantly better tradeoffs between membership privacy and classification accuracies compared to state-of-the-art MIA defenses. For instance, DMP achieves ~100% accuracy improvement over adversarial regularization for DenseNet trained on CIFAR100, for similar membership privacy (measured using MIA risk): when the MIA risk is 53.7%, adversarially regularized DenseNet is 33.6% accurate, while DMP-trained DenseNet is 65.3% accurate.

Submitted 31 December, 2020; v1 submitted 15 June, 2019; originally announced June 2019.

Comments: To Appear in the 35th AAAI Conference on Artificial Intelligence, 2021

arXiv:1903.11640 [pdf, other]

Fundamental Limits of Covert Packet Insertion

Authors: Ramin Soltani, Dennis Goeckel, Don Towsley, Amir Houmansadr

Abstract: Covert communication conceals the existence of the transmission from a watchful adversary. We consider the fundamental limits for covert communications via packet insertion over packet channels whose packet timings are governed by a renewal process of rate $λ$. Authorized transmitter Jack sends packets to authorized receiver Steve, and covert transmitter Alice wishes to transmit packets to covert receiver Bob without being detected by watchful adversary Willie. Willie cannot authenticate the source of the packets. Hence, he looks for statistical anomalies in the packet stream from Jack to Steve to attempt detection of unauthorized packet insertion. First, we consider a special case where the packet timings are governed by a Poisson process and we show that Alice can covertly insert $\mathcal{O}(\sqrt{λT})$ packets for Bob in a time interval of length $T$; conversely, if Alice inserts $ω(\sqrt{λT})$, she will be detected by Willie with high probability. Then, we extend our results to general renewal channels and show that in a stream of $N$ packets transmitted by Jack, Alice can covertly insert $\mathcal{O}(\sqrt{N})$ packets; if she inserts $ω(\sqrt{N})$ packets, she will be detected by Willie with high probability.

Submitted 27 March, 2019; originally announced March 2019.

arXiv:1902.06404 [pdf, other]

Asymptotic Limits of Privacy in Bayesian Time Series Matching

Authors: Nazanin Takbiri, Dennis L. Goeckel, Amir Houmansadr, Hossein Pishro-Nik

Abstract: Various modern and highly popular applications make use of user data traces in order to offer specific services, often for the purpose of improving the user's experience while using such applications. However, even when user data is privatized by employing privacy-preserving mechanisms (PPM), users' privacy may still be compromised by an external party who leverages statistical matching methods to match users' traces with their previous activities. In this paper, we obtain the theoretical bounds on user privacy for situations in which user traces are matchable to sequences of prior behavior, despite anonymization of data time series. We provide both achievability and converse results for the case where the data trace of each user consists of independent and identically distributed (i.i.d.) random samples drawn from a multinomial distribution, as well as the case that the users' data points are dependent over time and the data trace of each user is governed by a Markov chain model.

Submitted 18 February, 2019; originally announced February 2019.

Comments: The 53rd Annual Conference on Information Sciences and Systems

Journal ref: The 53rd Annual Conference on Information Sciences and Systems 2019

arXiv:1812.00910 [pdf, ps, other]

doi 10.1109/SP.2019.00065

Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning

Authors: Milad Nasr, Reza Shokri, Amir Houmansadr

Abstract: Deep neural networks are susceptible to various inference attacks as they remember information about their training data. We design white-box inference attacks to perform a comprehensive privacy analysis of deep learning models. We measure the privacy leakage through parameters of fully trained models as well as the parameter updates of models during training. We design inference algorithms for both centralized and federated learning, with respect to passive and active inference attackers, and assuming different adversary prior knowledge. We evaluate our novel white-box membership inference attacks against deep learning algorithms to trace their training data records. We show that a straightforward extension of the known black-box attacks to the white-box setting (through analyzing the outputs of activation functions) is ineffective. We therefore design new algorithms tailored to the white-box setting by exploiting the privacy vulnerabilities of the stochastic gradient descent algorithm, which is the algorithm used to train deep neural networks. We investigate the reasons why deep learning models may leak information about their training data. We then show that even well-generalized models are significantly susceptible to white-box membership inference attacks, by analyzing state-of-the-art pre-trained and publicly available models for the CIFAR dataset. We also show how adversarial participants, in the federated learning setting, can successfully run active membership inference attacks against other participants, even when the global model achieves high prediction accuracies.

Submitted 6 June, 2020; v1 submitted 3 December, 2018; originally announced December 2018.

Comments: 2019 IEEE Symposium on Security and Privacy (SP)

arXiv:1810.03510 [pdf, other]

Fundamental Limits of Covert Bit Insertion in Packets

Authors: Ramin Soltani, Dennis Goeckel, Don Towsley, Amir Houmansadr

Abstract: Covert communication is necessary when revealing the mere existence of a message leaks sensitive information to an attacker. Consider a network link where an authorized transmitter Jack sends packets to an authorized receiver Steve, and the packets visit Alice, Willie, and Bob, respectively, before they reach Steve. Covert transmitter Alice wishes to alter the packet stream in some way to send information to covert receiver Bob without watchful and capable adversary Willie being able to detect the presence of the message. In our previous works, we addressed two techniques for such covert transmission from Alice to Bob: packet insertion and packet timing. In this paper, we consider covert communication via bit insertion in packets with available space (e.g., with size less than the maximum transmission unit). We consider three scenarios: 1) packet sizes are independent and identically distributed (i.i.d.) with a probability mass function (pmf) whose support is a set of one bit spaced values; 2) packet sizes are i.i.d. with a pmf whose support is arbitrary; 3) packet sizes may be dependent. For the first and second assumptions, we show that Alice can covertly insert $\mathcal{O}(\sqrt{n})$ bits of information in a flow of $n$ packets; conversely, if she inserts $ω(\sqrt{n})$ bits of information, Willie can detect her with arbitrarily small error probability. For the third assumption, we prove Alice can covertly insert on average $\mathcal{O}(c(n)/\sqrt{n})$ bits in a sequence of $n$ packets, where $c(n)$ is the average number of conditional pmf of packet sizes given the history, with a support of at least size two.

Submitted 8 October, 2018; originally announced October 2018.

Comments: This work has been presented at the 56th Annual Allerton Conference on Communication, Control, and Computing, October 2018

arXiv:1809.10289 [pdf, other]

Asymptotic Loss in Privacy due to Dependency in Gaussian Traces

Authors: Nazanin Takbiri, Ramin Soltani, Dennis L. Goeckel, Amir Houmansadr, Hossein Pishro-Nik

Abstract: The rapid growth of the Internet of Things (IoT) necessitates employing privacy-preserving techniques to protect users' sensitive information. Even when user traces are anonymized, statistical matching can be employed to infer sensitive information. In our previous work, we have established the privacy requirements for the case that the user traces are instantiations of discrete random variables and the adversary knows only the structure of the dependency graph, i.e., whether each pair of users is connected. In this paper, we consider the case where data traces are instantiations of Gaussian random variables and the adversary knows not only the structure of the graph but also the pairwise correlation coefficients. We establish the requirements on anonymization to thwart such statistical matching, which demonstrate the significant degree to which knowledge of the pairwise correlation coefficients further aids the adversary in breaking user anonymity.

Submitted 18 February, 2019; v1 submitted 26 September, 2018; originally announced September 2018.

Comments: IEEE Wireless Communications and Networking Conference

arXiv:1809.08514 [pdf, other]

Fundamental Limits of Invisible Flow Fingerprinting

Authors: Ramin Soltani, Dennis Goeckel, Don Towsley, Amir Houmansadr

Abstract: Network flow fingerprinting can be used to de-anonymize communications on anonymity systems such as Tor by linking the ingress and egress segments of anonymized connections. Assume Alice and Bob have access to the input and the output links of an anonymous network, respectively, and they wish to collaboratively reveal the connections between the input and the output links without being detected by Willie who protects the network. Alice generates a codebook of fingerprints, where each fingerprint corresponds to a unique sequence of inter-packet delays and shares it only with Bob. For each input flow, she selects a fingerprint from the codebook and embeds it in the flow, i.e., changes the packet timings of the flow to follow the packet timings suggested by the fingerprint, and Bob extracts the fingerprints from the output flows. We model the network as parallel $M/M/1$ queues where each queue is shared by a flow from Alice to Bob and other flows independent of the flow from Alice to Bob. The timings of the flows are governed by independent Poisson point processes. Assuming all input flows have equal rates and that Bob observes only flows with fingerprints, we first present two scenarios: 1) Alice fingerprints all the flows; 2) Alice fingerprints a subset of the flows, unknown to Willie. Then, we extend the construction and analysis to the case where flow rates are arbitrary as well as the case where not all the flows that Bob observes have a fingerprint. For each scenario, we derive the number of flows that Alice can fingerprint and Bob can trace by fingerprinting.

Submitted 27 March, 2019; v1 submitted 22 September, 2018; originally announced September 2018.

arXiv:1808.07285 [pdf, other]

doi 10.1145/3243734.3243824

DeepCorr: Strong Flow Correlation Attacks on Tor Using Deep Learning

Authors: Milad Nasr, Alireza Bahramali, Amir Houmansadr

Abstract: Flow correlation is the core technique used in a multitude of deanonymization attacks on Tor. Despite the importance of flow correlation attacks on Tor, existing flow correlation techniques are considered to be ineffective and unreliable in linking Tor flows when applied at a large scale, i.e., they impose high false positive rates or require impractically long flow observations to be able to make reliable correlations. In this paper, we show that, unfortunately, flow correlation attacks can be conducted on Tor traffic with drastically higher accuracies than before by leveraging emerging learning mechanisms. We particularly design a system, called DeepCorr, that outperforms the state-of-the-art by significant margins in correlating Tor connections. DeepCorr leverages an advanced deep learning architecture to learn a flow correlation function tailored to Tor's complex network; this is in contrast to previous works' use of generic statistical correlation metrics to correlate Tor flows. We show that with moderate learning, DeepCorr can correlate Tor connections (and therefore break its anonymity) with accuracies significantly higher than existing algorithms, and using substantially shorter lengths of flow observations. For instance, by collecting only about 900 packets of each target Tor flow (roughly 900KB of Tor data), DeepCorr provides a flow correlation accuracy of 96% compared to 4% by the state-of-the-art system of RAPTOR in the exact same setting. We hope that our work demonstrates the escalating threat of flow correlation attacks on Tor given recent advances in learning algorithms, calling for the timely deployment of effective countermeasures by the Tor community.

Submitted 22 August, 2018; originally announced August 2018.

arXiv:1807.05852 [pdf, ps, other]

Machine Learning with Membership Privacy using Adversarial Regularization

Authors: Milad Nasr, Reza Shokri, Amir Houmansadr

Abstract: Machine learning models leak information about the datasets on which they are trained. An adversary can build an algorithm to trace the individual members of a model's training dataset. As a fundamental inference attack, he aims to distinguish between data points that were part of the model's training set and any other data points from the same distribution. This is known as the tracing (and also membership inference) attack. In this paper, we focus on such attacks against black-box models, where the adversary can only observe the output of the model, but not its parameters. This is the current setting of machine learning as a service on the Internet. We introduce a privacy mechanism to train machine learning models that provably achieve membership privacy: the model's predictions on its training data are indistinguishable from its predictions on other data points from the same distribution. We design a strategic mechanism where the privacy mechanism anticipates the membership inference attacks. The objective is to train a model such that not only does it have the minimum prediction error (high utility), but also it is the most robust model against its corresponding strongest inference attack (high privacy). We formalize this as a min-max game optimization problem, and design an adversarial training algorithm that minimizes the classification loss of the model as well as the maximum gain of the membership inference attack against it. This strategy, which guarantees membership privacy (as prediction indistinguishability), acts also as a strong regularizer and significantly generalizes the model. We evaluate our privacy mechanism on deep neural networks using different benchmark datasets. We show that our min-max strategy can mitigate the risk of membership inference attacks (close to the random guess) with a negligible cost in terms of the classification error.

Submitted 16 July, 2018; originally announced July 2018.

arXiv:1806.11108 [pdf, other]

Privacy of Dependent Users Against Statistical Matching

Authors: Nazanin Takbiri, Amir Houmansadr, Dennis L. Goeckel, Hossein Pishro-Nik

Abstract: Modern applications significantly enhance user experience by adapting to each user's individual condition and/or preferences. While this adaptation can greatly improve a user's experience or be essential for the application to work, the exposure of user data to the application presents a significant privacy threat to the users, even when the traces are anonymized, since the statistical matching of an anonymized trace to prior user behavior can identify a user and their habits. Because of the current and growing algorithmic and computational capabilities of adversaries, provable privacy guarantees as a function of the degree of anonymization and obfuscation of the traces are necessary. Our previous work has established the requirements on anonymization and obfuscation in the case that data traces are independent between users. However, the data traces of different users will be dependent in many applications, and an adversary can potentially exploit such dependency. In this paper, we consider the impact of dependency between user traces on their privacy. First, we demonstrate that the adversary can readily identify the association graph of the obfuscated and anonymized version of the data, revealing which user data traces are dependent. Next, we demonstrate that the adversary can use this association graph to break user privacy with significantly shorter traces than in the case of independent users, and that obfuscating data traces independently across users is often insufficient to remedy such leakage. Finally, we discuss how users can improve privacy by employing joint obfuscation that removes or reduces the data dependency.

Submitted 29 May, 2019; v1 submitted 28 June, 2018; originally announced June 2018.

Comments: Submitted to IEEE Transaction on Information Theory

arXiv:1805.01296 [pdf, other]

Privacy against Statistical Matching: Inter-User Correlation

Authors: Nazanin Takbiri, Amir Houmansadr, Dennis L. Goeckel, Hossein Pishro-Nik

Abstract: Modern applications significantly enhance user experience by adapting to each user's individual condition and/or preferences. While this adaptation can greatly improve utility or be essential for the application to work (e.g., for ride-sharing applications), the exposure of user data to the application presents a significant privacy threat to the users, even when the traces are anonymized, since the statistical matching of an anonymized trace to prior user behavior can identify a user and their habits. Because of the current and growing algorithmic and computational capabilities of adversaries, provable privacy guarantees as a function of the degree of anonymization and obfuscation of the traces are necessary. Our previous work has established the requirements on anonymization and obfuscation in the case that data traces are independent between users. However, the data traces of different users will be dependent in many applications, and an adversary can potentially exploit such dependency. In this paper, we consider the impact of correlation between user traces on their privacy. First, we demonstrate that the adversary can readily identify the association graph, revealing which user data traces are correlated. Next, we demonstrate that the adversary can use this association graph to break user privacy with significantly shorter traces than in the case when traces are independent between users, and that independent obfuscation of the data traces is often insufficient to remedy such leakage. Finally, we discuss how the users can employ dependency in their obfuscation to improve their privacy.

Submitted 27 June, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

Comments: arXiv admin note: text overlap with arXiv:1702.02701 and arXiv:1710.00197

Journal ref: ISIT 2018

arXiv:1711.10079 [pdf, other]

doi 10.1109/ACSSC.2017.8335179

Towards Provably Invisible Network Flow Fingerprints

Authors: Ramin Soltani, Dennis Goeckel, Don Towsley, Amir Houmansadr

Abstract: Network traffic analysis reveals important information even when messages are encrypted. We consider active traffic analysis via flow fingerprinting by invisibly embedding information into packet timings of flows. In particular, assume Alice wishes to embed fingerprints into flows of a set of network input links, whose packet timings are modeled by Poisson processes, without being detected by a watchful adversary Willie. Bob, who receives the set of fingerprinted flows after they pass through the network modeled as a collection of independent and parallel $M/M/1$ queues, wishes to extract Alice's embedded fingerprints to infer the connection between input and output links of the network. We consider two scenarios: 1) Alice embeds fingerprints in all of the flows; 2) Alice embeds fingerprints in each flow independently with probability $p$. Assuming that the flow rates are equal, we calculate the maximum number of flows in which Alice can invisibly embed fingerprints while having those fingerprints successfully decoded by Bob. Then, we extend the construction and analysis to the case where flow rates are distinct, and discuss the extension of the network model.

Submitted 22 September, 2018; v1 submitted 27 November, 2017; originally announced November 2017.

arXiv:1710.00197 [pdf, other]

Matching Anonymized and Obfuscated Time Series to Users' Profiles

Authors: Nazanin Takbiri, Amir Houmansadr, Dennis L. Goeckel, Hossein Pishro-Nik

Abstract: Many popular applications use traces of user data to offer various services to their users. However, even if user data is anonymized and obfuscated, a user's privacy can be compromised through the use of statistical matching techniques that match a user trace to prior user behavior. In this work, we derive the theoretical bounds on the privacy of users in such a scenario. We build on our recent study in the area of location privacy, in which we introduced formal notions of location privacy for anonymization-based location privacy-protection mechanisms. Here we derive the fundamental limits of user privacy when both anonymization and obfuscation-based protection mechanisms are applied to users' time series of data. We investigate the impact of such mechanisms on the trade-off between privacy protection and user utility. We first study achievability results for the case where the time-series of users are governed by an i.i.d. process. The converse results are proved both for the i.i.d. case as well as the more general Markov chain model. We demonstrate that as the number of users in the network grows, the obfuscation-anonymization plane can be divided into two regions: in the first region, all users have perfect privacy; and, in the second region, no user has privacy.

Submitted 27 June, 2018; v1 submitted 30 September, 2017; originally announced October 2017.

Comments: 48 pages, 12 figures, Submitted to IEEE Transactions on Information Theory

arXiv:1709.04030 [pdf, other]

Enemy At the Gateways: A Game Theoretic Approach to Proxy Distribution

Authors: Milad Nasr, Sadegh Farhang, Amir Houmansadr, Jens Grossklags

Abstract: A core technique used by popular proxy-based circumvention systems like Tor, Psiphon, and Lantern is to secretly share the IP addresses of circumvention proxies with the censored clients for them to be able to use such systems. For instance, such secretly shared proxies are known as bridges in Tor. However, a key challenge to this mechanism is the insider attack problem: censoring agents can impersonate benign censored clients in order to obtain (and then block) such secretly shared circumvention proxies. In this paper, we perform a fundamental study on the problem of insider attack on proxy-based circumvention systems. We model the proxy distribution problem using game theory, based on which we derive the optimal strategies of the parties involved, i.e., the censors and circumvention system operators. That is, we derive the optimal proxy distribution mechanism of a circumvention system like Tor, against a censorship adversary who also follows an optimal censorship strategy. This is unlike previous works that design ad hoc mechanisms for proxy distribution, against non-optimal censors. We perform extensive simulations to evaluate our optimal proxy assignment algorithm under various adversarial and network settings. Compared with the state-of-the-art prior work, we show that our optimal proxy assignment algorithm has superior performance, i.e., better resistance to censorship even against the strongest censorship adversary who takes optimal actions. We conclude with lessons and recommendations for the design of proxy-based circumvention systems.

Submitted 12 September, 2017; originally announced September 2017.

arXiv:1610.05210 [pdf, other]

Achieving Perfect Location Privacy in Wireless Devices Using Anonymization

Authors: Zarrin Montazeri, Amir Houmansadr, Hossein Pishro-Nik

Abstract: The popularity of mobile devices and location-based services (LBS) has created great concern regarding the location privacy of their users. Anonymization is a common technique that is often used to protect the location privacy of LBS users. Here, we present an information-theoretic approach to define the notion of perfect location privacy. We show how LBSs should use the anonymization method to ensure that their users can achieve perfect location privacy. First, we assume that a user's current location is independent from her past locations. Using this i.i.d. model, we show that if the pseudonym of the user is changed before $O(n^{\frac{2}{r-1}})$ observations are made by the adversary for that user, then the user has perfect location privacy. Here, $n$ is the number of users in the network and $r$ is the number of all possible locations that users can go to. Next, we model users' movements using Markov chains to better model real-world movement patterns. We show that perfect location privacy is achievable for a user if the user's pseudonym is changed before $O(n^{\frac{2}{|E|-r}})$ observations are collected by the adversary for the user, where $|E|$ is the number of edges in the user's Markov chain model.

Submitted 19 January, 2017; v1 submitted 17 October, 2016; originally announced October 2016.

Comments: 12 pages, 3 figures

arXiv:1610.00381 [pdf, other]

doi 10.1109/ALLERTON.2015.7447124

Covert Communications on Poisson Packet Channels

Authors: Ramin Soltani, Dennis Goeckel, Don Towsley, Amir Houmansadr

Abstract: Consider a channel where authorized transmitter Jack sends packets to authorized receiver Steve according to a Poisson process with rate $\lambda$ packets per second for a time period $T$. Suppose that covert transmitter Alice wishes to communicate information to covert receiver Bob on the same channel without being detected by a watchful adversary Willie. We consider two scenarios. In the first scenario, we assume that warden Willie cannot look at packet contents but rather can only observe packet timings, and Alice must send information by inserting her own packets into the channel. We show that the number of packets that Alice can covertly transmit to Bob is on the order of the square root of the number of packets that Jack transmits to Steve; conversely, if Alice transmits more than that, she will be detected by Willie with high probability. In the second scenario, we assume that Willie can look at packet contents but that Alice can communicate across an $M/M/1$ queue to Bob by altering the timings of the packets going from Jack to Steve. First, Alice builds a codebook, with each codeword consisting of a sequence of packet timings to be employed for conveying the information associated with that codeword. However, to successfully employ this codebook, Alice must always have a packet to send at the appropriate time. Hence, leveraging our result from the first scenario, we propose a construction where Alice covertly slows down the packet stream so as to buffer packets to use during a succeeding codeword transmission phase. Using this approach, Alice can covertly and reliably transmit $\mathcal{O}(\lambda T)$ covert bits to Bob in time period $T$ over an $M/M/1$ queue with service rate $\mu > \lambda$.

Submitted 27 November, 2017; v1 submitted 2 October, 2016; originally announced October 2016.

Comments: Allerton 2015 submission, minor edits. arXiv

arXiv:1610.00368 [pdf, other]

Covert Communications on Renewal Packet Channels

Authors: Ramin Soltani, Dennis Goeckel, Don Towsley, Amir Houmansadr

Abstract: Security and privacy are major concerns in modern communication networks. In recent years, the information theory of covert communications, where the very presence of the communication is undetectable to a watchful and determined adversary, has been of great interest. This emerging body of work has focused on additive white Gaussian noise (AWGN), discrete memoryless channels (DMCs), and optical channels. In contrast, our recent work introduced the information-theoretic limits for covert communications over packet channels whose packet timings are governed by a Poisson point process. However, actual network packet arrival times do not generally conform to the Poisson process assumption, and thus here we consider the extension of our work to timing channels characterized by more general renewal processes of rate $\lambda$. We consider two scenarios. In the first scenario, the source of the packets on the channel cannot be authenticated by Willie, and therefore Alice can insert packets into the channel. We show that if the total number of transmitted packets by Jack is $N$, Alice can covertly insert $\mathcal{O}\left(\sqrt{N}\right)$ packets and, if she transmits more, she will be detected by Willie. In the second scenario, packets are authenticated by Willie but we assume that Alice and Bob share a secret key; hence, Alice alters the timings of the packets according to a pre-shared codebook with Bob to send information to him over a $G/M/1$ queue with service rate $\mu > \lambda$. We show that Alice can covertly and reliably transmit $\mathcal{O}(N)$ bits to Bob when the total number of packets sent from Jack to Steve is $N$.

Submitted 27 November, 2017; v1 submitted 2 October, 2016; originally announced October 2016.

Comments: Contains details of an Allerton 2016 submission arXiv:1610.00381

arXiv:1211.3191 [pdf, ps, other]

SWEET: Serving the Web by Exploiting Email Tunnels

Authors: Amir Houmansadr, Wenxuan Zhou, Matthew Caesar, Nikita Borisov

Abstract: Open communication over the Internet poses a serious threat to countries with repressive regimes, leading them to develop and deploy censorship mechanisms within their networks. Unfortunately, existing censorship circumvention systems do not provide high availability guarantees to their users, as censors can identify, hence disrupt, the traffic belonging to these systems using today's advanced censorship technologies. In this paper we propose SWEET, a highly available censorship-resistant infrastructure. SWEET works by encapsulating a censored user's traffic to a proxy server inside email messages that are carried over by public email service providers, like Gmail and Yahoo Mail. As the operation of SWEET is not bound to specific email providers, we argue that a censor will need to block all email communications in order to disrupt SWEET, which is infeasible as email constitutes an important part of today's Internet. Through experiments with a prototype of our system we find that SWEET's performance is sufficient for web traffic. In particular, regular websites are downloaded within a couple of seconds.

Submitted 17 December, 2012; v1 submitted 13 November, 2012; originally announced November 2012.

arXiv:1207.2683 [pdf, ps, other]

IP over Voice-over-IP for censorship circumvention

Authors: Amir Houmansadr, Thomas Riedl, Nikita Borisov, Andrew Singer

Abstract: Open communication over the Internet poses a serious threat to countries with repressive regimes, leading them to develop and deploy network-based censorship mechanisms within their networks. Existing censorship circumvention systems face different difficulties in providing unobservable communication with their clients; this limits their availability and poses threats to their users. To provide the required unobservability, several recent circumvention systems suggest modifying Internet routers running outside the censored region to intercept and redirect packets to censored destinations. However, these approaches require modifications to ISP networks, and hence require cooperation from ISP operators and/or network equipment vendors, presenting a substantial deployment challenge. In this report we propose a deployable and unobservable censorship-resistant infrastructure, called FreeWave. FreeWave works by modulating a client's Internet connections into acoustic signals that are carried over VoIP connections. Such VoIP connections are targeted to a server, the FreeWave server, that extracts the tunneled traffic of clients and proxies it to the uncensored Internet. The use of actual VoIP connections, as opposed to traffic morphing, allows FreeWave to relay its VoIP connections through oblivious VoIP nodes, hence keeping itself unblockable from censors that perform IP address blocking. Also, the use of end-to-end encryption prevents censors from identifying FreeWave's VoIP connections using packet content filtering technologies, like deep-packet inspection. We prototype the designed FreeWave system over the popular VoIP system of Skype. We show that FreeWave is able to reliably achieve communication bandwidths that are sufficient for web browsing, even when clients are far from the FreeWave server.

Submitted 17 December, 2012; v1 submitted 11 July, 2012; originally announced July 2012.

arXiv:1203.2273 [pdf, ps, other]

Non-blind watermarking of network flows

Authors: Amir Houmansadr, Negar Kiyavash, Nikita Borisov

Abstract: Linking network flows is an important problem in intrusion detection as well as anonymity. Passive traffic analysis can link flows but requires long periods of observation to reduce errors. Active traffic analysis, also known as flow watermarking, allows for better precision and is more scalable. Previous flow watermarks introduce significant delays to the traffic flow as a side effect of using a blind detection scheme; this enables attacks that detect and remove the watermark, while at the same time slowing down legitimate traffic. We propose the first non-blind approach for flow watermarking, called RAINBOW, that improves watermark invisibility by inserting delays hundreds of times smaller than previous blind watermarks, hence reducing the watermark interference on network flows. We derive and analyze the optimum detectors for RAINBOW as well as the passive traffic analysis under different traffic models by using hypothesis testing. Comparing the detection performance of RAINBOW and the passive approach, we observe that both RAINBOW and passive traffic analysis perform similarly well in the case of uncorrelated traffic; however, the RAINBOW detector drastically outperforms the optimum passive detector in the case of correlated network flows. This justifies the use of non-blind watermarks over passive traffic analysis even though both approaches have similar scalability constraints. We confirm our analysis by simulating the detectors and testing them against large traces of real network flows.

Submitted 10 March, 2012; originally announced March 2012.

arXiv:1203.1673 [pdf, ps, other]

CensorSpoofer: Asymmetric Communication with IP Spoofing for Censorship-Resistant Web Browsing

Authors: Qiyan Wang, Xun Gong, Giang T. K. Nguyen, Amir Houmansadr, Nikita Borisov

Abstract: A key challenge in censorship-resistant web browsing is being able to direct legitimate users to redirection proxies while preventing censors, posing as insiders, from discovering their addresses and blocking them. We propose a new framework for censorship-resistant web browsing called CensorSpoofer that addresses this challenge by exploiting the asymmetric nature of web browsing traffic and making use of IP spoofing. CensorSpoofer decouples the upstream and downstream channels, using a low-bandwidth indirect channel for delivering outbound requests (URLs) and a high-bandwidth direct channel for downloading web content. The upstream channel hides the request contents using steganographic encoding within email or instant messages, whereas the downstream channel uses IP address spoofing so that the real address of the proxies is not revealed either to legitimate users or censors. We built a proof-of-concept prototype that uses encrypted VoIP for this downstream channel and demonstrated the feasibility of using the CensorSpoofer framework in a realistic environment.

Submitted 9 March, 2012; v1 submitted 7 March, 2012; originally announced March 2012.

arXiv:1203.1568 [pdf, ps, other]

BotMosaic: Collaborative Network Watermark for Botnet Detection

Authors: Amir Houmansadr, Nikita Borisov

Abstract: Recent research has made great strides in the field of detecting botnets. However, botnets of all kinds continue to plague the Internet, as many ISPs and organizations do not deploy these techniques. We aim to mitigate this state by creating a very low-cost method of detecting infected bot hosts. Our approach is to leverage the botnet detection work carried out by some organizations to easily locate collaborating bots elsewhere. We created BotMosaic as a countermeasure to IRC-based botnets. BotMosaic relies on captured bot instances controlled by a watermarker, who inserts a particular pattern into their network traffic. This pattern can then be detected at a very low cost by client organizations and the watermark can be tuned to provide acceptable false-positive rates. A novel feature of the watermark is that it is inserted collaboratively into the flows of multiple captured bots at once, in order to ensure the signal is strong enough to be detected. BotMosaic can also be used to detect stepping stones and to help trace back to the botmaster. It is content agnostic and can operate on encrypted traffic. We evaluate BotMosaic using simulations and a testbed deployment.

Submitted 8 March, 2012; v1 submitted 7 March, 2012; originally announced March 2012.
