Search | arXiv e-print repository

DiffAudit: Auditing Privacy Practices of Online Services for Children and Adolescents

Authors: Olivia Figueira, Rahmadi Trimananda, Athina Markopoulou, Scott Jordan

Abstract: Children's and adolescents' online data privacy is regulated by laws such as the Children's Online Privacy Protection Act (COPPA) and the California Consumer Privacy Act (CCPA). Online services that are directed towards general audiences (i.e., including children, adolescents, and adults) must comply with these laws. In this paper, first, we present DiffAudit, a platform-agnostic privacy auditing methodology for general audience services. DiffAudit performs differential analysis of network traffic data flows to compare data processing practices (i) between child, adolescent, and adult users and (ii) before and after consent is given and user age is disclosed. We also present a data type classification method that utilizes GPT-4 and our data type ontology based on COPPA and CCPA, allowing us to identify considerably more data types than prior work. Second, we apply DiffAudit to a set of popular general audience mobile and web services and observe a rich set of behaviors extracted from over 440K outgoing requests, containing 3,968 unique data types we extracted and classified. We reveal problematic data processing practices prior to consent and age disclosure, lack of differentiation between age-specific data flows, inconsistent privacy policy disclosures, and sharing of linkable data with third parties, including advertising and tracking services.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2405.12590 [pdf, other]

Maverick-Aware Shapley Valuation for Client Selection in Federated Learning

Authors: Mengwei Yang, Ismat Jarin, Baturalp Buyukates, Salman Avestimehr, Athina Markopoulou

Abstract: Federated Learning (FL) allows clients to train a model collaboratively without sharing their private data. One key challenge in practical FL systems is data heterogeneity, particularly in handling clients with rare data, also referred to as Mavericks. These clients own one or more data classes exclusively, and the model performance becomes poor without their participation. Thus, utilizing Mavericks throughout training is crucial. In this paper, we first design a Maverick-aware Shapley valuation that fairly evaluates the contribution of Mavericks. The main idea is to compute the clients' Shapley values (SV) class-wise, i.e., per label. Next, we propose FedMS, a Maverick-Shapley client selection mechanism for FL that intelligently selects the clients that contribute the most in each round, by employing our Maverick-aware SV-based contribution score. We show that, compared to an extensive list of baselines, FedMS achieves better model performance and fairer Shapley Rewards distribution.

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2312.17370 [pdf, other]

Seqnature: Extracting Network Fingerprints from Packet Sequences

Authors: Janus Varmarken, Rahmadi Trimananda, Athina Markopoulou

Abstract: This paper proposes a general network fingerprinting framework, Seqnature, that uses packet sequences as its basic data unit and that makes it simple to implement any fingerprinting technique that can be formulated as a problem of identifying packet exchanges that consistently occur when the fingerprinted event is triggered. We demonstrate the versatility of Seqnature by using it to implement five different fingerprinting techniques, as special cases of the framework, which broadly fall into two categories: (i) fingerprinting techniques that consider features of each individual packet in a packet sequence, e.g., size and direction; and (ii) fingerprinting techniques that only consider stream-wide features, specifically what Internet endpoints are contacted. We illustrate how Seqnature facilitates comparisons of the relative performance of different fingerprinting techniques by applying the five fingerprinting techniques to datasets from the literature. The results confirm findings in prior work, for example that endpoint information alone is insufficient to differentiate between individual events on Internet of Things devices, but also show that smart TV app fingerprints based exclusively on endpoint information are not as distinct as previously reported.

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.04470 [pdf, other]

GaitGuard: Towards Private Gait in Mixed Reality

Authors: Diana Romero, Ruchi Jagdish Patel, Athina Markopoulou, Salma Elmalaki

Abstract: Augmented/Mixed Reality (AR/MR) technologies offer a new era of immersive, collaborative experiences, distinctively setting them apart from conventional mobile systems. However, as we further investigate the privacy and security implications within these environments, the issue of gait privacy emerges as a critical yet underexplored concern. Given its uniqueness as a biometric identifier that can be correlated to several sensitive attributes, the protection of gait information becomes crucial in preventing potential identity tracking and unauthorized profiling within these systems. In this paper, we conduct a user study with 20 participants to assess the risk of individual identification through gait feature analysis extracted from video feeds captured by MR devices. Our results show the capability to uniquely identify individuals with an accuracy of up to 92%, underscoring an urgent need for effective gait privacy protection measures. Through rigorous evaluation, we present a comparative analysis of various mitigation techniques, addressing both aware and unaware adversaries, in terms of their utility and impact on privacy preservation. From these evaluations, we introduce GaitGuard, the first real-time framework designed to protect the privacy of gait features within the camera view of AR/MR devices. Our evaluations of GaitGuard within an MR collaborative scenario demonstrate its effectiveness in implementing mitigation that reduces the risk of identification by up to 68%, while maintaining a minimal latency of merely 118.77 ms, thus marking a critical step forward in safeguarding privacy within AR/MR ecosystems.

Submitted 4 June, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: 21 pages, 17 figures

arXiv:2310.19958 [pdf, other]

PriPrune: Quantifying and Preserving Privacy in Pruned Federated Learning

Authors: Tianyue Chu, Mengwei Yang, Nikolaos Laoutaris, Athina Markopoulou

Abstract: Federated learning (FL) is a paradigm that allows several client devices and a server to collaboratively train a global model, by exchanging only model updates, without the devices sharing their local training data. These devices are often constrained in terms of communication and computation resources, and can further benefit from model pruning -- a paradigm that is widely used to reduce the size and complexity of models. Intuitively, by making local models coarser, pruning is expected to also provide some protection against privacy attacks in the context of FL. However, this protection has not been previously characterized, formally or experimentally, and it is unclear if it is sufficient against state-of-the-art attacks. In this paper, we perform the first investigation of privacy guarantees for model pruning in FL. We derive information-theoretic upper bounds on the amount of information leaked by pruned FL models. We complement and validate these theoretical findings with comprehensive experiments that involve state-of-the-art privacy attacks, on several state-of-the-art FL pruning schemes, using benchmark datasets. This evaluation provides valuable insights into the choices and parameters that can affect the privacy protection provided by pruning. Based on these insights, we introduce PriPrune -- a privacy-aware algorithm for local model pruning, which uses a personalized per-client defense mask and adapts the defense pruning rate so as to jointly optimize privacy and model performance. PriPrune is universal in that it can be applied after any pruned FL scheme on the client, without modification, and protects against any inversion attack by the server. Our empirical evaluation demonstrates that PriPrune significantly improves the privacy-accuracy tradeoff compared to state-of-the-art pruned FL schemes that do not take privacy into account.

Submitted 22 December, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2308.07304 [pdf, other]

BehaVR: User Identification Based on VR Sensor Data

Authors: Ismat Jarin, Yu Duan, Rahmadi Trimananda, Hao Cui, Salma Elmalaki, Athina Markopoulou

Abstract: Virtual reality (VR) platforms enable a wide range of applications but pose unique privacy risks. In particular, VR devices are equipped with a rich set of sensors that collect personal and sensitive information (e.g., body motion, eye gaze, hand joints, and facial expression), which can be used to uniquely identify a user, even without explicit identifiers. In this paper, we are interested in understanding the extent to which a user can be identified based on data collected by different VR sensors. We consider adversaries with capabilities that range from observing APIs available within a single VR app (app adversary) to observing all, or selected, sensor measurements across all apps on the VR device (device adversary). To that end, we introduce BEHAVR, a framework for collecting and analyzing data from all sensor groups collected by all apps running on a VR device. We use BEHAVR to perform a user study and collect data from real users that interact with popular real-world apps. We use that data to build machine learning models for user identification, with features extracted from sensor data available within and across apps. We show that these models can identify users with an accuracy of up to 100%, and we reveal the most important features and sensor groups, depending on the functionality of the app and the strength of the adversary, as well as the minimum time needed for user identification. To the best of our knowledge, BEHAVR is the first to analyze user identification in VR comprehensively, i.e., considering jointly all sensor measurements available on a VR device (whether within an app or across multiple apps), collected by real-world, as opposed to custom-made, apps.

Submitted 14 August, 2023; originally announced August 2023.

arXiv:2303.17740 [pdf, other]

A CI-based Auditing Framework for Data Collection Practices

Authors: Athina Markopoulou, Rahmadi Trimananda, Hao Cui

Abstract: Apps and devices (mobile devices, web browsers, IoT, VR, voice assistants, etc.) routinely collect user data, and send them to first- and third-party servers through the network. Recently, there has been a lot of interest in (1) auditing the actual data collection practices of those systems; and also in (2) checking the consistency of those practices against the statements made in the corresponding privacy policies. In this paper, we argue that the contextual integrity (CI) tuple can be the basic building block for defining and implementing such an auditing framework. We elaborate on the special case where the tuple is partially extracted from the network traffic generated by the end-device of interest, and partially from the corresponding privacy policies using natural language processing (NLP) techniques. Along the way, we discuss related bodies of work and representative examples that fit into that framework. More generally, we believe that CI can be the building block not only for auditing at the edge, but also for specifying privacy policies and system APIs. We also discuss limitations and directions for future work.

Submitted 30 March, 2023; originally announced March 2023.

Comments: 5 pages, 5 figures. The paper was first presented at the 4th Annual Symposium on Applications of Contextual Integrity, NYC, Sept. 2022

arXiv:2210.06746 [pdf, other]

PoliGraph: Automated Privacy Policy Analysis using Knowledge Graphs

Authors: Hao Cui, Rahmadi Trimananda, Athina Markopoulou, Scott Jordan

Abstract: Privacy policies disclose how an organization collects and handles personal information. Recent work has made progress in leveraging natural language processing (NLP) to automate privacy policy analysis and extract data collection statements from different sentences, considered in isolation from each other. In this paper, we view and analyze, for the first time, the entire text of a privacy policy in an integrated way. In terms of methodology: (1) we define PoliGraph, a type of knowledge graph that captures statements in a privacy policy as relations between different parts of the text; and (2) we develop an NLP-based tool, PoliGraph-er, to automatically extract PoliGraph from the text. In addition, (3) we revisit the notion of ontologies, previously defined in heuristic ways, to capture subsumption relations between terms. We make a clear distinction between local and global ontologies to capture the context of individual privacy policies, application domains, and privacy laws. Using a public dataset for evaluation, we show that PoliGraph-er identifies 40% more collection statements than prior state-of-the-art, with 97% precision. In terms of applications, PoliGraph enables automated analysis of a corpus of privacy policies and allows us to: (1) reveal common patterns in the texts across different privacy policies, and (2) assess the correctness of the terms as defined within a privacy policy. We also apply PoliGraph to: (3) detect contradictions in a privacy policy, where we show false alarms by prior work, and (4) analyze the consistency of privacy policies and network traffic, where we identify significantly more clear disclosures than prior work.

Submitted 20 June, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: 24 pages, 15 figures (including subfigures), 9 tables. This is the extended version of the paper with the same title published at USENIX Security '23

arXiv:2204.10920 [pdf, other]

doi 10.1145/3618257.3624803

Tracking, Profiling, and Ad Targeting in the Alexa Echo Smart Speaker Ecosystem

Authors: Umar Iqbal, Pouneh Nikkhah Bahrami, Rahmadi Trimananda, Hao Cui, Alexander Gamero-Garrido, Daniel Dubois, David Choffnes, Athina Markopoulou, Franziska Roesner, Zubair Shafiq

Abstract: Smart speakers collect voice commands, which can be used to infer sensitive information about users. Given the potential for privacy harms, there is a need for greater transparency and control over the data collected, used, and shared by smart speaker platforms as well as third party skills supported on them. To bridge this gap, we build a framework to measure data collection, usage, and sharing by the smart speaker platforms. We apply our framework to the Amazon smart speaker ecosystem. Our results show that Amazon and third parties, including advertising and tracking services that are unique to the smart speaker ecosystem, collect smart speaker interaction data. We also find that Amazon processes smart speaker interaction data to infer user interests and uses those inferences to serve targeted ads to users. Smart speaker interaction also leads to ad targeting and as much as 30X higher bids in ad auctions, from third party advertisers. Finally, we find that Amazon's and third party skills' data practices are often not clearly disclosed in their policy documents.

Submitted 13 October, 2023; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: Published at the ACM Internet Measurement Conference 2023

arXiv:2202.12872 [pdf, other]

AutoFR: Automated Filter Rule Generation for Adblocking

Authors: Hieu Le, Salma Elmalaki, Athina Markopoulou, Zubair Shafiq

Abstract: Adblocking relies on filter lists, which are manually curated and maintained by a community of filter list authors. Filter list curation is a laborious process that does not scale well to a large number of sites or over time. In this paper, we introduce AutoFR, a reinforcement learning framework to fully automate the process of filter rule creation and evaluation for sites of interest. We design an algorithm based on multi-arm bandits to generate filter rules that block ads while controlling the trade-off between blocking ads and avoiding visual breakage. We test AutoFR on thousands of sites and we show that it is efficient: it takes only a few minutes to generate filter rules for a site of interest. AutoFR is effective: it generates filter rules that can block 86% of the ads, as compared to 87% by EasyList, while achieving comparable visual breakage. Furthermore, AutoFR generates filter rules that generalize well to new sites. We envision that AutoFR can assist the adblocking community in filter rule generation at scale.

Submitted 7 March, 2023; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 16 pages with 13 figures, 3 tables, 1 algorithm. 3.5 pages of references. 10 pages of appendices with 11 figures and 3 tables

arXiv:2202.03679 [pdf, other]

A Unified Prediction Framework for Signal Maps

Authors: Emmanouil Alimpertis, Athina Markopoulou, Carter T. Butts, Evita Bakopoulou, Konstantinos Psounis

Abstract: Signal maps are essential for the planning and operation of cellular networks. However, the measurements needed to create such maps are expensive, often biased, not always reflecting the metrics of interest, and posing privacy risks. In this paper, we develop a unified framework for predicting cellular signal maps from limited measurements. Our framework builds on a state-of-the-art random-forest predictor, or any other base predictor. We propose and combine three mechanisms that deal with the fact that not all measurements are equally important for a particular prediction task. First, we design quality-of-service functions ($Q$), including signal strength (RSRP) but also other metrics of interest to operators, i.e., coverage and call drop probability. By implicitly altering the loss function employed in learning, quality functions can also improve prediction for RSRP itself where it matters (e.g., MSE reduction up to 27% in the low signal strength regime, where errors are critical). Second, we introduce weight functions ($W$) to specify the relative importance of prediction at different locations and other parts of the feature space. We propose re-weighting based on importance sampling to obtain unbiased estimators when the sampling and target distributions are different. This yields improvements up to 20% for targets based on spatially uniform loss or losses based on user population density. Third, we apply the Data Shapley framework for the first time in this context: to assign values ($φ$) to individual measurement points, which capture the importance of their contribution to the prediction task. This improves prediction (e.g., from 64% to 94% in recall for coverage loss) by removing points with negative values, and can also enable data minimization. We evaluate our methods and demonstrate significant improvement in prediction performance, using several real-world datasets.

Submitted 12 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: Coverage Maps; Signal Strength Maps; LTE; RSRP; CQI; RSRQ; RSS; Importance Sampling; Random Forests; Carrier's Objectives; Call Drops; Key Performance Indicators

arXiv:2112.03452 [pdf, other]

doi 10.1109/TMC.2023.3332034

Location Leakage in Federated Signal Maps

Authors: Evita Bakopoulou, Mengwei Yang, Jiang Zhang, Konstantinos Psounis, Athina Markopoulou

Abstract: We consider the problem of predicting cellular network performance (signal maps) from measurements collected by several mobile devices. We formulate the problem within the online federated learning framework: (i) federated learning (FL) enables users to collaboratively train a model, while keeping their training data on their devices; (ii) measurements are collected as users move around over time and are used for local training in an online fashion. We consider an honest-but-curious server, who observes the updates from target users participating in FL and infers their location using a deep leakage from gradients (DLG) type of attack, originally developed to reconstruct training data of DNN image classifiers. We make the key observation that a DLG attack, applied to our setting, infers the average location of a batch of local data, and can thus be used to reconstruct the target users' trajectory at a coarse granularity. We build on this observation to protect location privacy, in our setting, by revisiting and designing mechanisms within the federated learning framework including: tuning the FL parameters for averaging, curating local batches so as to mislead the DLG attacker, and aggregating across multiple users with different trajectories. We evaluate the performance of our algorithms through both analysis and simulation based on real-world mobile datasets, and we show that they achieve a good privacy-utility tradeoff.

Submitted 5 January, 2024; v1 submitted 6 December, 2021; originally announced December 2021.

arXiv:2106.05407 [pdf, other]

OVRseen: Auditing Network Traffic and Privacy Policies in Oculus VR

Authors: Rahmadi Trimananda, Hieu Le, Hao Cui, Janice Tran Ho, Anastasia Shuba, Athina Markopoulou

Abstract: Virtual reality (VR) is an emerging technology that enables new applications but also introduces privacy risks. In this paper, we focus on Oculus VR (OVR), the leading platform in the VR space and we provide the first comprehensive analysis of personal data exposed by OVR apps and the platform itself, from a combined networking and privacy policy perspective. We experimented with the Quest 2 headset and tested the most popular VR apps available on the official Oculus and the SideQuest app stores. We developed OVRseen, a methodology and system for collecting, analyzing, and comparing network traffic and privacy policies on OVR. On the networking side, we captured and decrypted network traffic of VR apps, which was previously not possible on OVR, and we extracted data flows, defined as <app, data type, destination>. Compared to the mobile and other app ecosystems, we found OVR to be more centralized and driven by tracking and analytics, rather than by third-party advertising. We show that the data types exposed by VR apps include personally identifiable information (PII), device information that can be used for fingerprinting, and VR-specific data types. By comparing the data flows found in the network traffic with statements made in the apps' privacy policies, we found that approximately 70% of OVR data flows were not properly disclosed. Furthermore, we extracted additional context from the privacy policies, and we observed that 69% of the data flows were used for purposes unrelated to the core functionality of apps.

Submitted 19 November, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: This is the extended version of the paper with the same title published at USENIX Security Symposium 2022

arXiv:2008.08973 [pdf, other]

Exposures Exposed: A Measurement and User Study to Assess Mobile Data Privacy in Context

Authors: Evita Bakopoulou, Anastasia Shuba, Athina Markopoulou

Abstract: Mobile devices have access to personal, potentially sensitive data, and there is a large number of mobile applications and third-party libraries that transmit this information over the network to remote servers (including app developer servers and third party servers). In this paper, we are interested in better understanding not just the extent of personally identifiable information (PII) exposure, but also its context (i.e., functionality of the app, destination server, encryption used, etc.) and the risk perceived by mobile users today. To that end we take two steps. First, we perform a measurement study: we collect a new dataset via manual and automatic testing and capture the exposure of 16 PII types from 400 most popular Android apps. We analyze these exposures and provide insights into the extent and patterns of mobile apps sharing PII, which can be later used for prediction and prevention. Second, we perform a user study with 220 participants on Amazon Mechanical Turk: we summarize the results of the measurement study in categories, present them in a realistic context, and assess users' understanding, concern, and willingness to take action. To the best of our knowledge, our user study is the first to collect and analyze user input in such fine granularity and on actual (not just potential or permitted) privacy exposures on mobile devices. Although many users did not initially understand the full implications of their PII being exposed, after being better informed through the study, they became appreciative and interested in better privacy practices.

Submitted 6 June, 2022; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: arXiv admin note: text overlap with arXiv:1803.01261

arXiv:1911.03447 [pdf, other]

The TV is Smart and Full of Trackers: Towards Understanding the Smart TV Advertising and Tracking Ecosystem

Authors: Janus Varmarken, Hieu Le, Anastasia Shuba, Zubair Shafiq, Athina Markopoulou

Abstract: Motivated by the growing popularity of smart TVs, we present a large-scale measurement study of smart TVs by collecting and analyzing their network traffic from two different vantage points. First, we analyze aggregate network traffic of smart TVs in-the-wild, collected from residential gateways of tens of homes and several different smart TV platforms, including Apple, Samsung, Roku, and Chromecast. In addition to accessing video streaming and cloud services, we find that smart TVs frequently connect to well-known as well as platform-specific advertising and tracking services (ATS). Second, we instrument Roku and Amazon Fire TV, two popular smart TV platforms, by setting up a controlled testbed to systematically exercise the top-1000 apps on each platform, and analyze their network traffic at the granularity of the individual apps. We again find that smart TV apps connect to a wide range of ATS, and that the key players of the ATS ecosystems of the two platforms are different from each other and from that of the mobile platform. Third, we evaluate the (in)effectiveness of state-of-the-art DNS-based blocklists in filtering advertising and tracking traffic for smart TVs. We find that personally identifiable information (PII) is exfiltrated to platform-related Internet endpoints and third parties, and that blocklists are generally better at preventing exposure of PII to third parties than to platform-related endpoints. Our work demonstrates the segmentation of the smart TV ATS ecosystem across platforms and its differences from the mobile ATS ecosystem, thus motivating the need for designing privacy-enhancing tools specifically for each smart TV platform.

Submitted 8 November, 2019; originally announced November 2019.

arXiv:1907.13113 [pdf, other]

A Federated Learning Approach for Mobile Packet Classification

Authors: Evita Bakopoulou, Balint Tillman, Athina Markopoulou

Abstract: In order to improve mobile data transparency, a number of network-based approaches have been proposed to inspect packets generated by mobile devices and detect personally identifiable information (PII), ad requests, or other activities. State-of-the-art approaches train classifiers based on features extracted from HTTP packets. So far, these classifiers have only been trained in a centralized way, where mobile users label and upload their packet logs to a central server, which then trains a global classifier and shares it with the users to apply on their devices. However, packet logs used as training data may contain sensitive information that users may not want to share/upload. In this paper, we apply, for the first time, a Federated Learning approach to mobile packet classification, which allows mobile devices to collaborate and train a global model, without sharing raw training data. Methodological challenges we address in this context include: model and feature selection, and tuning the Federated Learning parameters. We apply our framework to two different packet classification tasks (i.e., to predict PII exposure or ad requests in HTTP packets) and we demonstrate its effectiveness in terms of classification performance, communication and computation cost, using three real-world datasets.

Submitted 30 July, 2019; originally announced July 2019.

arXiv:1907.11797 [pdf, other]

PingPong: Packet-Level Signatures for Smart Home Device Events

Authors: Rahmadi Trimananda, Janus Varmarken, Athina Markopoulou, Brian Demsky

Abstract: Smart home devices are vulnerable to passive inference attacks based on network traffic, even in the presence of encryption. In this paper, we present PINGPONG, a tool that can automatically extract packet-level signatures for device events (e.g., light bulb turning ON/OFF) from network traffic. We evaluated PINGPONG on popular smart home devices ranging from smart plugs and thermostats to cameras, voice-activated devices, and smart TVs. We were able to: (1) automatically extract previously unknown signatures that consist of simple sequences of packet lengths and directions; (2) use those signatures to detect the devices or specific events with an average recall of more than 97%; (3) show that the signatures are unique among hundreds of millions of packets of real world network traffic; (4) show that our methodology is also applicable to publicly available datasets; and (5) demonstrate its robustness in different settings: events triggered by local and remote smartphones, as well as by home automation systems.

Submitted 10 February, 2020; v1 submitted 26 July, 2019; originally announced July 2019.

Comments: This is the technical report for the paper titled Packet-Level Signatures for Smart Home Devices published at the Network and Distributed System Security (NDSS) Symposium 2020

arXiv:1803.01261 [pdf, other]

AntShield: On-Device Detection of Personal Information Exposure

Authors: Anastasia Shuba, Evita Bakopoulou, Milad Asgari Mehrabadi, Hieu Le, David Choffnes, Athina Markopoulou

Abstract: Mobile devices have access to personal, potentially sensitive data, and there is a growing number of applications that transmit this personally identifiable information (PII) over the network. In this paper, we present the AntShield system that performs on-device packet-level monitoring and detects the transmission of such sensitive information accurately and in real-time. A key insight is to distinguish PII that is predefined and is easily available on the device from PII that is unknown a priori but can be automatically detected by classifiers. Our system not only combines, for the first time, the advantages of on-device monitoring with the power of learning unknown PII, but also outperforms either of the two approaches alone. We demonstrate the real-time performance of our prototype as well as the classification performance using a dataset that we collect and analyze from scratch (including new findings in terms of leaks and patterns). AntShield is a first step towards enabling distributed learning of private information exposure.

Submitted 3 March, 2018; originally announced March 2018.

arXiv:1801.01715 [pdf, other]

Spectral Graph Forge: Graph Generation Targeting Modularity

Authors: Luca Baldesi, Athina Markopoulou, Carter T. Butts

Abstract: Community structure is an important property that captures inhomogeneities common in large networks, and modularity is one of the most widely used metrics for such community structure. In this paper, we introduce a principled methodology, the Spectral Graph Forge, for generating random graphs that preserve community structure from a real network of interest, in terms of modularity. Our approach leverages the fact that the spectral structure of matrix representations of a graph encodes global information about community structure. The Spectral Graph Forge uses a low-rank approximation of the modularity matrix to generate synthetic graphs that match a target modularity within a user-selectable degree of accuracy, while allowing other aspects of structure to vary. We show that the Spectral Graph Forge outperforms state-of-the-art techniques in terms of accuracy in targeting the modularity and randomness of the realizations, while also preserving other local structural properties and node attributes. We discuss extensions of the Spectral Graph Forge to target other properties beyond modularity, and its applications to anonymization.

Submitted 5 January, 2018; originally announced January 2018.

arXiv:1703.07340 [pdf, other]

Construction of Directed 2K Graphs

Authors: Bálint Tillman, Athina Markopoulou, Carter T. Butts, Minas Gjoka

Abstract: We study the problem of constructing synthetic graphs that resemble real-world directed graphs in terms of their degree correlations. We define the problem of directed 2K construction (D2K) that takes as input the directed degree sequence (DDS) and a joint degree and attribute matrix (JDAM) so as to capture degree correlation specifically in directed graphs. We provide necessary and sufficient conditions to decide whether a target D2K is realizable, and we design an efficient algorithm that creates realizations with that target D2K. We evaluate our algorithm in creating synthetic graphs that target real-world directed graphs (such as Twitter) and we show that it brings significant benefits compared to state-of-the-art approaches.

Submitted 21 March, 2017; originally announced March 2017.

arXiv:1611.04268 [pdf, other]

AntMonitor: A System for On-Device Mobile Network Monitoring and its Applications

Authors: Anastasia Shuba, Anh Le, Emmanouil Alimpertis, Minas Gjoka, Athina Markopoulou

Abstract: In this paper, we present a complete system for on-device passive monitoring, collection, and analysis of fine grained, large-scale packet measurements from mobile devices. First, we describe the design and implementation of AntMonitor as a userspace mobile app based on a VPN-service but only on the device (without the need to route through a remote VPN server) and using only the minimum resources required. We evaluate our prototype and show that it significantly outperforms prior state-of-the-art approaches: it achieves throughput of over 90 Mbps downlink and 65 Mbps uplink, which is 2x and 8x faster than mobile-only baselines and is 94% of the throughput without VPN, while using 2-12x less energy. Second, we show that AntMonitor is uniquely positioned to serve as a platform for passive on-device mobile network monitoring and to enable a number of applications, including: (i) real-time detection and prevention of private information leakage from the device to the network; (ii) passive network performance monitoring; and (iii) application classification and user profiling. We showcase preliminary results from a pilot user study at a university campus.

Submitted 4 April, 2017; v1 submitted 14 November, 2016; originally announced November 2016.

arXiv:1405.7348 [pdf, other]

ergm.graphlets: A Package for ERG Modeling Based on Graphlet Statistics

Authors: Omer Nebil Yaveroglu, Sean M. Fitzhugh, Maciej Kurant, Athina Markopoulou, Carter T. Butts, Natasa Przulj

Abstract: Exponential-family random graph models (ERGMs) are probabilistic network models that are parametrized by sufficient statistics based on structural (i.e., graph-theoretic) properties. The ergm package for the R statistical computing system is a collection of tools for the analysis of network data within an ERGM framework. Many different network properties can be employed as sufficient statistics for ERGMs by using the model terms defined in the ergm package; this functionality can be expanded by the creation of packages that code for additional network statistics. Here, our focus is on the addition of statistics based on graphlets. Graphlets are small, connected, and non-isomorphic induced subgraphs that describe the topological structure of a network. We introduce an R package called ergm.graphlets that enables the use of graphlet properties of a network within the ergm package of R. The ergm.graphlets package provides a complete list of model terms that allow statistics of any 2-, 3-, 4- and 5-node graphlet to be incorporated into ERGMs. The new model terms of the ergm.graphlets package enable both ERG modelling of global structural properties and investigation of relationships between nodal attributes (i.e., covariates) and local topologies around nodes.

Submitted 28 May, 2014; originally announced May 2014.

Comments: 32 pages, 9 figures, under review by Journal of Statistical Software

Journal ref: Journal of Statistical Software, Vol. 65, Issue 12, Jun 2015

arXiv:1405.3622 [pdf, other]

MicroCast: Cooperative Video Streaming using Cellular and D2D Connections

Authors: Anh Le, Lorenzo Keller, Hulya Seferoglu, Blerim Cici, Christina Fragouli, Athina Markopoulou

Abstract: We consider a group of mobile users, within proximity of each other, who are interested in watching the same online video at roughly the same time. The common practice today is that each user downloads the video independently on her mobile device using her own cellular connection, which wastes access bandwidth and may also lead to poor video quality. We propose a novel cooperative system where each mobile device uses simultaneously two network interfaces: (i) the cellular to connect to the video server and download parts of the video and (ii) WiFi to connect locally to all other devices in the group and exchange those parts. Devices cooperate to efficiently utilize all network resources and are able to adapt to varying wireless network conditions. In the local WiFi network, we exploit overhearing, and we further combine it with network coding. The end result is savings in cellular bandwidth and improved user experience (faster download) by a factor on the order up to the group size. We follow a complete approach, from theory to practice. First, we formulate the problem using a network utility maximization (NUM) framework, decompose the problem, and provide a distributed solution. Then, based on the structure of the NUM solution, we design a modular system called MicroCast and we implement it as an Android application. We provide both simulation results of the NUM solution and experimental evaluation of MicroCast on a testbed consisting of Android phones. We demonstrate that the proposed approach brings significant performance benefits without battery penalty.

Submitted 14 May, 2014; originally announced May 2014.

arXiv:1401.8244 [pdf, other]

On Routing-Optimal Network for Multiple Unicasts

Authors: Chun Meng, Minghua Chen, Athina Markopoulou

Abstract: In this paper, we consider networks with multiple unicast sessions. Generally, non-linear network coding is needed to achieve the whole rate region of network coding. Yet, there exist networks for which routing is sufficient to achieve the whole rate region, and we refer to them as routing-optimal networks. We identify a class of routing-optimal networks, which we refer to as information-distributive networks, defined by three topological features. Due to these features, for each rate vector achieved by network coding, there is always a routing scheme such that it achieves the same rate vector, and the traffic transmitted through the network is exactly the information transmitted over the cut-sets between the sources and the sinks in the corresponding network coding scheme. We present more examples of information-distributive networks, including some examples from index coding and single unicast with hard deadline constraint.

Submitted 30 April, 2014; v1 submitted 31 January, 2014; originally announced January 2014.

arXiv:1305.3876 [pdf, ps, other]

Assessing the Potential of Ride-Sharing Using Mobile and Social Data

Authors: Blerim Cici, Athina Markopoulou, Enrique Frías-Martínez, Nikolaos Laoutaris

Abstract: Ride-sharing on the daily home-work-home commute can help individuals save on gasoline and other car-related costs, while at the same time it can reduce traffic and pollution. This paper assesses the potential of ride-sharing for reducing traffic in a city, based on mobility data extracted from 3G Call Description Records (CDRs, for the cities of Barcelona and Madrid) and from Online Social Networks (Twitter, collected for the cities of New York and Los Angeles). We first analyze these data sets to understand mobility patterns, home and work locations, and social ties between users. We then develop an efficient algorithm for matching users with similar mobility patterns, considering a range of constraints. The solution provides an upper bound to the potential reduction of cars in a city that can be achieved by ride-sharing. We use our framework to understand the different constraints and city characteristics on this potential benefit. For example, our study shows that traffic in the city of Madrid can be reduced by 59% if users are willing to share a ride with people who live and work within 1 km; if they can only accept a pick-up and drop-off delay up to 10 minutes, this potential benefit drops to 24%; if drivers also pick up passengers along the way, this number increases to 53%. If users are willing to ride only with people they know ("friends" in the CDR and OSN data sets), the potential of ride-sharing becomes negligible; if they are willing to ride with friends of friends, the potential reduction is up to 31%.

Submitted 20 March, 2014; v1 submitted 16 May, 2013; originally announced May 2013.

Comments: 11 pages

arXiv:1305.0868 [pdf, other]

Precoding-Based Network Alignment For Three Unicast Sessions

Authors: Chun Meng, Abhik Kumar Das, Abinesh Ramakrishnan, Syed Ali Jafar, Athina Markopoulou, Sriram Vishwanath

Abstract: We consider the problem of network coding across three unicast sessions over a directed acyclic graph, where each sender and the receiver is connected to the network via a single edge of unit capacity. We consider a network model in which the middle of the network only performs random linear network coding, and restrict our approaches to precoding-based linear schemes, where the senders use precoding matrices to encode source symbols. We adapt a precoding-based interference alignment technique, originally developed for the wireless interference channel, to construct a precoding-based linear scheme, which we refer to as a precoding-based network alignment scheme (PBNA). A primary difference between this setting and the wireless interference channel is that the network topology can introduce dependencies between elements of the transfer matrix, which we refer to as coupling relations, and can potentially affect the achievable rate of PBNA. We identify all possible such coupling relations, and interpret these coupling relations in terms of network topology and present polynomial-time algorithms to check the presence of these coupling relations. Finally, we show that, depending on the coupling relations present in the network, the optimal symmetric rate achieved by precoding-based linear scheme can take only three possible values, all of which can be achieved by PBNA.

Submitted 21 May, 2014; v1 submitted 3 May, 2013; originally announced May 2013.

Comments: arXiv admin note: text overlap with arXiv:1202.3405

arXiv:1303.7197 [pdf, other]

Network Codes for Real-Time Applications

Authors: Anh Le, Arash S. Tehrani, Alexandros G. Dimakis, Athina Markopoulou

Abstract: We consider the scenario of broadcasting for real-time applications and loss recovery via instantly decodable network coding. Past work focused on minimizing the completion delay, which is not the right objective for real-time applications that have strict deadlines. In this work, we are interested in finding a code that is instantly decodable by the maximum number of users. First, we prove that this problem is NP-Hard in the general case. Then we consider the practical probabilistic scenario, where users have i.i.d. loss probability and the number of packets is linear or polynomial in the number of users. In this scenario, we provide a polynomial-time (in the number of users) algorithm that finds the optimal coded packet. The proposed algorithm is evaluated using both simulation and real network traces of a real-time Android application. Both results show that the proposed coding scheme significantly outperforms the state-of-the-art baselines: an optimal repetition code and a COPE-like greedy scheme.

Submitted 12 May, 2014; v1 submitted 28 March, 2013; originally announced March 2013.

Comments: ToN 2013 Submission Version

arXiv:1212.2310 [pdf, ps, other]

doi 10.1109/TSP.2014.2304431

Active Learning of Multiple Source Multiple Destination Topologies

Authors: Pegah Sattari, Maciej Kurant, Animashree Anandkumar, Athina Markopoulou, Michael Rabbat

Abstract: We consider the problem of inferring the topology of a network with $M$ sources and $N$ receivers (hereafter referred to as an $M$-by-$N$ network), by sending probes between the sources and receivers. Prior work has shown that this problem can be decomposed into two parts: first, infer smaller subnetwork components (i.e., $1$-by-$N$'s or $2$-by-$2$'s) and then merge these components to identify the $M$-by-$N$ topology. In this paper, we focus on the second part, which had previously received less attention in the literature. In particular, we assume that a $1$-by-$N$ topology is given and that all $2$-by-$2$ components can be queried and learned using end-to-end probes. The problem is which $2$-by-$2$'s to query and how to merge them with the given $1$-by-$N$, so as to exactly identify the $2$-by-$N$ topology, and optimize a number of performance metrics, including the number of queries (which directly translates into measurement bandwidth), time complexity, and memory usage. We provide a lower bound, $\lceil \frac{N}{2} \rceil$, on the number of $2$-by-$2$'s required by any active learning algorithm and propose two greedy algorithms. The first algorithm follows the framework of multiple hypothesis testing, in particular Generalized Binary Search (GBS), since our problem is one of active learning, from $2$-by-$2$ queries. The second algorithm is called the Receiver Elimination Algorithm (REA) and follows a bottom-up approach: at every step, it selects two receivers, queries the corresponding $2$-by-$2$, and merges it with the given $1$-by-$N$; it requires exactly $N-1$ steps, which is much less than all $\binom{N}{2}$ possible $2$-by-$2$'s. Simulation results over synthetic and realistic topologies demonstrate that both algorithms correctly identify the $2$-by-$N$ topology and are near-optimal, but REA is more efficient in practice.

Submitted 19 March, 2014; v1 submitted 11 December, 2012; originally announced December 2012.

Journal ref: IEEE Transactions on Signal Processing, Vol. 62, Issue 8, pp. 1926-1937, April 2014

arXiv:1211.4206 [pdf, other]

doi 10.1109/TMM.2013.2241415

Network Coding Meets Multimedia: a Review

Authors: Enrico Magli, Mea Wang, Pascal Frossard, Athina Markopoulou

Abstract: While every network node only relays messages in a traditional communication system, the recent network coding (NC) paradigm proposes to implement simple in-network processing with packet combinations in the nodes. NC extends the concept of "encoding" a message beyond source coding (for compression) and channel coding (for protection against errors and losses). It has been shown to increase network throughput compared to traditional network implementations, to reduce delay and to provide robustness to transmission errors and network dynamics. These features are so appealing for multimedia applications that they have spurred a large research effort towards the development of multimedia-specific NC techniques. This paper reviews the recent work in NC for multimedia applications and focuses on the techniques that fill the gap between NC theory and practical applications. It outlines the benefits of NC and presents the open challenges in this area. The paper initially focuses on multimedia-specific aspects of network coding, in particular delay, in-network error control, and media-specific error control. These aspects make it possible to handle varying network conditions as well as client heterogeneity, which are critical to the design and deployment of multimedia systems. After introducing these general concepts, the paper reviews in detail two applications that lend themselves naturally to NC via the cooperation and broadcast models, namely peer-to-peer multimedia streaming and wireless networking.

Submitted 18 November, 2012; originally announced November 2012.

Comments: Part of this work is under publication in IEEE Transactions on Multimedia

arXiv:1210.0460 [pdf, ps, other]

Graph Size Estimation

Authors: Maciej Kurant, Carter T. Butts, Athina Markopoulou

Abstract: Many online networks are not fully known and are often studied via sampling. Random Walk (RW) based techniques are the current state-of-the-art for estimating nodal attributes and local graph properties, but estimating global properties remains a challenge. In this paper, we are interested in a fundamental property of this type - the graph size N, i.e., the number of its nodes. Existing methods for estimating N (i) are inefficient and (ii) cannot be easily used with RW sampling due to dependence between successive samples. In this paper, we address both problems. First, we propose IE (Induced Edges), an efficient technique for estimating N from an independence sample of the graph's nodes. IE exploits the edges induced on the sampled nodes. Second, we introduce SafetyMargin, a method that corrects estimators for dependence in RW samples. Finally, we combine these two stand-alone techniques to obtain a RW-based graph size estimator. We evaluate our approach in simulations on a wide range of real-life topologies, and on several samples of Facebook. IE with SafetyMargin typically requires at least 10 times fewer samples than the state-of-the-art techniques (over 100 times in the case of Facebook) for the same estimation error.

Submitted 1 October, 2012; originally announced October 2012.

arXiv:1208.3667 [pdf, ps, other]

2.5K-Graphs: from Sampling to Generation

Authors: Minas Gjoka, Maciej Kurant, Athina Markopoulou

Abstract: Understanding network structure and having access to realistic graphs plays a central role in computer and social networks research. In this paper, we propose a complete and practical methodology for generating graphs that resemble a real graph of interest. The metrics of the original topology we target to match are the joint degree distribution (JDD) and the degree-dependent average clustering coefficient ($\bar{c}(k)$). We start by developing efficient estimators for these two metrics based on a node sample collected via either independence sampling or random walks. Then, we process the output of the estimators to ensure that the target properties are realizable. Finally, we propose an efficient algorithm for generating topologies that have the exact target JDD and a $\bar{c}(k)$ close to the target. Extensive simulations using real-life graphs show that the graphs generated by our methodology are similar to the original graph with respect to not only the two target metrics, but also a wide range of other topological metrics; furthermore, our generator is orders of magnitude faster than state-of-the-art techniques.

Submitted 17 August, 2012; originally announced August 2012.

arXiv:1203.1730 [pdf, other]

Auditing for Distributed Storage Systems

Authors: Anh Le, Athina Markopoulou, Alexandros G. Dimakis

Abstract: Distributed storage codes have recently received a lot of attention in the community. Independently, another body of work has proposed integrity checking schemes for cloud storage, none of which, however, is customized for coding-based storage or can efficiently support repair. In this work, we bridge the gap between these two currently disconnected bodies of work. We propose NC-Audit, a novel cryptography-based remote data integrity checking scheme, designed specifically for network coding-based distributed storage systems. NC-Audit combines, for the first time, the following desired properties: (i) efficient checking of data integrity, (ii) efficient support for repairing failed nodes, and (iii) protection against information leakage when checking is performed by a third party. The key ingredient of the design of NC-Audit is a novel combination of SpaceMac, a homomorphic message authentication code (MAC) scheme for network coding, and NCrypt, a novel chosen-plaintext attack (CPA) secure encryption scheme that is compatible with SpaceMac. Our evaluation of a Java implementation of NC-Audit shows that an audit costs the storage node and the auditor a modest amount of computation time and lower bandwidth than prior work.

Submitted 12 May, 2014; v1 submitted 8 March, 2012; originally announced March 2012.

Comments: ToN 2014 Submission with Data Dynamics

arXiv:1202.3405 [pdf, other]

On the Feasibility of Precoding-Based Network Alignment for Three Unicast Sessions

Authors: Chun Meng, Abinesh Ramakrishnan, Athina Markopoulou, Syed Ali Jafar

Abstract: We consider the problem of network coding across three unicast sessions over a directed acyclic graph, when each session has min-cut one. Previous work by Das et al. adapted a precoding-based interference alignment technique, originally developed for the wireless interference channel, specifically to this problem. We refer to this approach as precoding-based network alignment (PBNA). Similar to the wireless setting, PBNA asymptotically achieves half the minimum cut; different from the wireless setting, its feasibility depends on the graph structure. Das et al. provided a set of feasibility conditions for PBNA with respect to a particular precoding matrix. However, the set consisted of an infinite number of conditions, which is impossible to check in practice. Furthermore, the conditions were purely algebraic, without interpretation with regard to the graph structure. In this paper, we first prove that the set of conditions provided by Das et al. is also necessary for the feasibility of PBNA with respect to any precoding matrix. Then, using two graph-related properties and a degree-counting technique, we reduce the set to just four conditions. This reduction enables an efficient algorithm for checking the feasibility of PBNA on a given graph.

Submitted 18 May, 2012; v1 submitted 15 February, 2012; originally announced February 2012.

Comments: Technical report for ISIT 2012, 10 pages, 6 figures

arXiv:1108.0377 [pdf, other]

On Detecting Pollution Attacks in Inter-Session Network Coding

Authors: Anh Le, Athina Markopoulou

Abstract: Dealing with pollution attacks in inter-session network coding is challenging due to the fact that sources, in addition to intermediate nodes, can be malicious. In this work, we precisely define corrupted packets in inter-session pollution based on the commitment of the source packets. We then propose three detection schemes: one hash-based scheme and two MAC-based schemes, InterMacCPK and SpaceMacPM. InterMacCPK is the first multi-source homomorphic MAC scheme that supports multiple keys. Both MAC schemes can replace traditional MACs, e.g., HMAC, in networks that employ inter-session coding. All three schemes provide in-network detection, are collusion-resistant, and have very low online bandwidth and computation overhead.

Submitted 1 August, 2011; originally announced August 2011.

arXiv:1105.5488 [pdf, ps, other]

Coarse-Grained Topology Estimation via Graph Sampling

Authors: Maciej Kurant, Minas Gjoka, Yan Wang, Zack W. Almquist, Carter T. Butts, Athina Markopoulou

Abstract: Many online networks are measured and studied via sampling techniques, which typically collect a relatively small fraction of nodes and their associated edges. Past work in this area has primarily focused on obtaining a representative sample of nodes and on efficient estimation of local graph properties (such as node degree distribution or any node attribute) based on that sample. However, less is known about estimating the global topology of the underlying graph. In this paper, we show how to efficiently estimate the coarse-grained topology of a graph from a probability sample of nodes. In particular, we consider that nodes are partitioned into categories (e.g., countries or work/study places in OSNs), which naturally defines a weighted category graph. We are interested in estimating (i) the size of categories and (ii) the probability that nodes from two different categories are connected. For each of the above, we develop a family of estimators for design-based inference under uniform or non-uniform sampling, employing either of two measurement strategies: induced subgraph sampling, which relies only on information about the sampled nodes; and star sampling, which also exploits category information about the neighbors of sampled nodes. We prove consistency of these estimators and evaluate their efficiency via simulation on fully known graphs. We also apply our methodology to a sample of Facebook users to obtain a number of category graphs, such as the college friendship graph and the country friendship graph; we share and visualize the resulting data at www.geosocialmap.com.

Submitted 27 May, 2011; originally announced May 2011.

arXiv:1102.4599 [pdf, ps, other]

Towards Unbiased BFS Sampling

Authors: Maciej Kurant, Athina Markopoulou, Patrick Thiran

Abstract: Breadth First Search (BFS) is a widely used approach for sampling large unknown Internet topologies. Its main advantage over random walks and other exploration techniques is that a BFS sample is a plausible graph on its own, and therefore we can study its topological characteristics. However, it has been empirically observed that incomplete BFS is biased toward high-degree nodes, which may strongly affect the measurements. In this paper, we first analytically quantify the degree bias of BFS sampling. In particular, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction f of covered nodes, in a random graph RG(pk) with an arbitrary degree distribution pk. We also show that, for RG(pk), all commonly used graph traversal techniques (BFS, DFS, Forest Fire, Snowball Sampling, RDS) suffer from exactly the same bias. Next, based on our theoretical analysis, we propose a practical BFS-bias correction procedure. It takes as input a collected BFS sample together with its fraction f. Even though RG(pk) does not capture many graph properties common in real-life graphs (such as assortativity), our RG(pk)-based correction technique performs well on a broad range of Internet topologies and on two large BFS samples of Facebook and Orkut networks. Finally, we consider and evaluate a family of alternative correction procedures, and demonstrate that, although they are unbiased for an arbitrary topology, their large variance makes them far less effective than the RG(pk)-based technique.

Submitted 22 February, 2011; originally announced February 2011.

Comments: BFS, RDS, graph traversal, sampling bias correction

Journal ref: arXiv:1004.1729, 2010

arXiv:1102.3504 [pdf, other]

Cooperative Defense against Pollution Attacks in Network Coding Using SpaceMac

Authors: Anh Le, Athina Markopoulou

Abstract: Intra-session network coding is known to be vulnerable to pollution attacks. In this work, first, we introduce a novel homomorphic MAC scheme called SpaceMac, which allows an intermediate node to verify if its received packets belong to a specific subspace, even if the subspace is expanding over time. Then, we use SpaceMac as a building block to design a cooperative scheme that provides complete defense against pollution attacks: (i) it can detect polluted packets early at intermediate nodes and (ii) it can identify the exact location of all, even colluding, attackers, thus making it possible to eliminate them. Our scheme is cooperative: parents and children of any node cooperate to detect any corrupted packets sent by the node, and nodes in the network cooperate with a central controller to identify the exact location of all attackers. We implement SpaceMac in both C/C++ and Java as a library, and we make the library available online. Our evaluation on both a PC and an Android device shows that (i) SpaceMac's algorithms can be computed quickly and efficiently, and (ii) our cooperative defense scheme has low computation and significantly lower communication overhead than other comparable state-of-the-art schemes.

Submitted 15 September, 2011; v1 submitted 17 February, 2011; originally announced February 2011.

Comments: This is an extended version of a short version to appear in IEEE JSAC on Cooperative Networking - Challenges and Applications 2011

arXiv:1101.5463 [pdf, ps, other]

Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks

Authors: M. Kurant, M. Gjoka, C. T. Butts, A. Markopoulou

Abstract: Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal, while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation, and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13-15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.

Submitted 27 March, 2011; v1 submitted 28 January, 2011; originally announced January 2011.

Comments: To appear in SIGMETRICS 2011

arXiv:1009.2275 [pdf, other]

doi 10.1109/INFCOM.2011.5934995

PhishDef: URL Names Say It All

Authors: Anh Le, Athina Markopoulou, Michalis Faloutsos

Abstract: Phishing is an increasingly sophisticated method to steal personal user information using sites that pretend to be legitimate. In this paper, we take the following steps to identify phishing URLs. First, we carefully select lexical features of the URLs that are resistant to obfuscation techniques used by attackers. Second, we evaluate the classification accuracy when using only lexical features, both automatically and hand-selected, vs. when using additional features. We show that lexical features are sufficient for all practical purposes. Third, we thoroughly compare several classification algorithms, and we propose to use an online method (AROW) that is able to overcome noisy training data. Based on the insights gained from our analysis, we propose PhishDef, a phishing detection system that uses only URL names and combines the above three elements. PhishDef is a highly accurate method (when compared to state-of-the-art approaches over real datasets), lightweight (thus appropriate for online and client-side deployment), proactive (based on online classification rather than blacklists), and resilient to training data inaccuracies (thus enabling the use of large noisy training data).

Submitted 12 September, 2010; originally announced September 2010.

Comments: 9 pages, submitted to IEEE INFOCOM 2011

arXiv:1008.5217 [pdf, ps, other]

Intra- and Inter-Session Network Coding in Wireless Networks

Authors: Hulya Seferoglu, Athina Markopoulou, K. K. Ramakrishnan

Abstract: In this paper, we are interested in improving the performance of constructive network coding schemes in lossy wireless environments. We propose I2NC - a cross-layer approach that combines inter-session and intra-session network coding and has two strengths. First, the error-correcting capabilities of intra-session network coding make our scheme resilient to loss. Second, redundancy allows intermediate nodes to operate without knowledge of the decoding buffers of their neighbors. Based only on the knowledge of the loss rates on the direct and overhearing links, intermediate nodes can make decisions for both intra-session (i.e., how much redundancy to add in each flow) and inter-session (i.e., what percentage of flows to code together) coding. Our approach is grounded on a network utility maximization (NUM) formulation of the problem. We propose two practical schemes, I2NC-state and I2NC-stateless, which mimic the structure of the NUM optimal solution. We also address the interaction of our approach with the transport layer. We demonstrate the benefits of our schemes through simulations.

Submitted 23 February, 2012; v1 submitted 31 August, 2010; originally announced August 2010.

arXiv:1008.2565 [pdf, ps, other]

Multigraph Sampling of Online Social Networks

Authors: Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou

Abstract: State-of-the-art techniques for probability sampling of users of online social networks (OSNs) are based on random walks on a single social relation (typically friendship). While powerful, these methods rely on the social graph being fully connected. Furthermore, the mixing time of the sampling process strongly depends on the characteristics of this graph. In this paper, we observe that there often exist other relations between OSN users, such as membership in the same group or participation in the same event. We propose to exploit the graphs these relations induce, by performing a random walk on their union multigraph. We design a computationally efficient way to perform multigraph sampling by randomly selecting the graph on which to walk at each iteration. We demonstrate the benefits of our approach through (i) simulation in synthetic graphs, and (ii) measurements of Last.fm - an Internet website for music with social networking features. More specifically, we show that multigraph sampling can obtain a representative sample and faster convergence, even when the individual graphs fail, i.e., are disconnected or highly clustered.

Submitted 16 June, 2011; v1 submitted 15 August, 2010; originally announced August 2010.

Comments: IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on Measurement of Internet Topologies, 2011

arXiv:1008.0235 [pdf, ps, other]

Network Coding for Multiple Unicasts: An Interference Alignment Approach

Authors: Abhik Das, Sriram Vishwanath, Syed Jafar, Athina Markopoulou

Abstract: This paper considers the problem of network coding for multiple unicast connections in networks represented by directed acyclic graphs. The concept of interference alignment, traditionally used in interference networks, is extended to analyze the performance of linear network coding in this setup and to provide a systematic code design approach. It is shown that, for a broad class of three-source three-destination unicast networks, a rate corresponding to half the individual source-destination min-cut is achievable via alignment strategies.

Submitted 2 August, 2010; originally announced August 2010.

Comments: 5 pages, appeared in ISIT 2010

arXiv:1007.3336 [pdf, ps, other]

doi 10.1016/j.phycom.2012.02.006

Active Topology Inference using Network Coding

Authors: Pegah Sattari, Christina Fragouli, Athina Markopoulou

Abstract: Our goal is to infer the topology of a network when (i) we can send probes between sources and receivers at the edge of the network and (ii) intermediate nodes can perform simple network coding operations, i.e., additions. Our key intuition is that network coding introduces topology-dependent correlation in the observations at the receivers, which can be exploited to infer the topology. For undirected tree topologies, we design hierarchical clustering algorithms, building on our prior work. For directed acyclic graphs (DAGs), first we decompose the topology into a number of two-source, two-receiver (2-by-2) subnetwork components and then we merge these components to reconstruct the topology. Our approach for DAGs builds on prior work on tomography, and improves upon it by employing network coding to accurately distinguish among all different 2-by-2 components. We evaluate our algorithms through simulation of a number of realistic topologies and compare them to active tomographic techniques without network coding. We also make connections between our approach and alternatives, including passive inference, traceroute, and packet marking.

Submitted 6 February, 2013; v1 submitted 20 July, 2010; originally announced July 2010.

arXiv:1006.1165 [pdf, ps, other]

Optimal Source-Based Filtering of Malicious Traffic

Authors: Fabio Soldo, Katerina Argyraki, Athina Markopoulou

Abstract: In this paper, we consider the problem of blocking malicious traffic on the Internet, via source-based filtering. In particular, we consider filtering via access control lists (ACLs): these are already available at the routers today but are a scarce resource because they are stored in the expensive ternary content addressable memory (TCAM). Aggregation (by filtering source prefixes instead of individual IP addresses) helps reduce the number of filters, but comes also at the cost of blocking legitimate traffic originating from the filtered prefixes. We show how to optimally choose which source prefixes to filter, for a variety of realistic attack scenarios and operators' policies. In each scenario, we design optimal, yet computationally efficient, algorithms. Using logs from Dshield.org, we evaluate the algorithms and demonstrate that they bring significant benefit in practice.

Submitted 6 June, 2010; originally announced June 2010.

Comments: Conference version appeared in Infocom 2009. Journal version submitted to ToN

arXiv:1005.4769 [pdf, ps, other]

doi 10.1109/TIT.2012.2236916

A Network Coding Approach to Loss Tomography

Authors: Pegah Sattari, Athina Markopoulou, Christina Fragouli, Minas Gjoka

Abstract: Network tomography aims at inferring internal network characteristics based on measurements at the edge of the network. In loss tomography, in particular, the characteristic of interest is the loss rate of individual links and multicast and/or unicast end-to-end probes are typically used. Independently, recent advances in network coding have shown that there are advantages from allowing intermediate nodes to process and combine, in addition to just forward, packets. In this paper, we study the problem of loss tomography in networks with network coding capabilities. We design a framework for estimating link loss rates, which leverages network coding capabilities, and we show that it improves several aspects of tomography including the identifiability of links, the trade-off between estimation accuracy and bandwidth efficiency, and the complexity of probe path selection. We discuss the cases of inferring link loss rates in a tree topology and in a general topology. In the latter case, the benefits of our approach are even more pronounced compared to standard techniques, but we also face novel challenges, such as dealing with cycles and multiple paths between sources and receivers. Overall, this work makes the connection between active network tomography and network coding.

Submitted 9 February, 2013; v1 submitted 26 May, 2010; originally announced May 2010.

arXiv:1004.1729 [pdf, ps, other]

On the bias of BFS

Authors: Maciej Kurant, Athina Markopoulou, Patrick Thiran

Abstract: Breadth First Search (BFS) and other graph traversal techniques are widely used for measuring large unknown graphs, such as online social networks. It has been empirically observed that an incomplete BFS is biased toward high degree nodes. In contrast to more studied sampling techniques, such as random walks, the precise bias of BFS has not been characterized to date. In this paper, we quantify the degree bias of BFS sampling. In particular, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction of covered nodes, in a random graph $RG(p_k)$ with a given degree distribution $p_k$. Furthermore, we also show that, for $RG(p_k)$, all commonly used graph traversal techniques (BFS, DFS, Forest Fire, and Snowball Sampling) lead to the same bias, and we show how to correct for this bias. To give a broader perspective, we compare this class of exploration techniques to random walks that are well-studied and easier to analyze. Next, we study by simulation the effect of graph properties not captured directly by our model. We find that the bias gets amplified in graphs with strong positive assortativity. Finally, we demonstrate the above results by sampling the Facebook social network, and we provide some practical guidelines for graph sampling in practice.

Submitted 10 April, 2010; originally announced April 2010.

Comments: 9 pages

Journal ref: International Teletraffic Congress (ITC 22), 2010

arXiv:1002.4885 [pdf, ps, other]

Network Coding-Aware Queue Management for TCP Flows over Coded Wireless Networks

Authors: Hulya Seferoglu, Athina Markopoulou

Abstract: We are interested in unicast traffic over wireless networks that employ constructive inter-session network coding, including single-hop and multi-hop schemes. In this setting, TCP flows do not fully exploit the network coding opportunities due to their bursty behavior and due to the fact that TCP is agnostic to the underlying network coding. In order to improve the performance of TCP flows over coded wireless networks, we take the following steps. First, we formulate the problem as network utility maximization and we present a distributed solution. Second, mimicking the structure of the optimal solution, we propose a "network-coding aware" queue management scheme (NCAQM) at intermediate nodes; we make no changes to TCP or to the MAC protocol (802.11). We demonstrate, via simulation, that NCAQM significantly improves TCP performance compared to TCP over baseline schemes.

Submitted 3 October, 2010; v1 submitted 26 February, 2010; originally announced February 2010.

arXiv:0908.2007 [pdf, other]

Predictive Blacklisting as an Implicit Recommendation System

Authors: Fabio Soldo, Anh Le, Athina Markopoulou

Abstract: A widely used defense practice against malicious traffic on the Internet is through blacklists: lists of prolific attack sources are compiled and shared. The goal of blacklists is to predict and block future attack sources. Existing blacklisting techniques have focused on the most prolific attack sources and, more recently, on collaborative blacklisting. In this paper, we formulate the problem of forecasting attack sources (also referred to as predictive blacklisting) based on shared attack logs as an implicit recommendation system. We compare the performance of existing approaches against the upper bound for prediction, and we demonstrate that there is much room for improvement. Inspired by the recent Netflix competition, we propose a multi-level prediction model that is adjusted and tuned specifically for the attack forecasting problem. Our model captures and combines various factors, namely: attacker-victim history (using time-series) and attackers and/or victims interactions (using neighborhood models). We evaluate our combined method on one month of logs from Dshield.org and demonstrate that it significantly improves on the state of the art.

Submitted 13 August, 2009; originally announced August 2009.

Comments: 11 pages; Submitted to INFOCOM 2010

arXiv:0906.0060 [pdf, ps, other]

A Walk in Facebook: Uniform Sampling of Users in Online Social Networks

Authors: Minas Gjoka, Maciej Kurant, Carter T. Butts, Athina Markopoulou

Abstract: Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows one to estimate any user property and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the "ground truth." In contrast, using Breadth-First-Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, and in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and we collect the first, to the best of our knowledge, representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.

Submitted 17 June, 2011; v1 submitted 30 May, 2009; originally announced June 2009.

Comments: published in IEEE INFOCOM '10; IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on Measurement of Internet Topologies, 2011 under the title "Practical Recommendations on Crawling Online Social Networks"

arXiv:0811.3828 [pdf, other]

Optimal Filtering of Malicious IP Sources

Authors: Fabio Soldo, Athina Markopoulou, Katerina Argyraki

Abstract: How can we protect the network infrastructure from malicious traffic, such as scanning, malicious code propagation, and distributed denial-of-service (DDoS) attacks? One mechanism for blocking malicious traffic is filtering: access control lists (ACLs) can selectively block traffic based on fields of the IP header. Filters (ACLs) are already available in the routers today but are a scarce resource because they are stored in the expensive ternary content addressable memory (TCAM). In this paper, we develop, for the first time, a framework for studying filter selection as a resource allocation problem. Within this framework, we study five practical cases of source address/prefix filtering, which correspond to different attack scenarios and operator's policies. We show that filter selection optimization leads to novel variations of the multidimensional knapsack problem and we design optimal, yet computationally efficient, algorithms to solve them. We also evaluate our approach using data from Dshield.org and demonstrate that it brings significant benefits in practice. Our set of algorithms is a building block that can be immediately used by operators and manufacturers to block malicious traffic in a cost-efficient way.

Submitted 24 November, 2008; originally announced November 2008.

Comments: submitted to Infocom 09

Showing 1–50 of 53 results for author: Markopoulou, A