-
FreqyWM: Frequency Watermarking for the New Data Economy
Authors:
Devriş İşler,
Elisa Cabana,
Alvaro Garcia-Recuero,
Georgia Koutrika,
Nikolaos Laoutaris
Abstract:
We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark that can be used to protect ownership rights upon data. We develop optimal as well as fast heuristic algorithms for creating and verifying such watermarks. We also demonstrate the robustness of our technique against various attacks and derive analytical bounds for the false positive probability of erroneously detecting a watermark on a dataset that does not carry it. Our technique is applicable to both single dimensional and multidimensional datasets, is independent of token type, allows for a fine control of the introduced distortion, and can be used in a variety of use cases that involve buying and selling data in contemporary data marketplaces.
Submitted 27 December, 2023;
originally announced December 2023.
-
Securing Federated Sensitive Topic Classification against Poisoning Attacks
Authors:
Tianyue Chu,
Alvaro Garcia-Recuero,
Costas Iordanou,
Georgios Smaragdakis,
Nikolaos Laoutaris
Abstract:
We present a Federated Learning (FL) based solution for building a distributed classifier capable of detecting URLs containing GDPR-sensitive content related to categories such as health, sexual preference, political beliefs, etc. Although such a classifier addresses the limitations of previous offline/centralised classifiers, it is still vulnerable to poisoning attacks from malicious users that may attempt to reduce the accuracy for benign users by disseminating faulty model updates. To guard against this, we develop a robust aggregation scheme based on subjective logic and residual-based attack detection. Employing a combination of theoretical analysis, trace-driven simulation, as well as experimental validation with a prototype and real users, we show that our classifier can detect sensitive content with high accuracy, learn new labels fast, and remain robust in view of poisoning attacks from malicious users, as well as imperfect input from non-malicious ones.
Submitted 28 October, 2022; v1 submitted 31 January, 2022;
originally announced January 2022.
-
Approximate Privacy-Preserving Neighbourhood Estimations
Authors:
Alvaro Garcia-Recuero
Abstract:
Anonymous social networks present a number of new and challenging problems for existing Social Network Analysis techniques. Traditionally, methods for analysing graph structure, such as community detection, have required global knowledge of the graph structure. That implies that a centralised entity must be given access to the edge list of each node in the graph. This is impossible for anonymous social networks and other settings where privacy is valued by participants. In addition, using their graph structure inputs for learning tasks defeats the purpose of anonymity. In this work, we hypothesise that one can repurpose the HyperANF (a.k.a. HyperBall) algorithm -- intended for approximate diameter estimation -- to the task of privacy-preserving community detection for friend-recommendation systems that learn from an anonymous representation of the social network graph structure with limited privacy impact. This is possible because the core data structure maintained by HyperBall is a HyperLogLog with a counter of the number of reachable neighbours from a given node. Exchanging this data structure in future decentralised learning deployments gives away no information about the neighbours of the node and therefore preserves the privacy of the graph structure.
Submitted 19 June, 2021; v1 submitted 24 February, 2021;
originally announced February 2021.
-
Trollslayer: Crowdsourcing and Characterization of Abusive Birds in Twitter
Authors:
Alvaro Garcia-Recuero,
Aneta Morawin,
Gareth Tyson
Abstract:
As of today, abuse is a pressing issue for participants and administrators of Online Social Networks (OSNs). Abuse on Twitter can stem from arguments generated for influencing outcomes of a political election, the use of bots to automatically spread misinformation, and, generally speaking, activities that deny, disrupt, degrade or deceive other participants and/or the network. Given the difficulty in finding and accessing a large enough sample of abuse ground truth from the Twitter platform, we built and deployed a custom crawler that we use to judiciously collect a new dataset from the Twitter platform with the aim of characterizing the nature of abusive users, a.k.a. abusive birds, in the wild. We provide a comprehensive set of features based on users' attributes, as well as social-graph metadata. The former includes metadata about the account itself, while the latter is computed from the social graph among the sender and the receiver of each message. Attribute-based features are useful to characterize users' accounts in OSNs, while graph-based features can reveal the dynamics of information dissemination across the network. In particular, we derive the Jaccard index as a key feature to reveal the benign or malicious nature of directed messages on Twitter. To the best of our knowledge, we are the first to propose such a similarity metric to characterize abuse on Twitter.
Submitted 14 December, 2018;
originally announced December 2018.
-
Movie Pirates of the Caribbean: Exploring Illegal Streaming Cyberlockers
Authors:
Damilola Ibosiola,
Benjamin Steer,
Alvaro Garcia-Recuero,
Gianluca Stringhini,
Steve Uhlig,
Gareth Tyson
Abstract:
Online video piracy (OVP) is a contentious topic, with strong proponents on both sides of the argument. Recently, a number of illegal websites, called streaming cyberlockers, have begun to dominate OVP. These websites specialise in distributing pirated content, underpinned by third-party indexing services offering easy-to-access directories of content. This paper performs the first exploration of this new ecosystem. It characterises the content, as well as the streaming cyberlockers' individual attributes. We find a remarkably centralised system with just a few networks, countries and cyberlockers underpinning most provisioning. We also investigate the actions of copyright enforcers. We find they tend to target small subsets of the ecosystem, although they appear quite successful: 84% of copyright notices see content removed.
Submitted 8 April, 2018;
originally announced April 2018.
-
On the energy efficiency of client-centric data consistency management under random read/write access to Big Data with Apache HBase
Authors:
Álvaro García-Recuero
Abstract:
The total estimated energy bill for data centers in 2010 was $11.5 billion, and experts estimate that the energy cost of a typical data center doubles every five years. On the other hand, computational developments have started to lag behind storage advancements, thereby becoming a future bottleneck for the ongoing data growth, which already approaches Exascale levels. We investigate the relationship between data throughput and energy footprint on a large storage cluster, with the goal of formalizing it as a metric that reflects the trade-off between consistency and energy. Employing a client-centric consistency approach, and while honouring ACID properties of the chosen columnar store for the case study (Apache HBase), we present the factors involved in the energy consumption of the system as well as lessons learned to underpin further design of energy-efficient cluster-scale storage systems.
Submitted 30 November, 2016; v1 submitted 9 September, 2015;
originally announced September 2015.
-
Quality-of-Data for Consistency Levels in Geo-replicated Cloud Data Stores
Authors:
Álvaro García-Recuero,
Sérgio Esteves,
Luís Veiga
Abstract:
Cloud computing has recently emerged as a key technology to provide individuals and companies with access to remote computing and storage infrastructures. In order to achieve highly available yet high-performing services, cloud data stores rely on data replication. However, providing replication brings with it the issue of consistency. Given that data are replicated in multiple geographically distributed data centers, and to meet the increasing requirements of distributed applications, many cloud data stores adopt eventual consistency and therefore allow data-intensive operations to run with low latency. This comes at the cost of data staleness. In this paper, we prioritize data replication based on a set of flexible data semantics that can best suit all types of Big Data applications, avoiding overloading both network and systems during large periods of disconnection or partitions in the network. Therefore, we integrated these data semantics into the core architecture of a well-known NoSQL data store (e.g., HBase), which leverages a three-dimensional vector-field model (regarding timeliness, number of pending updates and divergence bounds) to provision data selectively in an on-demand fashion to applications. This enhances the former consistency model by providing a number of required levels of consistency to different applications, such as social networks or e-commerce sites, where the priority of updates also differs. In addition, our implementation of the model in HBase allows updates to be tagged and grouped atomically in logical batches, akin to transactions, ensuring atomic changes and correctness of updates as they are propagated.
Submitted 12 October, 2014;
originally announced October 2014.