-
Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?
Authors:
Ileana Montoya Perez,
Parisa Movahedi,
Valtteri Nieminen,
Antti Airola,
Tapio Pahikkala
Abstract:
Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off.
Object…
▽ More
Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off.
Objectives: The aim of this study is to evaluate the Mann-Whitney U test on DP-synthetic biomedical data in terms of Type I and Type II errors, in order to establish whether statistical hypothesis testing performed on privacy preserving synthetic data is likely to lead to loss of test's validity or decreased power.
Methods: We evaluate the Mann-Whitney U test on DP-synthetic data generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as on data drawn from two Gaussian distributions. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms.
Conclusion: Most of the tested DP-synthetic data generation methods showed inflated Type I error, especially at privacy budget levels of $ε\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget ($ε\geq 5$) in order to have reasonable Type II error levels.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
Locally Differentially Private Embedding Models in Distributed Fraud Prevention Systems
Authors:
Iker Perez,
Jason Wong,
Piotr Skalski,
Stuart Burrell,
Richard Mortier,
Derek McAuley,
David Sutton
Abstract:
Global financial crime activity is driving demand for machine learning solutions in fraud prevention. However, prevention systems are commonly serviced to financial institutions in isolation, and few provisions exist for data sharing due to fears of unintentional leaks and adversarial attacks. Collaborative learning advances in finance are rare, and it is hard to find real-world insights derived f…
▽ More
Global financial crime activity is driving demand for machine learning solutions in fraud prevention. However, prevention systems are commonly serviced to financial institutions in isolation, and few provisions exist for data sharing due to fears of unintentional leaks and adversarial attacks. Collaborative learning advances in finance are rare, and it is hard to find real-world insights derived from privacy-preserving data processing systems. In this paper, we present a collaborative deep learning framework for fraud prevention, designed from a privacy standpoint, and awarded at the recent PETs Prize Challenges. We leverage latent embedded representations of varied-length transaction sequences, along with local differential privacy, in order to construct a data release mechanism which can securely inform externally hosted fraud and anomaly detection models. We assess our contribution on two distributed data sets donated by large payment networks, and demonstrate robustness to popular inference-time attacks, along with utility-privacy trade-offs analogous to published work in alternative application domains.
△ Less
Submitted 3 January, 2024;
originally announced January 2024.
-
Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences
Authors:
Piotr Skalski,
David Sutton,
Stuart Burrell,
Iker Perez,
Jason Wong
Abstract:
Machine learning models underpin many modern financial systems for use cases such as fraud detection and churn prediction. Most are based on supervised learning with hand-engineered features, which relies heavily on the availability of labelled data. Large self-supervised generative models have shown tremendous success in natural language processing and computer vision, yet so far they haven't bee…
▽ More
Machine learning models underpin many modern financial systems for use cases such as fraud detection and churn prediction. Most are based on supervised learning with hand-engineered features, which relies heavily on the availability of labelled data. Large self-supervised generative models have shown tremendous success in natural language processing and computer vision, yet so far they haven't been adapted to multivariate time series of financial transactions. In this paper, we present a generative pretraining method that can be used to obtain contextualised embeddings of financial transactions. Benchmarks on public datasets demonstrate that it outperforms state-of-the-art self-supervised methods on a range of downstream tasks. We additionally perform large-scale pretraining of an embedding model using a corpus of data from 180 issuing banks containing 5.1 billion transactions and apply it to the card fraud detection problem on hold-out datasets. The embedding model significantly improves value detection rate at high precision thresholds and transfers well to out-of-domain distributions.
△ Less
Submitted 4 January, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Multimodal Wildland Fire Smoke Detection
Authors:
Siddhant Baldota,
Shreyas Anantha Ramaprasad,
Jaspreet Kaur Bhamra,
Shane Luna,
Ravi Ramachandra,
Eugene Zen,
Harrison Kim,
Daniel Crawl,
Ismael Perez,
Ilkay Altintas,
Garrison W. Cottrell,
Mai H. Nguyen
Abstract:
Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency…
▽ More
Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency to provide tools for effective wildfire management. Early detection of wildfires is essential to minimizing potentially catastrophic destruction. In this paper, we present our work on integrating multiple data sources in SmokeyNet, a deep learning model using spatio-temporal information to detect smoke from wildland fires. Camera image data is integrated with weather sensor measurements and processed by SmokeyNet to create a multimodal wildland fire smoke detection system. We present our results comparing performance in terms of both accuracy and time-to-detection for multimodal data vs. a single data source. With a time-to-detection of only a few minutes, SmokeyNet can serve as an automated early notification system, providing a useful tool in the fight against destructive wildfires.
△ Less
Submitted 28 December, 2022;
originally announced December 2022.
-
Towards a Dynamic Composability Approach for using Heterogeneous Systems in Remote Sensing
Authors:
Ilkay Altintas,
Ismael Perez,
Dmitry Mishin,
Adrien Trouillaud,
Christopher Irving,
John Graham,
Mahidhar Tatineni,
Thomas DeFanti,
Shawn Strande,
Larry Smarr,
Michael L. Norman
Abstract:
Influenced by the advances in data and computing, the scientific practice increasingly involves machine learning and artificial intelligence driven methods which requires specialized capabilities at the system-, science- and service-level in addition to the conventional large-capacity supercomputing approaches. The latest distributed architectures built around the composability of data-centric app…
▽ More
Influenced by the advances in data and computing, the scientific practice increasingly involves machine learning and artificial intelligence driven methods which requires specialized capabilities at the system-, science- and service-level in addition to the conventional large-capacity supercomputing approaches. The latest distributed architectures built around the composability of data-centric applications led to the emergence of a new ecosystem for container coordination and integration. However, there is still a divide between the application development pipelines of existing supercomputing environments, and these new dynamic environments that disaggregate fluid resource pools through accessible, portable and re-programmable interfaces. New approaches for dynamic composability of heterogeneous systems are needed to further advance the data-driven scientific practice for the purpose of more efficient computing and usable tools for specific scientific domains. In this paper, we present a novel approach for using composable systems in the intersection between scientific computing, artificial intelligence (AI), and remote sensing domain. We describe the architecture of a first working example of a composable infrastructure that federates Expanse, an NSF-funded supercomputer, with Nautilus, a Kubernetes-based GPU geo-distributed cluster. We also summarize a case study in wildfire modeling, that demonstrates the application of this new infrastructure in scientific workflows: a composed system that bridges the insights from edge sensing, AI and computing capabilities with a physics-driven simulation.
△ Less
Submitted 13 November, 2022;
originally announced November 2022.
-
Monitoring ROS2: from Requirements to Autonomous Robots
Authors:
Ivan Perez,
Anastasia Mavridou,
Tom Pressburger,
Alexander Will,
Patrick J. Martin
Abstract:
Runtime verification (RV) has the potential to enable the safe operation of safety-critical systems that are too complex to formally verify, such as Robot Operating System 2 (ROS2) applications. Writing correct monitors can itself be complex, and errors in the monitoring subsystem threaten the mission as a whole. This paper provides an overview of a formal approach to generating runtime monitors f…
▽ More
Runtime verification (RV) has the potential to enable the safe operation of safety-critical systems that are too complex to formally verify, such as Robot Operating System 2 (ROS2) applications. Writing correct monitors can itself be complex, and errors in the monitoring subsystem threaten the mission as a whole. This paper provides an overview of a formal approach to generating runtime monitors for autonomous robots from requirements written in a structured natural language. Our approach integrates the Formal Requirement Elicitation Tool (FRET) with Copilot, a runtime verification framework, through the Ogma integration tool. FRET is used to specify requirements with unambiguous semantics, which are then automatically translated into temporal logic formulae. Ogma generates monitor specifications from the FRET output, which are compiled into hard-real time C99. To facilitate integration of the monitors in ROS2, we have extended Ogma to generate ROS2 packages defining monitoring nodes, which run the monitors when new data becomes available, and publish the results of any violations. The goal of our approach is to treat the generated ROS2 packages as black boxes and integrate them into larger ROS2 systems with minimal effort.
△ Less
Submitted 28 September, 2022;
originally announced September 2022.
-
The Topological BERT: Transforming Attention into Topology for Natural Language Processing
Authors:
Ilan Perez,
Raphael Reinauer
Abstract:
In recent years, the introduction of the Transformer models sparked a revolution in natural language processing (NLP). BERT was one of the first text encoders using only the attention mechanism without any recurrent parts to achieve state-of-the-art results on many NLP tasks.
This paper introduces a text classifier using topological data analysis. We use BERT's attention maps transformed into at…
▽ More
In recent years, the introduction of the Transformer models sparked a revolution in natural language processing (NLP). BERT was one of the first text encoders using only the attention mechanism without any recurrent parts to achieve state-of-the-art results on many NLP tasks.
This paper introduces a text classifier using topological data analysis. We use BERT's attention maps transformed into attention graphs as the only input to that classifier. The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive. It performs comparably to the BERT baseline and outperforms it on some tasks.
Additionally, we propose a new method to reduce the number of BERT's attention heads considered by the topological classifier, which allows us to prune the number of heads from 144 down to as few as ten with no reduction in performance. Our work also shows that the topological model displays higher robustness against adversarial attacks than the original BERT model, which is maintained during the pruning process. To the best of our knowledge, this work is the first to confront topological-based models with adversarial attacks in the context of NLP.
△ Less
Submitted 30 June, 2022;
originally announced June 2022.
-
FIgLib & SmokeyNet: Dataset and Deep Learning Model for Real-Time Wildland Fire Smoke Detection
Authors:
Anshuman Dewangan,
Yash Pande,
Hans-Werner Braun,
Frank Vernon,
Ismael Perez,
Ilkay Altintas,
Garrison W. Cottrell,
Mai H. Nguyen
Abstract:
The size and frequency of wildland fires in the western United States have dramatically increased in recent years. On high-fire-risk days, a small fire ignition can rapidly grow and become out of control. Early detection of fire ignitions from initial smoke can assist the response to such fires before they become difficult to manage. Past deep learning approaches for wildfire smoke detection have…
▽ More
The size and frequency of wildland fires in the western United States have dramatically increased in recent years. On high-fire-risk days, a small fire ignition can rapidly grow and become out of control. Early detection of fire ignitions from initial smoke can assist the response to such fires before they become difficult to manage. Past deep learning approaches for wildfire smoke detection have suffered from small or unreliable datasets that make it difficult to extrapolate performance to real-world scenarios. In this work, we present the Fire Ignition Library (FIgLib), a publicly available dataset of nearly 25,000 labeled wildfire smoke images as seen from fixed-view cameras deployed in Southern California. We also introduce SmokeyNet, a novel deep learning architecture using spatiotemporal information from camera imagery for real-time wildfire smoke detection. When trained on the FIgLib dataset, SmokeyNet outperforms comparable baselines and rivals human performance. We hope that the availability of the FIgLib dataset and the SmokeyNet architecture will inspire further research into deep learning methods for wildfire smoke detection, leading to automated notification systems that reduce the time to wildfire response.
△ Less
Submitted 14 May, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Attribution of Predictive Uncertainties in Classification Models
Authors:
Iker Perez,
Piotr Skalski,
Alec Barns-Graham,
Jason Wong,
David Sutton
Abstract:
Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. V…
▽ More
Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.
△ Less
Submitted 8 June, 2022; v1 submitted 19 July, 2021;
originally announced July 2021.
-
Efficient Bypass in Mesh and Torus NoCs
Authors:
Iván Pérez,
Enrique Vallejo,
Ramón Beivide
Abstract:
Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering, reducing router delay. However, the conditions to use them requires completely empty buffers in the intermediate routers. This restricts the amount of flits that use the bypass pipeline especially at medium and high loads, i…
▽ More
Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering, reducing router delay. However, the conditions to use them requires completely empty buffers in the intermediate routers. This restricts the amount of flits that use the bypass pipeline especially at medium and high loads, increasing latency and power.
This paper presents NEBB, Non-Empty Buffer Bypass, a mechanism that allows to bypass flits even if the buffers to bypass are not empty. The mechanism applies to wormhole and virtual-cut-through, each of them with different advantages. NEBB-Hybrid is proposed to employ the best flow control in each situation. The mechanism is extended to torus topologies, using FBFC and shared buffers.
The proposals have been evaluated using Booksim, showing up to 75% reduction of the buffered flits for single-flit packets, which translates into latency and dynamic power reductions of up to 30% and 23% respectively. For bimodal traffic, these improvements are 20 and 21% respectively. Additionally, the bypass utilization is largely independent of the number of VCs when using shared buffers and very competitive with few private ones, allowing to simplify the allocation mechanisms.
△ Less
Submitted 10 December, 2020; v1 submitted 9 December, 2020;
originally announced December 2020.
-
From Requirements to Autonomous Flight: An Overview of the Monitoring ICAROUS Project
Authors:
Aaron Dutle,
César Muñoz,
Esther Conrad,
Alwyn Goodloe,
Laura Titolo,
Ivan Perez,
Swee Balachandran,
Dimitra Giannakopoulou,
Anastasia Mavridou,
Thomas Pressburger
Abstract:
The Independent Configurable Architecture for Reliable Operations of Unmanned Systems (ICAROUS) is a software architecture incorporating a set of algorithms to enable autonomous operations of unmanned aircraft applications. This paper provides an overview of Monitoring ICAROUS, a project whose objective is to provide a formal approach to generating runtime monitors for autonomous systems from requ…
▽ More
The Independent Configurable Architecture for Reliable Operations of Unmanned Systems (ICAROUS) is a software architecture incorporating a set of algorithms to enable autonomous operations of unmanned aircraft applications. This paper provides an overview of Monitoring ICAROUS, a project whose objective is to provide a formal approach to generating runtime monitors for autonomous systems from requirements written in a structured natural language. This approach integrates FRET, a formal requirement elicitation and authoring tool, and Copilot, a runtime verification framework. FRET is used to specify formal requirements in structured natural language. These requirements are translated into temporal logic formulae. Copilot is then used to generate executable runtime monitors from these temporal logic specifications. The generated monitors are directly integrated into ICAROUS to perform runtime verification during flight.
△ Less
Submitted 2 December, 2020;
originally announced December 2020.
-
BigO: A public health decision support system for measuring obesogenic behaviors of children in relation to their local environment
Authors:
Christos Diou,
Ioannis Sarafis,
Vasileios Papapanagiotou,
Leonidas Alagialoglou,
Irini Lekka,
Dimitrios Filos,
Leandros Stefanopoulos,
Vasileios Kilintzis,
Christos Maramis,
Youla Karavidopoulou,
Nikos Maglaveras,
Ioannis Ioakimidis,
Evangelia Charmandari,
Penio Kassari,
Athanasia Tragomalou,
Monica Mars,
Thien-An Ngoc Nguyen,
Tahar Kechadi,
Shane O' Donnell,
Gerardine Doyle,
Sarah Browne,
Grace O' Malley,
Rachel Heimeier,
Katerina Riviou,
Evangelia Koukoula
, et al. (6 additional authors not shown)
Abstract:
Obesity is a complex disease and its prevalence depends on multiple factors related to the local socioeconomic, cultural and urban context of individuals. Many obesity prevention strategies and policies, however, are horizontal measures that do not depend on context-specific evidence. In this paper we present an overview of BigO (http://bigoprogram.eu), a system designed to collect objective behav…
▽ More
Obesity is a complex disease and its prevalence depends on multiple factors related to the local socioeconomic, cultural and urban context of individuals. Many obesity prevention strategies and policies, however, are horizontal measures that do not depend on context-specific evidence. In this paper we present an overview of BigO (http://bigoprogram.eu), a system designed to collect objective behavioral data from children and adolescent populations as well as their environment in order to support public health authorities in formulating effective, context-specific policies and interventions addressing childhood obesity. We present an overview of the data acquisition, indicator extraction, data exploration and analysis components of the BigO system, as well as an account of its preliminary pilot application in 33 schools and 2 clinics in four European countries, involving over 4,200 participants.
△ Less
Submitted 6 May, 2020;
originally announced May 2020.
-
Variational inequalities and mean-field approximations for partially observed systems of queueing networks
Authors:
Iker Perez,
Giuliano Casale
Abstract:
Queueing networks are systems of theoretical interest that find widespread use in the performance evaluation of interconnected resources. In comparison to counterpart models in genetics or mathematical biology, the stochastic (jump) processes induced by queueing networks have distinctive coupling and synchronization properties. This has prevented the derivation of variational approximations for co…
▽ More
Queueing networks are systems of theoretical interest that find widespread use in the performance evaluation of interconnected resources. In comparison to counterpart models in genetics or mathematical biology, the stochastic (jump) processes induced by queueing networks have distinctive coupling and synchronization properties. This has prevented the derivation of variational approximations for conditional representations of transient dynamics, which rely on simplifying independence assumptions. Here, we present a model augmentation to a multivariate counting process for interactions across service stations, and we enable the variational evaluation of mean-field measures for partially-observed multi-class networks. We also show that our framework offers an efficient and improved alternative for inference tasks, where existing variational or numerically intensive solutions do not work.
△ Less
Submitted 27 June, 2019; v1 submitted 23 July, 2018;
originally announced July 2018.
-
Solving the "false positives" problem in fraud prediction
Authors:
Roy Wedge,
James Max Kanter,
Santiago Moral Rubio,
Sergio Iglesias Perez,
Kalyan Veeramachaneni
Abstract:
In this paper, we present an automated feature engineering based approach to dramatically reduce false positives in fraud prediction. False positives plague the fraud prediction industry. It is estimated that only 1 in 5 declared as fraud are actually fraud and roughly 1 in every 6 customers have had a valid transaction declined in the past year. To address this problem, we use the Deep Feature Sy…
▽ More
In this paper, we present an automated feature engineering based approach to dramatically reduce false positives in fraud prediction. False positives plague the fraud prediction industry. It is estimated that only 1 in 5 declared as fraud are actually fraud and roughly 1 in every 6 customers have had a valid transaction declined in the past year. To address this problem, we use the Deep Feature Synthesis algorithm to automatically derive behavioral features based on the historical data of the card associated with a transaction. We generate 237 features (>100 behavioral patterns) for each transaction, and use a random forest to learn a classifier. We tested our machine learning model on data from a large multinational bank and compared it to their existing solution. On an unseen data of 1.852 million transactions, we were able to reduce the false positives by 54% and provide a savings of 190K euros. We also assess how to deploy this solution, and whether it necessitates streaming computation for real time scoring. We found that our solution can maintain similar benefits even when historical features are computed once every 7 days.
△ Less
Submitted 20 October, 2017;
originally announced October 2017.
-
Comparison of Selection Methods in On-line Distributed Evolutionary Robotics
Authors:
Iñaki Fernández Pérez,
Amine Boumaza,
François Charpillet
Abstract:
In this paper, we study the impact of selection methods in the context of on-line on-board distributed evolutionary algorithms. We propose a variant of the mEDEA algorithm in which we add a selection operator, and we apply it in a taskdriven scenario. We evaluate four selection methods that induce different intensity of selection pressure in a multi-robot navigation with obstacle avoidance task an…
▽ More
In this paper, we study the impact of selection methods in the context of on-line on-board distributed evolutionary algorithms. We propose a variant of the mEDEA algorithm in which we add a selection operator, and we apply it in a taskdriven scenario. We evaluate four selection methods that induce different intensity of selection pressure in a multi-robot navigation with obstacle avoidance task and a collective foraging task. Experiments show that a small intensity of selection pressure is sufficient to rapidly obtain good performances on the tasks at hand. We introduce different measures to compare the selection methods, and show that the higher the selection pressure, the better the performances obtained, especially for the more challenging food foraging task.
△ Less
Submitted 7 January, 2015;
originally announced January 2015.