Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Authors: Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala

Abstract: Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off. Objectives: The aim of this study is to evaluate the Mann-Whitney U test on DP-synthetic biomedical data in terms of Type I and Type II errors, in order to establish whether statistical hypothesis testing performed on privacy-preserving synthetic data is likely to lead to loss of the test's validity or decreased power. Methods: We evaluate the Mann-Whitney U test on DP-synthetic data generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as on data drawn from two Gaussian distributions. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms. Conclusion: Most of the tested DP-synthetic data generation methods showed inflated Type I error, especially at privacy budget levels of $ε\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget ($ε\geq 5$) in order to have reasonable Type II error levels.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2401.02450 [pdf, other]

Locally Differentially Private Embedding Models in Distributed Fraud Prevention Systems

Authors: Iker Perez, Jason Wong, Piotr Skalski, Stuart Burrell, Richard Mortier, Derek McAuley, David Sutton

Abstract: Global financial crime activity is driving demand for machine learning solutions in fraud prevention. However, prevention systems are commonly serviced to financial institutions in isolation, and few provisions exist for data sharing due to fears of unintentional leaks and adversarial attacks. Collaborative learning advances in finance are rare, and it is hard to find real-world insights derived from privacy-preserving data processing systems. In this paper, we present a collaborative deep learning framework for fraud prevention, designed from a privacy standpoint, and awarded at the recent PETs Prize Challenges. We leverage latent embedded representations of varied-length transaction sequences, along with local differential privacy, in order to construct a data release mechanism which can securely inform externally hosted fraud and anomaly detection models. We assess our contribution on two distributed data sets donated by large payment networks, and demonstrate robustness to popular inference-time attacks, along with utility-privacy trade-offs analogous to published work in alternative application domains.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01641 [pdf, other]

doi 10.1145/3604237.3626850

Towards a Foundation Purchasing Model: Pretrained Generative Autoregression on Transaction Sequences

Authors: Piotr Skalski, David Sutton, Stuart Burrell, Iker Perez, Jason Wong

Abstract: Machine learning models underpin many modern financial systems for use cases such as fraud detection and churn prediction. Most are based on supervised learning with hand-engineered features, which relies heavily on the availability of labelled data. Large self-supervised generative models have shown tremendous success in natural language processing and computer vision, yet so far they haven't been adapted to multivariate time series of financial transactions. In this paper, we present a generative pretraining method that can be used to obtain contextualised embeddings of financial transactions. Benchmarks on public datasets demonstrate that it outperforms state-of-the-art self-supervised methods on a range of downstream tasks. We additionally perform large-scale pretraining of an embedding model using a corpus of data from 180 issuing banks containing 5.1 billion transactions and apply it to the card fraud detection problem on hold-out datasets. The embedding model significantly improves value detection rate at high precision thresholds and transfers well to out-of-domain distributions.

Submitted 4 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Journal ref: 4th ACM International Conference on AI in Finance (ICAIF '23), November 27-29, 2023, Brooklyn, NY, USA

arXiv:2212.14143 [pdf, other]

Multimodal Wildland Fire Smoke Detection

Authors: Siddhant Baldota, Shreyas Anantha Ramaprasad, Jaspreet Kaur Bhamra, Shane Luna, Ravi Ramachandra, Eugene Zen, Harrison Kim, Daniel Crawl, Ismael Perez, Ilkay Altintas, Garrison W. Cottrell, Mai H. Nguyen

Abstract: Research has shown that climate change creates warmer temperatures and drier conditions, leading to longer wildfire seasons and increased wildfire risks in the United States. These factors have in turn led to increases in the frequency, extent, and severity of wildfires in recent years. Given the danger posed by wildland fires to people, property, wildlife, and the environment, there is an urgency to provide tools for effective wildfire management. Early detection of wildfires is essential to minimizing potentially catastrophic destruction. In this paper, we present our work on integrating multiple data sources in SmokeyNet, a deep learning model using spatio-temporal information to detect smoke from wildland fires. Camera image data is integrated with weather sensor measurements and processed by SmokeyNet to create a multimodal wildland fire smoke detection system. We present our results comparing performance in terms of both accuracy and time-to-detection for multimodal data vs. a single data source. With a time-to-detection of only a few minutes, SmokeyNet can serve as an automated early notification system, providing a useful tool in the fight against destructive wildfires.

Submitted 28 December, 2022; originally announced December 2022.

arXiv:2211.06918 [pdf, other]

Towards a Dynamic Composability Approach for using Heterogeneous Systems in Remote Sensing

Authors: Ilkay Altintas, Ismael Perez, Dmitry Mishin, Adrien Trouillaud, Christopher Irving, John Graham, Mahidhar Tatineni, Thomas DeFanti, Shawn Strande, Larry Smarr, Michael L. Norman

Abstract: Influenced by the advances in data and computing, the scientific practice increasingly involves machine learning and artificial intelligence driven methods which require specialized capabilities at the system-, science- and service-level in addition to the conventional large-capacity supercomputing approaches. The latest distributed architectures built around the composability of data-centric applications led to the emergence of a new ecosystem for container coordination and integration. However, there is still a divide between the application development pipelines of existing supercomputing environments, and these new dynamic environments that disaggregate fluid resource pools through accessible, portable and re-programmable interfaces. New approaches for dynamic composability of heterogeneous systems are needed to further advance the data-driven scientific practice for the purpose of more efficient computing and usable tools for specific scientific domains. In this paper, we present a novel approach for using composable systems in the intersection between scientific computing, artificial intelligence (AI), and remote sensing domain. We describe the architecture of a first working example of a composable infrastructure that federates Expanse, an NSF-funded supercomputer, with Nautilus, a Kubernetes-based GPU geo-distributed cluster. We also summarize a case study in wildfire modeling that demonstrates the application of this new infrastructure in scientific workflows: a composed system that bridges the insights from edge sensing, AI and computing capabilities with a physics-driven simulation.

Submitted 13 November, 2022; originally announced November 2022.

Comments: 18th IEEE International Conference on eScience (2022)

arXiv:2209.14030 [pdf, other]

doi 10.4204/EPTCS.371.15

Monitoring ROS2: from Requirements to Autonomous Robots

Authors: Ivan Perez, Anastasia Mavridou, Tom Pressburger, Alexander Will, Patrick J. Martin

Abstract: Runtime verification (RV) has the potential to enable the safe operation of safety-critical systems that are too complex to formally verify, such as Robot Operating System 2 (ROS2) applications. Writing correct monitors can itself be complex, and errors in the monitoring subsystem threaten the mission as a whole. This paper provides an overview of a formal approach to generating runtime monitors for autonomous robots from requirements written in a structured natural language. Our approach integrates the Formal Requirement Elicitation Tool (FRET) with Copilot, a runtime verification framework, through the Ogma integration tool. FRET is used to specify requirements with unambiguous semantics, which are then automatically translated into temporal logic formulae. Ogma generates monitor specifications from the FRET output, which are compiled into hard real-time C99. To facilitate integration of the monitors in ROS2, we have extended Ogma to generate ROS2 packages defining monitoring nodes, which run the monitors when new data becomes available, and publish the results of any violations. The goal of our approach is to treat the generated ROS2 packages as black boxes and integrate them into larger ROS2 systems with minimal effort.

Submitted 28 September, 2022; originally announced September 2022.

Comments: In Proceedings FMAS2022 ASYDE2022, arXiv:2209.13181

ACM Class: D.2.1; D.2.4; I.2.9;

Journal ref: EPTCS 371, 2022, pp. 208-216

arXiv:2206.15195 [pdf, other]

The Topological BERT: Transforming Attention into Topology for Natural Language Processing

Authors: Ilan Perez, Raphael Reinauer

Abstract: In recent years, the introduction of the Transformer models sparked a revolution in natural language processing (NLP). BERT was one of the first text encoders using only the attention mechanism without any recurrent parts to achieve state-of-the-art results on many NLP tasks. This paper introduces a text classifier using topological data analysis. We use BERT's attention maps transformed into attention graphs as the only input to that classifier. The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive. It performs comparably to the BERT baseline and outperforms it on some tasks. Additionally, we propose a new method to reduce the number of BERT's attention heads considered by the topological classifier, which allows us to prune the number of heads from 144 down to as few as ten with no reduction in performance. Our work also shows that the topological model displays higher robustness against adversarial attacks than the original BERT model, which is maintained during the pruning process. To the best of our knowledge, this work is the first to confront topological-based models with adversarial attacks in the context of NLP.

Submitted 30 June, 2022; originally announced June 2022.

arXiv:2112.08598 [pdf, other]

doi 10.3390/rs14041007

FIgLib & SmokeyNet: Dataset and Deep Learning Model for Real-Time Wildland Fire Smoke Detection

Authors: Anshuman Dewangan, Yash Pande, Hans-Werner Braun, Frank Vernon, Ismael Perez, Ilkay Altintas, Garrison W. Cottrell, Mai H. Nguyen

Abstract: The size and frequency of wildland fires in the western United States have dramatically increased in recent years. On high-fire-risk days, a small fire ignition can rapidly grow and become out of control. Early detection of fire ignitions from initial smoke can assist the response to such fires before they become difficult to manage. Past deep learning approaches for wildfire smoke detection have suffered from small or unreliable datasets that make it difficult to extrapolate performance to real-world scenarios. In this work, we present the Fire Ignition Library (FIgLib), a publicly available dataset of nearly 25,000 labeled wildfire smoke images as seen from fixed-view cameras deployed in Southern California. We also introduce SmokeyNet, a novel deep learning architecture using spatiotemporal information from camera imagery for real-time wildfire smoke detection. When trained on the FIgLib dataset, SmokeyNet outperforms comparable baselines and rivals human performance. We hope that the availability of the FIgLib dataset and the SmokeyNet architecture will inspire further research into deep learning methods for wildfire smoke detection, leading to automated notification systems that reduce the time to wildfire response.

Submitted 14 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Journal ref: Remote Sensing. 2022; 14(4):1007

arXiv:2107.08756 [pdf, other]

Attribution of Predictive Uncertainties in Classification Models

Authors: Iker Perez, Piotr Skalski, Alec Barns-Graham, Jason Wong, David Sutton

Abstract: Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.

Submitted 8 June, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

Journal ref: Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR 180:1582-1591, 2022

arXiv:2012.05136 [pdf, other]

doi 10.1016/j.sysarc.2020.101832

Efficient Bypass in Mesh and Torus NoCs

Authors: Iván Pérez, Enrique Vallejo, Ramón Beivide

Abstract: Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering, reducing router delay. However, the conditions for using them require completely empty buffers in the intermediate routers. This restricts the number of flits that use the bypass pipeline, especially at medium and high loads, increasing latency and power. This paper presents NEBB, Non-Empty Buffer Bypass, a mechanism that allows flits to be bypassed even if the buffers to bypass are not empty. The mechanism applies to wormhole and virtual-cut-through, each of them with different advantages. NEBB-Hybrid is proposed to employ the best flow control in each situation. The mechanism is extended to torus topologies, using FBFC and shared buffers. The proposals have been evaluated using Booksim, showing up to 75% reduction of the buffered flits for single-flit packets, which translates into latency and dynamic power reductions of up to 30% and 23% respectively. For bimodal traffic, these improvements are 20% and 21% respectively. Additionally, the bypass utilization is largely independent of the number of VCs when using shared buffers and very competitive with few private ones, which allows the allocation mechanisms to be simplified.

Submitted 10 December, 2020; v1 submitted 9 December, 2020; originally announced December 2020.

Comments: 14 pages, 16 figures, LaTeX; this review is an update of the preprint to the accepted manuscript of the paper; the final version of this work has been published in the Journal of Systems Architecture, DOI: https://doi.org/10.1016/j.sysarc.2020.101832

Journal ref: Journal of Systems Architecture Volume 108, September 2020, 101832

arXiv:2012.03745 [pdf, other]

doi 10.4204/EPTCS.329.3

From Requirements to Autonomous Flight: An Overview of the Monitoring ICAROUS Project

Authors: Aaron Dutle, César Muñoz, Esther Conrad, Alwyn Goodloe, Laura Titolo, Ivan Perez, Swee Balachandran, Dimitra Giannakopoulou, Anastasia Mavridou, Thomas Pressburger

Abstract: The Independent Configurable Architecture for Reliable Operations of Unmanned Systems (ICAROUS) is a software architecture incorporating a set of algorithms to enable autonomous operations of unmanned aircraft applications. This paper provides an overview of Monitoring ICAROUS, a project whose objective is to provide a formal approach to generating runtime monitors for autonomous systems from requirements written in a structured natural language. This approach integrates FRET, a formal requirement elicitation and authoring tool, and Copilot, a runtime verification framework. FRET is used to specify formal requirements in structured natural language. These requirements are translated into temporal logic formulae. Copilot is then used to generate executable runtime monitors from these temporal logic specifications. The generated monitors are directly integrated into ICAROUS to perform runtime verification during flight.

Submitted 2 December, 2020; originally announced December 2020.

Comments: In Proceedings FMAS 2020, arXiv:2012.01176

Journal ref: EPTCS 329, 2020, pp. 23-30

arXiv:2005.02928 [pdf, other]

BigO: A public health decision support system for measuring obesogenic behaviors of children in relation to their local environment

Authors: Christos Diou, Ioannis Sarafis, Vasileios Papapanagiotou, Leonidas Alagialoglou, Irini Lekka, Dimitrios Filos, Leandros Stefanopoulos, Vasileios Kilintzis, Christos Maramis, Youla Karavidopoulou, Nikos Maglaveras, Ioannis Ioakimidis, Evangelia Charmandari, Penio Kassari, Athanasia Tragomalou, Monica Mars, Thien-An Ngoc Nguyen, Tahar Kechadi, Shane O'Donnell, Gerardine Doyle, Sarah Browne, Grace O'Malley, Rachel Heimeier, Katerina Riviou, Evangelia Koukoula, et al. (6 additional authors not shown)

Abstract: Obesity is a complex disease and its prevalence depends on multiple factors related to the local socioeconomic, cultural and urban context of individuals. Many obesity prevention strategies and policies, however, are horizontal measures that do not depend on context-specific evidence. In this paper we present an overview of BigO (http://bigoprogram.eu), a system designed to collect objective behavioral data from children and adolescent populations as well as their environment in order to support public health authorities in formulating effective, context-specific policies and interventions addressing childhood obesity. We present an overview of the data acquisition, indicator extraction, data exploration and analysis components of the BigO system, as well as an account of its preliminary pilot application in 33 schools and 2 clinics in four European countries, involving over 4,200 participants.

Submitted 6 May, 2020; originally announced May 2020.

Comments: Accepted version to be published in 2020, 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Montreal, Canada

arXiv:1807.08673 [pdf, ps, other]

Variational inequalities and mean-field approximations for partially observed systems of queueing networks

Authors: Iker Perez, Giuliano Casale

Abstract: Queueing networks are systems of theoretical interest that find widespread use in the performance evaluation of interconnected resources. In comparison to counterpart models in genetics or mathematical biology, the stochastic (jump) processes induced by queueing networks have distinctive coupling and synchronization properties. This has prevented the derivation of variational approximations for conditional representations of transient dynamics, which rely on simplifying independence assumptions. Here, we present a model augmentation to a multivariate counting process for interactions across service stations, and we enable the variational evaluation of mean-field measures for partially-observed multi-class networks. We also show that our framework offers an efficient and improved alternative for inference tasks, where existing variational or numerically intensive solutions do not work.

Submitted 27 June, 2019; v1 submitted 23 July, 2018; originally announced July 2018.

arXiv:1710.07709 [pdf, other]

Solving the "false positives" problem in fraud prediction

Authors: Roy Wedge, James Max Kanter, Santiago Moral Rubio, Sergio Iglesias Perez, Kalyan Veeramachaneni

Abstract: In this paper, we present an automated feature-engineering-based approach to dramatically reduce false positives in fraud prediction. False positives plague the fraud prediction industry. It is estimated that only 1 in 5 transactions declared as fraud are actually fraud, and roughly 1 in every 6 customers have had a valid transaction declined in the past year. To address this problem, we use the Deep Feature Synthesis algorithm to automatically derive behavioral features based on the historical data of the card associated with a transaction. We generate 237 features (>100 behavioral patterns) for each transaction, and use a random forest to learn a classifier. We tested our machine learning model on data from a large multinational bank and compared it to their existing solution. On unseen data of 1.852 million transactions, we were able to reduce the false positives by 54% and provide a savings of 190K euros. We also assess how to deploy this solution, and whether it necessitates streaming computation for real time scoring. We found that our solution can maintain similar benefits even when historical features are computed once every 7 days.

Submitted 20 October, 2017; originally announced October 2017.

arXiv:1501.01457 [pdf, ps, other]

doi 10.7551/978-0-262-32621-6-ch046

Comparison of Selection Methods in On-line Distributed Evolutionary Robotics

Authors: Iñaki Fernández Pérez, Amine Boumaza, François Charpillet

Abstract: In this paper, we study the impact of selection methods in the context of on-line on-board distributed evolutionary algorithms. We propose a variant of the mEDEA algorithm in which we add a selection operator, and we apply it in a task-driven scenario. We evaluate four selection methods that induce different intensities of selection pressure in a multi-robot navigation with obstacle avoidance task and a collective foraging task. Experiments show that a small intensity of selection pressure is sufficient to rapidly obtain good performances on the tasks at hand. We introduce different measures to compare the selection methods, and show that the higher the selection pressure, the better the performances obtained, especially for the more challenging food foraging task.

Submitted 7 January, 2015; originally announced January 2015.

Journal ref: ALIFE 14, Jul 2014, New York, United States. Artificial Life 14 in Complex Adaptive Systems, MIT Press, Artificial Life 14

Showing 1–15 of 15 results for author: Pérez, I