Search | arXiv e-print repository

Large Language Models for Constrained-Based Causal Discovery

Authors: Kai-Hendrik Cohrs, Gherardo Varando, Emiliano Diaz, Vasileios Sitokonstantinou, Gustau Camps-Valls

Abstract: Causality is essential for understanding complex systems, such as the economy, the brain, and the climate. Constructing causal graphs often relies on either data-driven or expert-driven approaches, both fraught with challenges. The former methods, like the celebrated PC algorithm, face issues with data requirements and assumptions of causal sufficiency, while the latter demand substantial time and domain knowledge. This work explores the capabilities of Large Language Models (LLMs) as an alternative to domain experts for causal graph generation. We frame conditional independence queries as prompts to LLMs and employ the PC algorithm with the answers. The performance of the LLM-based conditional independence oracle on systems with known causal graphs shows a high degree of variability. We improve the performance through a proposed statistical-inspired voting schema that allows some control over false-positive and false-negative rates. Inspecting the chain-of-thought argumentation, we find causal reasoning to justify its answer to a probabilistic query. We show evidence that knowledge-based CIT could eventually become a complementary tool for data-driven causal discovery.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2403.14228 [pdf, other]

Recovering Latent Confounders from High-dimensional Proxy Variables

Authors: Nathan Mankovich, Homer Durand, Emiliano Diaz, Gherardo Varando, Gustau Camps-Valls

Abstract: Detecting latent confounders from proxy variables is an essential problem in causal effect estimation. Previous approaches are limited to low-dimensional proxies, sorted proxies, and binary treatments. We remove these assumptions and present a novel Proxy Confounder Factorization (PCF) framework for continuous treatment effect estimation when latent confounders manifest through high-dimensional, mixed proxy variables. For specific sample sizes, our two-step PCF implementation, using Independent Component Analysis (ICA-PCF), and the end-to-end implementation, using Gradient Descent (GD-PCF), achieve high correlation with the latent confounder and low absolute error in causal effect estimation with synthetic datasets in the high sample size regime. Even when faced with climate data, ICA-PCF recovers four components that explain $75.9\%$ of the variance in the North Atlantic Oscillation, a known confounder of precipitation patterns in Europe. Code for our PCF implementations and experiments can be found here: https://github.com/IPL-UV/confound_it. The proposed methodology constitutes a stepping stone towards discovering latent confounders and can be applied to many problems in disciplines dealing with high-dimensional observed proxies, e.g., spatiotemporal fields.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2401.12768 [pdf, other]

doi 10.1145/3643991.3644909

What Can Self-Admitted Technical Debt Tell Us About Security? A Mixed-Methods Study

Authors: Nicolás E. Díaz Ferreyra, Mojtaba Shahin, Mansooreh Zahedi, Sodiq Quadri, Ricardo Scandariato

Abstract: Self-Admitted Technical Debt (SATD) encompasses a wide array of sub-optimal design and implementation choices reported in software artefacts (e.g., code comments and commit messages) by developers themselves. Such reports have been central to the study of software maintenance and evolution over the last decades. However, they can also be deemed as dreadful sources of information on potentially exploitable vulnerabilities and security flaws. This work investigates the security implications of SATD from a technical and developer-centred perspective. On the one hand, it analyses whether security pointers disclosed inside SATD sources can be used to characterise vulnerabilities in Open-Source Software (OSS) projects and repositories. On the other hand, it delves into developers' perspectives regarding the motivations behind this practice, its prevalence, and its potential negative consequences. We followed a mixed-methods approach consisting of (i) the analysis of a preexisting dataset containing 8,812 SATD instances and (ii) an online survey with 222 OSS practitioners. We gathered 201 SATD instances through the dataset analysis and mapped them to different Common Weakness Enumeration (CWE) identifiers. Overall, 25 different types of CWEs were spotted across commit messages, pull requests, code comments, and issue sections, from which 8 appear among MITRE's Top-25 most dangerous ones.
The survey shows that software practitioners often place security pointers across SATD artefacts to promote a security culture among their peers and help them spot flaky code sections, among other motives. However, they also consider such a practice risky as it may facilitate vulnerability exploits. Our findings suggest that preserving the contextual integrity of security pointers disseminated across SATD artefacts is critical to safeguard both commercial and OSS solutions against zero-day attacks.

Submitted 2 March, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: Accepted at the 21st International Conference on Mining Software Repositories (MSR '24)

arXiv:2305.13341 [pdf, other]

Discovering Causal Relations and Equations from Data

Authors: Gustau Camps-Valls, Andreas Gerhardus, Urmi Ninad, Gherardo Varando, Georg Martius, Emili Balaguer-Ballester, Ricardo Vinuesa, Emiliano Diaz, Laure Zanna, Jakob Runge

Abstract: Physics is a field of science that has traditionally used the scientific method to answer questions about why natural phenomena occur and to make testable models that explain the phenomena. Discovering equations, laws and principles that are invariant, robust and causal explanations of the world has been fundamental in physical sciences throughout the centuries. Discoveries emerge from observing the world and, when possible, performing interventional studies in the system under study. With the advent of big data and the use of data-driven methods, causal and equation discovery fields have grown and made progress in computer science, physics, statistics, philosophy, and many applied fields. All these domains are intertwined and can be used to discover causal relations, physical laws, and equations from observational data. This paper reviews the concepts, methods, and relevant works on causal and equation discovery in the broad field of Physics and outlines the most important challenges and promising future lines of research. We also provide a taxonomy for observational causal and equation discovery, point out connections, and showcase a complete set of case studies in Earth and climate sciences, fluid dynamics and mechanics, and the neurosciences. This review demonstrates that discovering fundamental laws and causal relations by observing natural phenomena is being revolutionised with the efficient exploitation of observational data, modern machine learning algorithms and the interaction with domain knowledge.
Exciting times are ahead with many challenges and opportunities to improve our understanding of complex systems.

Submitted 21 May, 2023; originally announced May 2023.

Comments: 137 pages

arXiv:2305.09778 [pdf, other]

Shortest Path to Boundary for Self-Intersecting Meshes

Authors: He Chen, Elie Diaz, Cem Yuksel

Abstract: We introduce a method for efficiently computing the exact shortest path to the boundary of a mesh from a given internal point in the presence of self-intersections. We provide a formal definition of shortest boundary paths for self-intersecting objects and present a robust algorithm for computing the actual shortest boundary path. The resulting method offers an effective solution for collision and self-collision handling while simulating deformable volumetric objects, using fast simulation techniques that provide no guarantees on collision resolution. Our evaluation includes complex self-collision scenarios with a large number of active contacts, showing that our method can successfully handle them by introducing a relatively minor computational overhead.

Submitted 16 May, 2023; originally announced May 2023.

ACM Class: I.3.5

arXiv:2303.14229 [pdf, ps, other]

Sharp threshold for embedding balanced spanning trees in random geometric graphs

Authors: Alberto Espuny Díaz, Lyuben Lichev, Dieter Mitsche, Alexandra Wesolek

Abstract: A rooted tree is balanced if the degree of a vertex depends only on its distance to the root. In this paper we determine the sharp threshold for the appearance of a large family of balanced spanning trees in the random geometric graph $\mathcal{G}(n,r,d)$. In particular, we find the sharp threshold for balanced binary trees. More generally, we show that all sequences of balanced trees with uniformly bounded degrees and height tending to infinity appear above a sharp threshold, and none of these appears below the same value. Our results hold more generally for geometric graphs satisfying a mild condition on the distribution of their vertex set, and we provide a polynomial time algorithm to find such trees.

Submitted 24 March, 2023; originally announced March 2023.

arXiv:2303.09384 [pdf, other]

LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations

Authors: Catherine Tony, Markus Mutas, Nicolás E. Díaz Ferreyra, Riccardo Scandariato

Abstract: Large Language Models (LLMs) like Codex are powerful tools for performing code completion and code generation tasks as they are trained on billions of lines of code from publicly available sources. Moreover, these models are capable of generating code snippets from Natural Language (NL) descriptions by learning languages and programming practices from public GitHub repositories. Although LLMs promise an effortless NL-driven deployment of software applications, the security of the code they generate has not been extensively investigated nor documented. In this work, we present LLMSecEval, a dataset containing 150 NL prompts that can be leveraged for assessing the security performance of such models. Such prompts are NL descriptions of code snippets prone to various security vulnerabilities listed in MITRE's Top 25 Common Weakness Enumeration (CWE) ranking. Each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluations against code produced by LLMs. As a practical application, we show how LLMSecEval can be used for evaluating the security of snippets automatically generated from NL descriptions.

Submitted 16 March, 2023; originally announced March 2023.

Comments: Accepted at MSR '23 Data and Tool Showcase Track

arXiv:2303.09135 [pdf, other]

doi 10.1145/3544549.3585583

Regret, Delete, (Do Not) Repeat: An Analysis of Self-Cleaning Practices on Twitter After the Outbreak of the COVID-19 Pandemic

Authors: Nicolás E. Díaz Ferreyra, Gautam Kishore Shahi, Catherine Tony, Stefan Stieglitz, Riccardo Scandariato

Abstract: During the outbreak of the COVID-19 pandemic, many people shared their symptoms across Online Social Networks (OSNs) like Twitter, hoping for others' advice or moral support. Prior studies have shown that those who disclose health-related information across OSNs often tend to regret it and delete their publications afterwards. Hence, deleted posts containing sensitive data can be seen as manifestations of online regrets. In this work, we present an analysis of deleted content on Twitter during the outbreak of the COVID-19 pandemic. For this, we collected more than 3.67 million tweets describing COVID-19 symptoms (e.g., fever, cough, and fatigue) posted between January and April 2020. We observed that around 24% of the tweets containing personal pronouns were deleted either by their authors or by the platform after one year. As a practical application of the resulting dataset, we explored its suitability for the automatic classification of regrettable content on Twitter.

Submitted 16 March, 2023; originally announced March 2023.

Comments: Accepted at CHI '23 Late Breaking Work (LBW)

arXiv:2303.01822 [pdf, other]

Developers Need Protection, Too: Perspectives and Research Challenges for Privacy in Social Coding Platforms

Authors: Nicolás E. Díaz Ferreyra, Abdessamad Imine, Melina Vidoni, Riccardo Scandariato

Abstract: Social Coding Platforms (SCPs) like GitHub have become central to modern software engineering thanks to their collaborative and version-control features. Like in mainstream Online Social Networks (OSNs) such as Facebook, users of SCPs are subjected to privacy attacks and threats given the high amounts of personal and project-related data available in their profiles and software repositories. However, unlike in OSNs, the privacy concerns and practices of SCP users have not been extensively explored nor documented in the current literature. In this work, we present the preliminary results of an online survey (N=105) addressing developers' concerns and perceptions about privacy threats stemming from SCPs. Our results suggest that, although users express concern about social and organisational privacy threats, they often feel safe sharing personal and project-related information on these platforms. Moreover, attacks targeting the inference of sensitive attributes are considered more likely than those seeking to re-identify source-code contributors. Based on these findings, we propose a set of recommendations for future investigations addressing privacy and identity management in SCPs.

Submitted 3 March, 2023; originally announced March 2023.

Comments: Accepted at the 16th International Conference on Cooperative and Human Aspects of Software Engineering (CHASE 2023)

arXiv:2211.13498 [pdf, other]

GitHub Considered Harmful? Analyzing Open-Source Projects for the Automatic Generation of Cryptographic API Call Sequences

Authors: Catherine Tony, Nicolás E. Díaz Ferreyra, Riccardo Scandariato

Abstract: GitHub is a popular data repository for code examples. It is being continuously used to train several AI-based tools to automatically generate code. However, the effectiveness of such tools in correctly demonstrating the usage of cryptographic APIs has not been thoroughly assessed. In this paper, we investigate the extent and severity of misuses, specifically caused by incorrect cryptographic API call sequences in GitHub. We also analyze the suitability of GitHub data to train a learning-based model to generate correct cryptographic API call sequences. For this, we manually extracted and analyzed the call sequences from GitHub. Using this data, we augmented an existing learning-based model called DeepAPI to create two security-specific models that generate cryptographic API call sequences for a given natural language (NL) description. Our results indicate that it is imperative to not neglect the misuses in API call sequences while using data sources like GitHub, to train models that generate code.

Submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted at QRS 2022

arXiv:2208.07462 [pdf, other]

Speeding up random walk mixing by starting from a uniform vertex

Authors: Alberto Espuny Díaz, Patrick Morris, Guillem Perarnau, Oriol Serra

Abstract: The theory of rapid mixing random walks plays a fundamental role in the study of modern randomised algorithms. Usually, the mixing time is measured with respect to the worst initial position. It is well known that the presence of bottlenecks in a graph hampers mixing and, in particular, starting inside a small bottleneck significantly slows down the diffusion of the walk in the first steps of the process. The average mixing time is defined to be the mixing time starting at a uniformly random vertex and hence is not sensitive to the slow diffusion caused by these bottlenecks. In this paper we provide a general framework to show logarithmic average mixing time for random walks on graphs with small bottlenecks. The framework is especially effective on certain families of random graphs with heterogeneous properties. We demonstrate its applicability on two random models for which the mixing time was known to be of order $(\log n)^2$, speeding up the mixing to order $\log n$. First, in the context of smoothed analysis on connected graphs, we show logarithmic average mixing time for randomly perturbed graphs of bounded degeneracy. A particular instance is the Newman-Watts small-world model. Second, we show logarithmic average mixing time for supercritically percolated expander graphs. When the host graph is complete, this application gives an alternative proof that the average mixing time of the giant component in the supercritical Erdős-Rényi graph is logarithmic.

Submitted 27 January, 2024; v1 submitted 15 August, 2022; originally announced August 2022.

Comments: To appear in Electronic Journal of Probability

arXiv:2208.04649 [pdf, other]

doi 10.1145/3549015.3555674

ENAGRAM: An App to Evaluate Preventative Nudges for Instagram

Authors: Nicolás E. Díaz Ferreyra, Sina Ostendorf, Esma Aïmeur, Maritta Heisel, Matthias Brand

Abstract: Online self-disclosure is perhaps one of the last decade's most studied communication processes, thanks to the introduction of Online Social Networks (OSNs) like Facebook. Self-disclosure research has contributed significantly to the design of preventative nudges seeking to support and guide users when revealing private information in OSNs. Still, assessing the effectiveness of these solutions is often challenging since changing or modifying the choice architecture of OSN platforms is practically unfeasible. In turn, the effectiveness of numerous nudging designs is supported primarily by self-reported data instead of actual behavioral information. This work presents ENAGRAM, an app for evaluating preventative nudges, and reports the first results of an empirical study conducted with it. Such a study aims to showcase how the app (and the data collected with it) can be leveraged to assess the effectiveness of a particular nudging approach. We used ENAGRAM as a vehicle to test a risk-based strategy for nudging the self-disclosure decisions of Instagram users. For this, we created two variations of the same nudge and tested it in a between-subjects experimental setting. Study participants (N=22) were recruited via Prolific and asked to use the app regularly for 7 days. An online survey was distributed at the end of the experiment to measure some privacy-related constructs. From the data collected with ENAGRAM, we observed lower (though non-significant) self-disclosure levels when applying risk-based interventions.
The constructs measured with the survey were not significant either, except for participants' External Information Privacy Concerns. Our results suggest that (i) ENAGRAM is a suitable alternative for conducting longitudinal experiments in a privacy-friendly way, and (ii) it provides a flexible framework for the evaluation of a broad spectrum of nudging solutions.

Submitted 18 August, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

Comments: Accepted at the 2022 European Symposium on Usable Security (EuroUSEC 2022)

arXiv:2207.01529 [pdf, other]

Cybersecurity Discussions in Stack Overflow: A Developer-Centred Analysis of Engagement and Self-Disclosure Behaviour

Authors: Nicolás E. Díaz Ferreyra, Melina Vidoni, Maritta Heisel, Riccardo Scandariato

Abstract: Stack Overflow (SO) is a popular platform among developers seeking advice on various software-related topics, including privacy and security. As for many knowledge-sharing websites, the value of SO depends largely on users' engagement, namely their willingness to answer, comment or post technical questions. Still, many of these questions (including cybersecurity-related ones) remain unanswered, putting the site's relevance and reputation into question. Hence, it is important to understand users' participation in privacy and security discussions to promote engagement and foster the exchange of such expertise. Objective: Based on prior findings on online social networks, this work elaborates on the interplay between users' engagement and their privacy practices in SO. Particularly, it analyses developers' self-disclosure behaviour regarding profile visibility and their involvement in discussions related to privacy and security. Method: We followed a mixed-methods approach by (i) analysing SO data from 1239 cybersecurity-tagged questions along with 7048 user profiles, and (ii) conducting an anonymous online survey (N=64). Results: About 33% of the questions we retrieved had no answer, whereas more than 50% had no accepted answer. We observed that "proactive" users tend to disclose significantly less information in their profiles than "reactive" and "unengaged" ones. However, no correlations were found between these engagement categories and privacy-related constructs such as Perceived Control or General Privacy Concerns.
Implications: These findings contribute to (i) a better understanding of developers' engagement towards privacy and security topics, and (ii) to shape strategies promoting the exchange of cybersecurity expertise in SO.

Submitted 4 July, 2022; originally announced July 2022.

Comments: Submitted for publication

arXiv:2205.06200 [pdf, other]

Conversational DevBots for Secure Programming: An Empirical Study on SKF Chatbot

Authors: Catherine Tony, Mohana Balasubramanian, Nicolás E. Díaz Ferreyra, Riccardo Scandariato

Abstract: Conversational agents or chatbots are widely investigated and used across different fields including healthcare, education, and marketing. Still, the development of chatbots for assisting secure coding practices is in its infancy. In this paper, we present the results of an empirical study on SKF chatbot, a software-development bot (DevBot) designed to answer queries about software security. To the best of our knowledge, SKF chatbot is one of the very few of its kind, thus a representative instance of conversational DevBots aiding secure software development. In this study, we collect and analyse empirical evidence on the effectiveness of SKF chatbot, while assessing the needs and expectations of its users (i.e., software developers). Furthermore, we explore the factors that may hinder the elaboration of more sophisticated conversational security DevBots and identify features for improving the efficiency of state-of-the-art solutions. All in all, our findings provide valuable insights pointing towards the design of more context-aware and personalized conversational DevBots for security engineering.

Submitted 12 May, 2022; originally announced May 2022.

Comments: Accepted paper at the 2022 International Conference on Evaluation and Assessment in Software Engineering (EASE)

arXiv:2202.11969 [pdf, ps, other]

Should I Get Involved? On the Privacy Perils of Mining Software Repositories for Research Participants

Authors: Melina Vidoni, Nicolás E. Díaz Ferreyra

Abstract: Mining Software Repositories (MSRs) is an evidence-based methodology that cross-links data to uncover actionable information about software systems. Empirical studies in software engineering often leverage MSR techniques as they allow researchers to unveil issues and flaws in software development so as to analyse the different factors contributing to them. Hence, counting on fine-grained information about the repositories and sources being mined (e.g., server names, and contributors' identities) is essential for the reproducibility and transparency of MSR studies. However, this can also introduce threats to participants' privacy as their identities may be linked to flawed/sub-optimal programming practices (e.g., code smells, improper documentation), or vice-versa. Moreover, this can extend to close collaborators and community members, resulting in "guilt by association". This position paper aims to start a discussion about indirect participation in MSR investigations, the dichotomy of 'privacy vs. utility' regarding sharing non-aggregated data, and its effects on privacy restrictions and ethical considerations for participant involvement.

Submitted 24 February, 2022; originally announced February 2022.

Comments: Accepted at ROPES'22: 1st International Workshop on Recruiting Participants for Empirical Software Engineering (co-located with ICSE 2022)

arXiv:2202.01612 [pdf, other]

doi 10.1145/3538969.3538986

SoK: Security of Microservice Applications: A Practitioners' Perspective on Challenges and Best Practices

Authors: Priyanka Billawa, Anusha Bambhore Tukaram, Nicolás E. Díaz Ferreyra, Jan-Philipp Steghöfer, Riccardo Scandariato, Georg Simhandl

Abstract: Cloud-based application deployment is becoming increasingly popular among businesses, thanks to the emergence of microservices. However, securing such architectures is a challenging task since traditional security concepts cannot be directly applied to microservice architectures due to their distributed nature. The situation is exacerbated by the scattered nature of guidelines and best practices advocated by practitioners and organizations in this field. In this research paper, we aim to shed light on the current microservice security discussions hidden within Grey Literature (GL) sources. Particularly, we identify the challenges that arise when securing microservice architectures, as well as solutions recommended by practitioners to address these issues. For this, we conducted a systematic GL study on the challenges and best practices of microservice security present on the Internet with the goal of capturing relevant discussions in blogs, white papers, and standards. We collected 312 GL sources from which 57 were rigorously classified and analyzed. This analysis on the one hand validated past academic literature studies in the area of microservice security, but it also identified improvements to existing methodologies pointing towards future research directions.

Submitted 2 September, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

Comments: Accepted at the 17th International Conference on Availability, Reliability and Security (ARES 2022)

ACM Class: D.4.6

arXiv:2104.09137 [pdf, other]

doi 10.1016/j.osnem.2022.100203

Community Detection for Access-Control Decisions: Analysing the Role of Homophily and Information Diffusion in Online Social Networks

Authors: Nicolas E. Diaz Ferreyra, Tobias Hecking, Esma Aïmeur, Maritta Heisel, H. Ulrich Hoppe

Abstract: Access-Control Lists (ACLs) (a.k.a. friend lists) are one of the most important privacy features of Online Social Networks (OSNs) as they allow users to restrict the audience of their publications. Nevertheless, creating and maintaining custom ACLs can introduce a high cognitive burden on average OSN users since it normally requires assessing the trustworthiness of a large number of contacts. In principle, community detection algorithms can be leveraged to support the generation of ACLs by mapping a set of examples (i.e. contacts labelled as untrusted) to the emerging communities inside the user's ego-network. However, unlike users' access-control preferences, traditional community-detection algorithms do not take the homophily characteristics of such communities into account (i.e. attributes shared among members). Consequently, this strategy may lead to inaccurate ACL configurations and privacy breaches under certain homophily scenarios. This work investigates the use of community-detection algorithms for the automatic generation of ACLs in OSNs. Particularly, it analyses the performance of the aforementioned approach under different homophily conditions through a simulation model. Furthermore, since private information may reach the scope of untrusted recipients through the re-sharing affordances of OSNs, information diffusion processes are also modelled and taken explicitly into account. Altogether, the removal of gatekeeper nodes is further explored as a strategy to counteract unwanted data dissemination.

Submitted 7 June, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2012.04922 [pdf, other]

Consistent regression of biophysical parameters with kernel methods

Authors: Emiliano Díaz, Adrián Pérez-Suay, Valero Laparra, Gustau Camps-Valls

Abstract: This paper introduces a novel statistical regression framework that allows the incorporation of consistency constraints. A linear and nonlinear (kernel-based) formulation are introduced, and both imply closed-form analytical solutions. The models exploit all the information from a set of drivers while being maximally independent of a set of auxiliary, protected variables. We successfully illustrate the performance in the estimation of chlorophyll content.

Submitted 9 December, 2020; originally announced December 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1710.05578

arXiv:2009.12853 [pdf, other]

doi 10.5220/0010142402040211

Persuasion Meets AI: Ethical Considerations for the Design of Social Engineering Countermeasures

Authors: Nicolas E. Díaz Ferreyra, Esma Aïmeur, Hicham Hage, Maritta Heisel, Catherine García van Hoogstraten

Abstract: Privacy in Social Network Sites (SNSs) like Facebook or Instagram is closely related to people's self-disclosure decisions and their ability to foresee the consequences of sharing personal information with large and diverse audiences. Nonetheless, online privacy decisions are often based on spurious risk judgements that make people liable to reveal sensitive data to untrusted recipients and become victims of social engineering attacks. Artificial Intelligence (AI) in combination with persuasive mechanisms like nudging is a promising approach for promoting preventative privacy behaviour among the users of SNSs. Nevertheless, combining behavioural interventions with high levels of personalization can be a potential threat to people's agency and autonomy even when applied to the design of social engineering countermeasures. This paper elaborates on the ethical challenges that nudging mechanisms can introduce to the development of AI-based countermeasures, particularly to those addressing unsafe self-disclosure practices in SNSs. Overall, it endorses the elaboration of personalized risk awareness solutions as i) an ethical approach to counteract social engineering, and ii) as an effective means for promoting reflective privacy decisions.

Submitted 27 September, 2020; originally announced September 2020.

Comments: Accepted for publication at IC3K 2020

arXiv:2008.09391 [pdf, other]

doi 10.1145/3314183.3323849

Learning from Online Regrets: From Deleted Posts to Risk Awareness in Social Network Sites

Authors: Nicolas E. Diaz Ferreyra, Rene Meis, Maritta Heisel

Abstract: Social Network Sites (SNSs) like Facebook or Instagram are spaces where people expose their lives to wide and diverse audiences. This practice can lead to unwanted incidents such as reputation damage, job loss or harassment when pieces of private information reach unintended recipients. As a consequence, users often regret having posted private information on these platforms and proceed to delete such content after having a negative experience. Risk awareness is a strategy that can be used to persuade users towards safer privacy decisions. However, many risk awareness technologies for SNSs assume that information about risks is retrieved and measured by an expert in the field. Consequently, risk estimation is an activity that is often passed over despite its importance. In this work we introduce an approach that employs deleted posts as risk information vehicles to measure the frequency and consequence level of self-disclosure patterns in SNSs. In this method, consequence is reported by the users through an ordinal scale and used later on to compute a risk criticality index. We thereupon show how this index can serve in the design of adaptive privacy nudges for SNSs.

Submitted 21 August, 2020; originally announced August 2020.

arXiv:1809.10421 [pdf, ps, other]

Entropy versions of additive inequalities

Authors: Alberto Espuny Díaz, Oriol Serra

Abstract: The connection between inequalities in additive combinatorics and analogous versions in terms of the entropy of random variables has been extensively explored over the past few years. This paper extends a device introduced by Ruzsa in his seminal work introducing this correspondence. This extension provides a toolbox for establishing the equivalence between sumset inequalities and their entropic versions. It supplies simpler proofs of known results and opens a path for obtaining new ones.

Submitted 27 May, 2019; v1 submitted 27 September, 2018; originally announced September 2018.

Comments: The former version had an error the authors could not fix. This version keeps only the parts that did not depend on the incorrect statement

arXiv:1704.00829 [pdf, other]

Online deforestation detection

Authors: Emiliano Diaz

Abstract: Deforestation detection using satellite images can make an important contribution to forest management. Current approaches can be broadly divided into those that compare two images taken at similar periods of the year and those that monitor changes by using multiple images taken during the growing season. The CMFDA algorithm described in Zhu et al. (2012) is an algorithm that builds on the latter category by implementing a year-long, continuous, time-series based approach to monitoring images. This algorithm was developed for 30m resolution, 16-day frequency reflectance data from the Landsat satellite. In this work we adapt the algorithm to 1km, 16-day frequency reflectance data from the MODIS sensor aboard the Terra satellite. The CMFDA algorithm is composed of two submodels which are fitted on a pixel-by-pixel basis. The first estimates the amount of surface reflectance as a function of the day of the year. The second estimates the occurrence of a deforestation event by comparing the last few predicted and real reflectance values. For this comparison, the reflectance observations for six different bands are first combined into a forest index. Real and predicted values of the forest index are then compared and high absolute differences for consecutive observation dates are flagged as deforestation events. Our adapted algorithm also uses the two-model framework. However, since the MODIS 13A2 dataset used includes reflectance data for different spectral bands than those included in the Landsat dataset, we cannot construct the forest index. Instead we propose two contrasting approaches: a multivariate and an index approach similar to that of CMFDA.

Submitted 3 April, 2017; originally announced April 2017.

arXiv:1704.00575 [pdf, other]

Sparse mean localization by information theory

Authors: Emiliano Diaz

Abstract: Sparse feature selection is necessary when we fit statistical models, have access to a large group of features, and don't know which are relevant but assume that most are not. Alternatively, when the number of features is larger than the available data, the model becomes overparametrized and the sparse feature selection task involves selecting the most informative variables for the model. When the model is a simple location model and the number of relevant features does not grow with the total number of features, sparse feature selection corresponds to sparse mean estimation. We deal with a simplified mean estimation problem consisting of an additive model with Gaussian noise and a mean that is in a restricted, finite hypothesis space. This restriction simplifies the mean estimation problem into a selection problem of combinatorial nature. Although the hypothesis space is finite, its size is exponential in the dimension of the mean. In limited data settings and when the size of the hypothesis space depends on the amount of data or on the dimension of the data, choosing an approximation set of hypotheses is a desirable approach. Choosing a set of hypotheses instead of a single one implies replacing the bias-variance trade-off with a resolution-stability trade-off. Generalization capacity provides a resolution selection criterion based on allowing the learning algorithm to communicate the largest amount of information in the data to the learner without error. In this work the theory of approximation set coding and generalization capacity is explored in order to understand this approach. We then apply the generalization capacity criterion to the simplified sparse mean estimation problem and detail an importance sampling algorithm which at once solves the difficulty posed by large hypothesis spaces and the slow convergence of uniform sampling algorithms.

Submitted 3 April, 2017; originally announced April 2017.

Showing 1–23 of 23 results for author: Díaz, E