Search | arXiv e-print repository

arXiv:2301.03207

Negative Results of Fusing Code and Documentation for Learning to Accurately Identify Sensitive Source and Sink Methods: An Application to the Android Framework for Data Leak Detection

Authors: Jordan Samhi, Maria Kober, Abdoul Kader Kabore, Steven Arzt, Tegawendé F. Bissyandé, Jacques Klein

Abstract: Apps on mobile phones manipulate all sorts of data, including sensitive data, leading to privacy-related concerns. Recent regulations like the European GDPR provide rules for the processing of personal and sensitive data, like that no such data may be leaked without the consent of the user. Researchers have proposed sophisticated approaches to track sensitive data within mobile apps, all of which rely on specific lists of sensitive source and sink API methods. The data flow analysis results greatly depend on these lists' quality. Previous approaches either used incomplete hand-written lists that quickly became outdated or relied on machine learning. The latter, however, leads to numerous false positives, as we show. This paper introduces CoDoC, a tool that aims to revive the machine-learning approach to precisely identify privacy-related source and sink API methods. In contrast to previous approaches, CoDoC uses deep learning techniques and combines the source code with the documentation of API methods. Firstly, we propose novel definitions that clarify the concepts of sensitive source and sink methods. Secondly, based on these definitions, we build a new ground truth of Android methods representing sensitive source, sink, and neither (i.e., no source or sink) methods that will be used to train our classifier. We evaluate CoDoC and show that, on our validation dataset, it achieves a precision, recall, and F1 score of 91% in 10-fold cross-validation, outperforming the state-of-the-art SuSi when used on the same dataset. However, similarly to existing tools, we show that in the wild, i.e., with unseen data, CoDoC performs poorly and generates many false positive results. Our findings, together with time-tested results of previous approaches, suggest that machine-learning models for abstract concepts such as privacy fail in practice despite good lab results.

Submitted 11 January, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: 30th IEEE International Conference on Software Analysis, Evolution and Reengineering, RENE track

arXiv:2103.08482

Surface Topography Characterization Using a Simple Optical Device and Artificial Neural Networks

Authors: Christoph Angermann, Markus Haltmeier, Christian Laubichler, Steinbjörn Jónsson, Matthias Schwab, Adéla Moravová, Constantin Kiesling, Martin Kober, Wolfgang Fimml

Abstract: State-of-the-art methods for quantifying wear in cylinder liners of large internal combustion engines require disassembly and cutting of the liner. This is followed by laboratory-based high-resolution microscopic surface depth measurement that quantitatively evaluates wear based on bearing load curves (Abbott-Firestone curves). Such methods are destructive, time-consuming and costly. The goal of the research presented is to develop nondestructive yet reliable methods for quantifying the surface topography. A novel machine learning framework is proposed that allows prediction of the bearing load curves from RGB images of the liner surface that can be collected with a handheld microscope. A joint deep learning approach involving two neural network modules optimizes the prediction quality of surface roughness parameters as well and is trained using a custom-built database containing 422 aligned depth profile and reflection image pairs of liner surfaces. The observed success suggests its great potential for on-site wear assessment of engines during service.

Submitted 8 July, 2022; v1 submitted 15 March, 2021; originally announced March 2021.

arXiv:2103.02349

A Hamiltonian Monte Carlo Model for Imputation and Augmentation of Healthcare Data

Authors: Narges Pourshahrokhi, Samaneh Kouchaki, Kord M. Kober, Christine Miaskowski, Payam Barnaghi

Abstract: Missing values exist in nearly all clinical studies because data for a variable or question are not collected or not available. Inadequate handling of missing values can lead to biased results and loss of statistical power in analysis. Existing models usually do not consider privacy concerns or do not utilise the inherent correlations across multiple features to impute the missing values. In healthcare applications, we are usually confronted with high dimensional and sometimes small sample size datasets that need more effective augmentation or imputation techniques. Besides, imputation and augmentation processes are traditionally conducted individually. However, imputing missing values and augmenting data can significantly improve generalisation and avoid bias in machine learning models. A Bayesian approach to impute missing values and creating augmented samples in high dimensional healthcare data is proposed in this work. We propose folded Hamiltonian Monte Carlo (F-HMC) with Bayesian inference as a more practical approach to process the cross-dimensional relations by applying a random walk and Hamiltonian dynamics to adapt posterior distribution and generate large-scale samples. The proposed method is applied to a cancer symptom assessment dataset and confirmed to enrich the quality of data in precision, accuracy, recall, F1 score, and propensity metric.

Submitted 3 March, 2021; originally announced March 2021.

Showing 1–3 of 3 results for author: Kober, M