-
Auto-Generating Weak Labels for Real & Synthetic Data to Improve Label-Scarce Medical Image Segmentation
Authors:
Tanvi Deshpande,
Eva Prakash,
Elsie Gyang Ross,
Curtis Langlotz,
Andrew Ng,
Jeya Maria Jose Valanarasu
Abstract:
The high cost of creating pixel-by-pixel gold-standard labels, limited expert availability, and presence of diverse tasks make it challenging to generate segmentation labels to train deep learning models for medical imaging tasks. In this work, we present a new approach to overcome the hurdle of costly medical image labeling by leveraging foundation models like Segment Anything Model (SAM) and its medical counterpart, MedSAM. Our pipeline can generate weak labels for any unlabeled medical image and subsequently use them to augment label-scarce datasets. We do this by leveraging a model trained on a few gold-standard labels and using it to intelligently prompt MedSAM for weak label generation. This automation eliminates the manual prompting step in MedSAM, creating a streamlined process for generating labels for both real and synthetic images, regardless of quantity. We conduct experiments in label-scarce settings for multiple tasks across modalities spanning ultrasound, dermatology, and X-ray imaging to demonstrate the usefulness of our pipeline. The code is available at https://github.com/stanfordmlgroup/Auto-Generate-WLs/.
Submitted 25 April, 2024;
originally announced April 2024.
-
MaMaDroid: Detecting Android Malware by Building Markov Chains of Behavioral Models (Extended Version)
Authors:
Lucky Onwuzurike,
Enrico Mariconti,
Panagiotis Andriotis,
Emiliano De Cristofaro,
Gordon Ross,
Gianluca Stringhini
Abstract:
As Android has become increasingly popular, so has malware targeting it, thus pushing the research community to propose different detection techniques. However, the constant evolution of the Android ecosystem, and of malware itself, makes it hard to design robust tools that can operate for long periods of time without the need for modifications or costly re-training. Aiming to address this issue, we set out to detect malware from a behavioral point of view, modeled as the sequence of abstracted API calls. We introduce MaMaDroid, a static-analysis-based system that abstracts the API calls performed by an app to their class, package, or family, and builds a model from their sequences obtained from the call graph of an app as Markov chains. This ensures that the model is more resilient to API changes and the feature set is of manageable size. We evaluate MaMaDroid using a dataset of 8.5K benign and 35.5K malicious apps collected over a period of six years, showing that it effectively detects malware (with up to 0.99 F-measure) and keeps its detection capabilities for long periods of time (up to 0.87 F-measure two years after training). We also show that MaMaDroid remarkably outperforms DroidAPIMiner, a state-of-the-art detection system that relies on the frequency of (raw) API calls. Aiming to assess whether MaMaDroid's effectiveness mainly stems from the API abstraction or from the sequence modeling, we also evaluate a variant of it that uses the frequency (instead of sequences) of abstracted API calls. We find that it is not as accurate, failing to capture maliciousness when trained on malware samples that include API calls that are equally or more frequently used by benign apps.
Submitted 2 March, 2019; v1 submitted 20 November, 2017;
originally announced November 2017.
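The abstract above describes abstracting API calls and modeling their sequences as a Markov chain. A minimal sketch of that idea follows, not the authors' implementation: the family list in `KNOWN_FAMILIES`, the rule of taking the first dotted component as the family, and the helper names `family_of` and `markov_features` are all assumptions made for illustration. The flattened transition-probability matrix is what would be handed to a classifier.

```python
from collections import Counter

# Hypothetical, deliberately tiny family list; the real system's list is not
# given in the abstract.
KNOWN_FAMILIES = {"android", "java", "javax"}


def family_of(api_call):
    """Abstract a fully qualified call to a coarse family (toy rule)."""
    head = api_call.split(".", 1)[0]
    return head if head in KNOWN_FAMILIES else "self-defined"


def markov_features(sequences, states):
    """Return row-normalized transition probabilities, flattened row by row."""
    counts = Counter()
    for seq in sequences:
        abstracted = [family_of(call) for call in seq]
        # Count transitions between consecutive abstracted calls.
        for src, dst in zip(abstracted, abstracted[1:]):
            counts[(src, dst)] += 1

    row_totals = Counter()
    for (src, _dst), n in counts.items():
        row_totals[src] += n

    # States with no outgoing transitions get an all-zero row.
    return [
        counts[(s, d)] / row_totals[s] if row_totals[s] else 0.0
        for s in states
        for d in states
    ]


STATES = ["android", "java", "self-defined"]
example = [[
    "java.lang.String.length",
    "android.util.Log.d",
    "android.telephony.SmsManager.sendTextMessage",
]]
features = markov_features(example, STATES)
```

Because the feature vector has one entry per pair of states, its length depends only on the number of abstract states, not on how many distinct raw API calls exist, which is the property the abstract credits for the manageable feature-set size.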
-
Centralities in Simplicial Complexes
Authors:
Ernesto Estrada,
Grant Ross
Abstract:
Complex networks can be used to represent complex systems which originate in the real world. Here we study a transformation of these complex networks into simplicial complexes, where cliques represent the simplices of the complex. We extend the concept of node centrality to that of simplicial centrality and study several mathematical properties of degree, closeness, betweenness, eigenvector, Katz, and subgraph centrality for simplicial complexes. We study the degree distributions of these centralities at the different levels. We also compare and describe the differences between the centralities at the different levels. Using these centralities we study a method for detecting essential proteins in PPI networks of cells and explain the varying abilities of the centrality measures at the different levels in identifying these essential proteins. The paper is written in a self-contained way, such that it can be used by practitioners of network theory as a basis for further developments.
Submitted 1 September, 2017; v1 submitted 10 March, 2017;
originally announced March 2017.
-
MaMaDroid: Detecting Android Malware by Building Markov Chains of Behavioral Models
Authors:
Enrico Mariconti,
Lucky Onwuzurike,
Panagiotis Andriotis,
Emiliano De Cristofaro,
Gordon Ross,
Gianluca Stringhini
Abstract:
The rise in popularity of the Android platform has resulted in an explosion of malware threats targeting it. As both Android malware and the operating system itself constantly evolve, it is very challenging to design robust malware mitigation techniques that can operate for long periods of time without the need for modifications or costly re-training. In this paper, we present MaMaDroid, an Android malware detection system that relies on app behavior. MaMaDroid builds a behavioral model, in the form of a Markov chain, from the sequence of abstracted API calls performed by an app, and uses it to extract features and perform classification. By abstracting calls to their packages or families, MaMaDroid maintains resilience to API changes and keeps the feature set size manageable. We evaluate its accuracy on a dataset of 8.5K benign and 35.5K malicious apps collected over a period of six years, showing that it not only effectively detects malware (with up to 99% F-measure), but also that the model built by the system keeps its detection capabilities for long periods of time (on average, 86% and 75% F-measure, respectively, one and two years after training). Finally, we compare against DroidAPIMiner, a state-of-the-art system that relies on the frequency of API calls performed by apps, showing that MaMaDroid significantly outperforms it.
Submitted 20 November, 2017; v1 submitted 13 December, 2016;
originally announced December 2016.
-
Privacy-Friendly Mobility Analytics using Aggregate Location Data
Authors:
Apostolos Pyrgelis,
Emiliano De Cristofaro,
Gordon Ross
Abstract:
Location data can be extremely useful to study commuting patterns and disruptions, as well as to predict real-time traffic volumes. At the same time, however, the fine-grained collection of user locations raises serious privacy concerns, as this can reveal sensitive information about the users, such as lifestyle, political and religious inclinations, or even identities. In this paper, we study the feasibility of crowd-sourced mobility analytics over aggregate location information: users periodically report their location, using a privacy-preserving aggregation protocol, so that the server can only recover aggregates -- i.e., how many, but not which, users are in a region at a given time. We experiment with real-world mobility datasets obtained from the Transport For London authority and the San Francisco Cabs network, and present a novel methodology based on time series modeling that is geared to forecast traffic volumes in regions of interest and to detect mobility anomalies in them. In the presence of anomalies, we also make enhanced traffic volume predictions by feeding our model with additional information from correlated regions. Finally, we present and evaluate a mobile app prototype, called Mobility Data Donors (MDD), in terms of computation, communication, and energy overhead, demonstrating the real-world deployability of our techniques.
Submitted 9 October, 2016; v1 submitted 21 September, 2016;
originally announced September 2016.
-
Understanding the Heavy Tailed Dynamics in Human Behavior
Authors:
Gordon J Ross,
Tim Jones
Abstract:
The recent availability of electronic datasets containing large volumes of communication data has made it possible to study human behavior on a larger scale than ever before. From this, it has been discovered that across a diverse range of data sets, the inter-event times between consecutive communication events obey heavy-tailed power-law dynamics. Explaining this has proved controversial, and two distinct hypotheses have emerged. The first holds that these power laws are fundamental, and arise from mechanisms such as priority queuing that humans use to schedule tasks. The second holds that they are a statistical artifact which only occurs in aggregated data when features such as circadian rhythms and burstiness are ignored. We use a large social media data set to test these hypotheses, and find that although models that incorporate circadian rhythms and burstiness do explain part of the observed heavy tails, there is residual unexplained heavy-tail behavior which suggests a more fundamental cause. Based on this, we develop a new quantitative model of human behavior which improves on existing approaches, and gives insight into the mechanisms underlying human interactions.
Submitted 6 May, 2015;
originally announced May 2015.
-
Exponentially Weighted Moving Average Charts for Detecting Concept Drift
Authors:
Gordon J. Ross,
Niall M. Adams,
Dimitris K. Tasoulis,
David J. Hand
Abstract:
Classifying streaming data requires the development of methods which are computationally efficient and able to cope with changes in the underlying distribution of the stream, a phenomenon known in the literature as concept drift. We propose a new method for detecting concept drift which uses an Exponentially Weighted Moving Average (EWMA) chart to monitor the misclassification rate of a streaming classifier. Our approach is modular and can hence be run in parallel with any underlying classifier to provide an additional layer of concept drift detection. Moreover, our method is computationally efficient with overhead O(1) and works in a fully online manner with no need to store data points in memory. Unlike many existing approaches to concept drift detection, our method allows the rate of false positive detections to be controlled and kept constant over time.
Submitted 25 December, 2012;
originally announced December 2012.
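To make the idea of monitoring a misclassification rate with an EWMA chart concrete, here is a minimal sketch under stated assumptions. It is not the authors' method: the class name `EwmaDriftDetector`, the smoothing weight `lam`, the fixed `control_limit`, and the `warmup` period are illustrative choices. In particular, a fixed control limit does not reproduce the constant false-positive rate claimed in the abstract. The sketch does keep O(1) state per update and stores no data points.

```python
import math


class EwmaDriftDetector:
    """Toy EWMA chart over a stream of 0/1 misclassification indicators."""

    def __init__(self, lam=0.2, control_limit=3.0, warmup=30):
        self.lam = lam                      # EWMA smoothing weight (assumed)
        self.control_limit = control_limit  # fixed multiplier (assumed)
        self.warmup = warmup                # steps before flags are allowed
        self.t = 0
        self.p_hat = 0.0                    # running mean of the error rate
        self.z = 0.0                        # EWMA statistic

    def update(self, error):
        """Feed one indicator (1 = misclassified); return True if drift is flagged."""
        self.t += 1
        # Running estimate of the error rate over the whole stream so far.
        self.p_hat += (error - self.p_hat) / self.t
        # EWMA recursion: recent errors weigh more than old ones.
        self.z = (1.0 - self.lam) * self.z + self.lam * error
        # Standard deviation of the EWMA of Bernoulli(p_hat) observations.
        var_factor = (self.lam / (2.0 - self.lam)) * (
            1.0 - (1.0 - self.lam) ** (2 * self.t)
        )
        sigma_z = math.sqrt(max(0.0, self.p_hat * (1.0 - self.p_hat) * var_factor))
        threshold = self.p_hat + self.control_limit * sigma_z
        return self.t > self.warmup and self.z > threshold


def first_drift(errors, **kwargs):
    """Return the 1-based index of the first flagged step, or None."""
    detector = EwmaDriftDetector(**kwargs)
    for step, error in enumerate(errors, start=1):
        if detector.update(error):
            return step
    return None


# A stream with a steady 10% error rate, followed by a classifier that
# suddenly misclassifies everything.
stable = [1 if t % 10 == 0 else 0 for t in range(1, 201)]
drifted = stable + [1] * 50
```

Calling `first_drift(stable)` and `first_drift(drifted)` shows the intended behavior: the detector stays quiet on the steady stream and raises a flag shortly after the error rate jumps. Because the detector only consumes the error indicator, it can sit alongside any classifier, which matches the modularity described in the abstract.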