Search | arXiv e-print repository

GraphWeaver: Billion-Scale Cybersecurity Incident Correlation

Abstract: In the dynamic landscape of large enterprise cybersecurity, accurately and efficiently correlating billions of security alerts into comprehensive incidents is a substantial challenge. Traditional correlation techniques often struggle with maintenance, scaling, and adapting to emerging threats and novel sources of telemetry. We introduce GraphWeaver, an industry-scale framework that shifts the traditional incident correlation process to a data-optimized, geo-distributed, graph-based approach. GraphWeaver introduces a suite of innovations tailored to handle the complexities of correlating billions of shared evidence alerts across hundreds of thousands of enterprises. Key among these innovations are a geo-distributed database and PySpark analytics engine for large-scale data processing, a minimum spanning tree algorithm to optimize correlation storage, integration of security domain knowledge and threat intelligence, and a human-in-the-loop feedback system to continuously refine key correlation processes and parameters. GraphWeaver is integrated into the Microsoft Defender XDR product and deployed worldwide, handling billions of correlations with a 99% accuracy rate, as confirmed by customer feedback and extensive investigations by security experts. This integration has not only maintained high correlation accuracy but also reduced traditional correlation storage requirements by 7.4x.
We provide an in-depth overview of the key design and operational features of GraphWeaver, setting a precedent as the first cybersecurity company to openly discuss these critical capabilities at this level of depth.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2404.15367 [pdf, other]

Leveraging Visibility Graphs for Enhanced Arrhythmia Classification with Graph Convolutional Networks

Authors: Rafael F. Oliveira, Gladston J. P. Moreira, Vander L. S. Freitas, Eduardo J. S. Luz

Abstract: Arrhythmias, detectable via electrocardiograms (ECGs), pose significant health risks, emphasizing the need for robust automated identification techniques. Although traditional deep learning methods have shown potential, recent advances in graph-based strategies are aimed at enhancing arrhythmia detection performance. However, effectively representing ECG signals as graphs remains a challenge. This study explores graph representations of ECG signals using Visibility Graph (VG) and Vector Visibility Graph (VVG), coupled with Graph Convolutional Networks (GCNs) for arrhythmia classification. Through experiments on the MIT-BIH dataset, we investigated various GCN architectures and preprocessing parameters. The results reveal that GCNs, when integrated with VG and VVG for signal graph mapping, can classify arrhythmias without the need for preprocessing or noise removal from ECG signals. While both VG and VVG methods show promise, VG is notably more efficient. The proposed approach was competitive compared to baseline methods, although classifying the S class remains challenging, especially under the inter-patient paradigm. Computational complexity, particularly with the VVG method, required data balancing and sophisticated implementation strategies. The source code is publicly available for further research and development at https://github.com/raffoliveira/VG_for_arrhythmia_classification_with_GCN.

Submitted 19 April, 2024; originally announced April 2024.
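
The visibility-graph mapping named in the abstract above can be illustrated with a minimal sketch. It assumes the standard natural visibility criterion (two samples are linked when every sample between them lies strictly below the straight line joining them); the function name and the brute-force loop are illustrative choices and are not taken from the authors' repository.

```python
from typing import List, Set, Tuple


def natural_visibility_graph(series: List[float]) -> Set[Tuple[int, int]]:
    """Return the undirected edge set of the natural visibility graph.

    Nodes are sample indices. This is a brute-force sketch, not an
    optimized implementation.
    """
    n = len(series)
    edges: Set[Tuple[int, int]] = set()
    for a in range(n - 1):
        for b in range(a + 1, n):
            ya, yb = series[a], series[b]
            # a and b see each other if every intermediate sample lies
            # strictly below the line joining (a, ya) and (b, yb).
            visible = all(
                series[c] < yb + (ya - yb) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.add((a, b))
    return edges
```

Adjacent samples are always connected, so a signal of n samples yields a connected graph with at least n - 1 edges; the resulting edge list is the kind of structure a graph neural network can take as input.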

arXiv:2305.04675 [pdf, other]

Predicting nuclear masses with product-unit networks

Authors: Babette Dellen, Uwe Jaekel, Paulo S. A. Freitas, John W. Clark

Abstract: Accurate estimation of nuclear masses and their prediction beyond the experimentally explored domains of the nuclear landscape are crucial to an understanding of the fundamental origin of nuclear properties and to many applications of nuclear science, most notably in quantifying the $r$-process of stellar nucleosynthesis. Neural networks have been applied with some success to the prediction of nuclear masses, but they are known to have shortcomings in application to extrapolation tasks. In this work, we propose and explore a novel type of neural network for mass prediction in which the usual neuron-like processing units are replaced by complex-valued product units that permit multiplicative couplings of inputs to be learned from the input data. This generalized network model is tested on both interpolation and extrapolation data sets drawn from the Atomic Mass Evaluation. Its performance is compared with that of several neural-network architectures, substantiating its suitability for nuclear mass prediction. Additionally, a prediction-uncertainty measure for such complex-valued networks is proposed that serves to identify regions of expected low prediction error.

Submitted 8 May, 2023; originally announced May 2023.

arXiv:2301.09860 [pdf, other]

A predictive physics-aware hybrid reduced order model for reacting flows

Authors: Adrián Corrochano, Rodolfo S. M. Freitas, Alessandro Parente, Soledad Le Clainche

Abstract: In this work, a new hybrid predictive Reduced Order Model (ROM) is proposed to solve reacting flow problems. This algorithm is based on a dimensionality reduction using Proper Orthogonal Decomposition (POD) combined with deep learning architectures. The number of degrees of freedom is reduced from thousands of temporal points to a few POD modes with their corresponding temporal coefficients. Two different deep learning architectures have been tested to predict the temporal coefficients, based on recurrent (RNN) and convolutional (CNN) neural networks. From each architecture, different models have been created to understand the behavior of each parameter of the neural network. Results show that these architectures are able to predict the temporal coefficients of the POD modes, as well as the whole snapshots. The RNN shows lower prediction error for all the variables analyzed. The model was also found capable of predicting more complex simulations, showing transfer learning capabilities.

Submitted 24 January, 2023; originally announced January 2023.

arXiv:2201.05503 [pdf, other]

Global-threshold and backbone high-resolution weather radar networks are significantly complementary in a watershed

Authors: Aurelienne A. S. Jorge, Iuri da Silva Diniz, Vander L. S. Freitas, Izabelly C. Costa, Leonardo B. L. Santos

Abstract: There are several criteria for building up networks from time series related to different points in geographical space. The most used criterion is the Global-Threshold (GT). Using a weather radar dataset, this paper shows that the Backbone (BB) - a local-threshold criterion - generates networks whose geographical configuration is complementary to the GT networks. We compare the results for two well-known similarity measures: the Pearson Correlation (PC) coefficient and the Mutual Information (MI). The extracted backbone network (miBB), whose number of links is the same as the global MI (miGT), has the lowest average shortest path and presents a small-world effect. Regarding the global PC (pcGT) and its corresponding BB network (pcBB), there is a significant linear relationship: $R^2=0.77$ with a slope of $1.15$ (p-value $<10^{-7}$) for the pcGT network, and $R^2=0.68$ with a slope of $0.76$ (p-value $<10^{-7}$) for the pcBB network. In relation to the MI ones, only the miGT presents a high $R^2$ ($0.79$, with slope $=1.95$), whereas the miBB has an $R^2$ of only $0.20$ (slope $=0.24$). On the one hand, the GT networks present a sizeable connected component in the central area, close to the main rivers. On the other hand, the BB networks present a few meaningful connected components surrounding the watershed and dominating cells close to the outlet, with significant statistical differences in the altimetry distribution.

Submitted 13 January, 2022; originally announced January 2022.

Comments: 7 pages, 6 figures. To be submitted to Computers and Geosciences (Elsevier)
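
The Global-Threshold criterion described in the abstract above can be sketched as follows: link two locations whenever the similarity of their time series reaches a single, network-wide cutoff. This sketch assumes Pearson correlation as the similarity measure and non-constant series; the function names, the signed (not absolute) correlation, and the toy data are illustrative assumptions, and the Backbone (local-threshold) criterion is not shown.

```python
from math import sqrt
from typing import Dict, List, Set, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def global_threshold_network(
    series: Dict[str, List[float]], threshold: float
) -> Set[Tuple[str, str]]:
    """Link every pair of locations whose correlation is >= threshold."""
    names = sorted(series)
    edges: Set[Tuple[str, str]] = set()
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            # The same cutoff applies to every pair, hence "global".
            if pearson(series[u], series[v]) >= threshold:
                edges.add((u, v))
    return edges


# Hypothetical toy data: three grid cells with four observations each.
cells = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 3.0, 2.0, 1.0],
}
network = global_threshold_network(cells, threshold=0.9)
```

Because one cutoff applies everywhere, densely correlated regions tend to dominate the resulting network, which is consistent with the abstract's observation that the GT networks concentrate in a large central component.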

arXiv:2112.02000 [pdf, other]

doi 10.1145/3472752

A Survey on Concept Drift in Process Mining

Authors: Denise Maria Vecino Sato, Sheila Cristiana de Freitas, Jean Paul Barddal, Edson Emilio Scalabrin

Abstract: Concept drift in process mining (PM) is a challenge as classical methods assume processes are in a steady state, i.e., events share the same process version. We conducted a systematic literature review on the intersection of these areas, and thus, we review concept drift in process mining and bring forward a taxonomy of existing techniques for drift detection and online process mining for evolving environments. Existing works show that (i) PM still primarily focuses on offline analysis, and (ii) the assessment of concept drift techniques in processes is cumbersome due to the lack of a common evaluation protocol, datasets, and metrics.

Submitted 3 December, 2021; originally announced December 2021.

Comments: 38 pages, ACM Computing Surveys

Journal ref: ACM Comput. Surv. 54, 9, Article 189 (December 2022)

arXiv:2110.09360 [pdf, other]

Prediction of liquid fuel properties using machine learning models with Gaussian processes and probabilistic conditional generative learning

Authors: Rodolfo S. M. Freitas, Ágatha P. F. Lima, Cheng Chen, Fernando A. Rochinha, Daniel Mira, Xi Jiang

Abstract: Accurate determination of fuel properties of complex mixtures over a wide range of pressure and temperature conditions is essential to utilizing alternative fuels. The present work aims to construct cheap-to-compute machine learning (ML) models to act as closure equations for predicting the physical properties of alternative fuels. Those models can be trained using the database from MD simulations and/or experimental measurements in a data-fusion-fidelity approach. Here, Gaussian Process (GP) and probabilistic generative models are adopted. GP is a popular non-parametric Bayesian approach to build surrogate models, mainly due to its capacity to handle the aleatory and epistemic uncertainties. Generative models have shown that deep neural networks can be employed with the same intent. In this work, ML analysis is focused on a particular property, the fuel density, but it can also be extended to other physicochemical properties. This study explores the versatility of the ML models to handle multi-fidelity data. The results show that ML models can accurately predict the fuel properties over a wide range of pressure and temperature conditions.

Submitted 18 October, 2021; originally announced October 2021.

Comments: 22 pages, 13 figures

arXiv:2105.00419 [pdf, other]

doi 10.1109/TKDE.2022.3163672

Graph Vulnerability and Robustness: A Survey

Authors: Scott Freitas, Diyi Yang, Srijan Kumar, Hanghang Tong, Duen Horng Chau

Abstract: The study of network robustness is a critical tool in the characterization and sense making of complex interconnected systems such as infrastructure, communication and social networks. While significant research has been conducted in all of these areas, gaps in the surveying literature still exist. Answers to key questions are currently scattered across multiple scientific fields and numerous papers. In this survey, we distill key findings across numerous domains and provide researchers crucial access to important information by (1) summarizing and comparing recent and classical graph robustness measures; (2) exploring which robustness measures are most applicable to different categories of networks (e.g., social, infrastructure); (3) reviewing common network attack strategies, and summarizing which attacks are most effective across different network topologies; and (4) providing extensive discussion on selecting defense techniques to mitigate attacks across a variety of networks. This survey guides researchers and practitioners in navigating the expansive field of network robustness, while summarizing answers to key questions. We conclude by highlighting current research directions and open problems.

Submitted 29 March, 2022; v1 submitted 2 May, 2021; originally announced May 2021.

Comments: Accepted into Transactions on Knowledge and Data Engineering (TKDE) 2022

arXiv:2103.16435 [pdf, other]

EnergyVis: Interactively Tracking and Exploring Energy Consumption for ML Models

Authors: Omar Shaikh, Jon Saad-Falcon, Austin P Wright, Nilaksh Das, Scott Freitas, Omar Isaac Asensio, Duen Horng Chau

Abstract: The advent of larger machine learning (ML) models has improved state-of-the-art (SOTA) performance in various modeling tasks, ranging from computer vision to natural language. As ML models continue increasing in size, so do their respective energy consumption and computational requirements. However, the methods for tracking, reporting, and comparing energy consumption remain limited. We present EnergyVis, an interactive energy consumption tracker for ML models. Consisting of multiple coordinated views, EnergyVis enables researchers to interactively track, visualize and compare model energy consumption across key energy consumption and carbon footprint metrics (kWh and CO2), helping users explore alternative deployment locations and hardware that may reduce carbon footprints. EnergyVis aims to raise awareness concerning computational sustainability by interactively highlighting excessive energy usage during model training, and by providing alternative training options to reduce energy usage.

Submitted 30 March, 2021; originally announced March 2021.

Comments: 7 pages, 5 figures; CHI 2021 Extended Abstracts

arXiv:2102.01072 [pdf, other]

MalNet: A Large-Scale Image Database of Malicious Software

Authors: Scott Freitas, Rahul Duggal, Duen Horng Chau

Abstract: Computer vision is playing an increasingly important role in automated malware detection with the rise of the image-based binary representation. These binary images are fast to generate, require no feature engineering, and are resilient to popular obfuscation methods. Significant research has been conducted in this area; however, it has been restricted to small-scale or private datasets that only a few industry labs and research teams have access to. This lack of availability hinders examination of existing work, development of new research, and dissemination of ideas. We release MalNet-Image, the largest public cybersecurity image database, offering 24x more images and 70x more classes than existing databases (available at https://mal-net.org). MalNet-Image contains over 1.2 million malware images -- across 47 types and 696 families -- democratizing image-based malware capabilities by enabling researchers and practitioners to evaluate techniques that were previously reported in proprietary settings. We report the first million-scale malware detection results on binary images. MalNet-Image unlocks new and unique opportunities to advance the frontiers of machine learning, enabling new research directions into vision-based cyber defenses, multi-class imbalanced classification, and interpretable security.

Submitted 3 September, 2022; v1 submitted 30 January, 2021; originally announced February 2021.

Comments: Accepted at CIKM'22 as a short/resource track paper

arXiv:2011.07682 [pdf, other]

A Large-Scale Database for Graph Representation Learning

Authors: Scott Freitas, Yuxiao Dong, Joshua Neil, Duen Horng Chau

Abstract: With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify 3 critical components important for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of malicious software function call graphs. MalNet contains over 1.2 million graphs, averaging over 15k nodes and 35k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 39x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance, along with the evaluation of state-of-the-art machine learning and graph neural network techniques. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability and the impact of class hardness. The database is publicly available at www.mal-net.org.

Submitted 6 November, 2021; v1 submitted 15 November, 2020; originally announced November 2020.

Comments: Published in NeurIPS Datasets and Benchmarks Track, 2021

arXiv:2008.11844 [pdf, other]

doi 10.1145/3340531.3412877

Argo Lite: Open-Source Interactive Graph Exploration and Visualization in Browsers

Authors: Siwei Li, Zhiyan Zhou, Anish Upadhayay, Omar Shaikh, Scott Freitas, Haekyu Park, Zijie J. Wang, Susanta Routray, Matthew Hull, Duen Horng Chau

Abstract: Graph data have become increasingly common. Visualizing them helps people better understand relations among entities. Unfortunately, existing graph visualization tools are primarily designed for single-person desktop use, offering limited support for interactive web-based exploration and online collaborative analysis. To address these issues, we have developed Argo Lite, a new in-browser interactive graph exploration and visualization tool. Argo Lite enables users to publish and share interactive graph visualizations as URLs and embedded web widgets. Users can explore graphs incrementally by adding more related nodes, such as highly cited papers cited by or citing a paper of interest in a citation network. Argo Lite works across devices and platforms, leveraging WebGL for high-performance rendering. Argo Lite has been used by over 1,000 students at Georgia Tech's Data and Visual Analytics class. Argo Lite may serve as a valuable open-source tool for advancing multiple CIKM research areas, from data presentation, to interfaces for information systems and more.

Submitted 26 August, 2020; originally announced August 2020.

Comments: CIKM'20 Resource Track (October 19-23, 2020), 6 pages, 6 figures

arXiv:2006.11979 [pdf, other]

ELF: An Early-Exiting Framework for Long-Tailed Classification

Authors: Rahul Duggal, Scott Freitas, Sunny Dhamnani, Duen Horng Chau, Jimeng Sun

Abstract: The natural world often follows a long-tailed data distribution where only a few classes account for most of the examples. This long-tail causes classifiers to overfit to the majority class. To mitigate this, prior solutions commonly adopt class rebalancing strategies such as data resampling and loss reshaping. However, by treating each example within a class equally, these methods fail to account for the important notion of example hardness, i.e., within each class some examples are easier to classify than others. To incorporate this notion of hardness into the learning process, we propose the EarLy-exiting Framework (ELF). During training, ELF learns to early-exit easy examples through auxiliary branches attached to a backbone network. This offers a dual benefit: (1) the neural network increasingly focuses on hard examples, since they contribute more to the overall network loss; and (2) it frees up additional model capacity to distinguish difficult examples. Experimental results on two large-scale datasets, ImageNet LT and iNaturalist'18, demonstrate that ELF can improve state-of-the-art accuracy by more than 3 percent. This comes with the additional benefit of reducing up to 20 percent of inference time FLOPS. ELF is complementary to prior work and can naturally integrate with a variety of existing methods to tackle the challenge of long-tailed distributions.

Submitted 13 September, 2020; v1 submitted 21 June, 2020; originally announced June 2020.

arXiv:2006.05648 [pdf, other]

Evaluating Graph Vulnerability and Robustness using TIGER

Authors: Scott Freitas, Diyi Yang, Srijan Kumar, Hanghang Tong, Duen Horng Chau

Abstract: Network robustness plays a crucial role in our understanding of complex interconnected systems such as transportation, communication, and computer networks. While significant research has been conducted in the area of network robustness, no comprehensive open-source toolbox currently exists to assist researchers and practitioners in this important topic. This lack of available tools hinders reproducibility and examination of existing work, development of new research, and dissemination of new ideas. We contribute TIGER, an open-sourced Python toolbox to address these challenges. TIGER contains 22 graph robustness measures with both original and fast approximate versions; 17 failure and attack strategies; 15 heuristic and optimization-based defense techniques; and 4 simulation tools. By democratizing the tools required to study network robustness, our goal is to assist researchers and practitioners in analyzing their own networks; and facilitate the development of new research in the field. TIGER has been integrated into the Nvidia Data Science Teaching Kit available to educators across the world; and Georgia Tech's Data and Visual Analytics class with over 1,000 students. TIGER is open sourced at: https://github.com/safreita1/TIGER

Submitted 15 August, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: Published at CIKM 2021 Resource Track

arXiv:2004.02648 [pdf, other]

Robustness analysis in an inter-cities mobility network: modeling municipal, state and federal initiatives as failures and attacks

Authors: Vander L. S. Freitas, Jeferson Feitosa, Catia S. N. Sepetauskas, Leonardo B. L. Santos

Abstract: Motivated by the challenge related to the COVID-19 epidemic and the search for optimal containment strategies, we present a robustness analysis of an inter-cities mobility complex network. We abstract municipal initiatives as nodes' failures and the federal actions as targeted attacks. The geo(graphs) approach is applied to visualize the geographical graph and produce maps of topological indexes, such as degree and vulnerability. A 2016 Brazilian dataset is considered as a case study, with more than five thousand cities and twenty-seven states. Based on the Network Robustness index, we show that the most efficient attack strategy shifts from a topological degree-based one, for the network of all cities, to a topological vulnerability-based one, for a network considering the Brazilian States as nodes. Moreover, our results reveal that individual municipalities' actions do not have a high impact on restraining mobility, since they tend to be localized and disconnected from the country scene as a whole. Conversely, the coordinated isolation of specific cities is key to detaching entire network areas and thus preventing a spreading process from prevailing.

Submitted 8 April, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: 16 pages, 10 figures

arXiv:2002.09576 [pdf, other]

UnMask: Adversarial Detection and Defense Through Robust Feature Alignment

Authors: Scott Freitas, Shang-Tse Chen, Zijie J. Wang, Duen Horng Chau

Abstract: Deep learning models are being integrated into a wide range of high-impact, security-critical systems, from self-driving cars to medical diagnosis. However, recent research has demonstrated that many of these deep learning architectures are vulnerable to adversarial attacks--highlighting the vital need for defensive techniques to detect and mitigate these attacks before they occur. To combat these adversarial attacks, we developed UnMask, an adversarial detection and defense framework based on robust feature alignment. The core idea behind UnMask is to protect these models by verifying that an image's predicted class ("bird") contains the expected robust features (e.g., beak, wings, eyes). For example, if an image is classified as "bird", but the extracted features are wheel, saddle and frame, the model may be under attack. UnMask detects such attacks and defends the model by rectifying the misclassification, re-classifying the image based on its robust features. Our extensive evaluation shows that UnMask (1) detects up to 96.75% of attacks, and (2) defends the model by correctly classifying up to 93% of adversarial images produced by the current strongest attack, Projected Gradient Descent, in the gray-box setting. UnMask provides significantly better protection than adversarial training across 8 attack vectors, averaging 31.18% higher accuracy. We open source the code repository and data with this paper: https://github.com/safreita1/unmask.

Submitted 14 November, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: Accepted into IEEE Big Data 2020

arXiv:2001.11363 [pdf, other]

doi 10.1145/3366423.3380241

REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild

Authors: Rahul Duggal, Scott Freitas, Cao Xiao, Duen Horng Chau, Jimeng Sun

Abstract: In recent years, significant attention has been devoted towards integrating deep learning technologies in the healthcare domain. However, to safely and practically deploy deep learning models for home health monitoring, two significant challenges must be addressed: the models should be (1) robust against noise; and (2) compact and energy-efficient. We propose REST, a new method that simultaneously tackles both issues via 1) adversarial training and controlling the Lipschitz constant of the neural network through spectral regularization while 2) enabling neural network compression through sparsity regularization. We demonstrate that REST produces highly-robust and efficient models that substantially outperform the original full-sized models in the presence of noise. For the sleep staging task over single-channel electroencephalogram (EEG), the REST model achieves a macro-F1 score of 0.67 vs. 0.39 achieved by a state-of-the-art model in the presence of Gaussian noise while obtaining 19x parameter reduction and 15x MFLOPS reduction on two large, real-world EEG datasets. By deploying these models to an Android application on a smartphone, we quantitatively observe that REST allows models to achieve up to 17x energy reduction and 9x faster inference. We open-source the code repository with this paper: https://github.com/duggalrahul/REST.

Submitted 29 January, 2020; originally announced January 2020.

Comments: Accepted to WWW 2020

arXiv:2001.11108 [pdf, other]

D2M: Dynamic Defense and Modeling of Adversarial Movement in Networks

Authors: Scott Freitas, Andrew Wicker, Duen Horng Chau, Joshua Neil

Abstract: Given a large enterprise network of devices and their authentication history (e.g., device logons), how can we quantify network vulnerability to lateral attack and identify at-risk devices? We systematically address these problems through D2M, the first framework that models lateral attacks on enterprise networks using multiple attack strategies developed with researchers, engineers, and threat hunters in the Microsoft Defender Advanced Threat Protection group. These strategies integrate real-world adversarial actions (e.g., privilege escalation) to generate attack paths: a series of compromised machines. Leveraging these attack paths and a novel Monte-Carlo method, we formulate network vulnerability as a probabilistic function of the network topology, distribution of access credentials and initial penetration point. To identify machines at risk to lateral attack, we propose a suite of five fast graph mining techniques, including a novel technique called AnomalyShield inspired by node immunization research. Using three real-world authentication graphs from Microsoft and Los Alamos National Laboratory (up to 223,399 authentications), we report the first experimental results on network vulnerability to lateral attack, demonstrating D2M's unique potential to empower IT admins to develop robust user access credential policies.

Submitted 29 January, 2020; originally announced January 2020.

Comments: Accepted to SDM 2020

arXiv:1902.03520 [pdf, other]

Swarm Debugging: the Collective Intelligence on Interactive Debugging

Authors: Fabio Petrillo, Yann-Gaël Guéhéneuc, Marcelo Pimenta, Carla Dal Sasso Freitas, Foutse Khomh

Abstract: One of the most important tasks in software maintenance is debugging. To start an interactive debugging session, developers usually set breakpoints in an integrated development environment and navigate through different paths in their debuggers. We started our work by asking what debugging information is useful to share among developers and studied two pieces of information: breakpoints (and their locations) and sessions (debugging paths). To answer our question, we introduce the Swarm Debugging concept to frame the sharing of debugging information, the Swarm Debugging Infrastructure (SDI) with which practitioners and researchers can collect and share data about developers' interactive debugging sessions, and the Swarm Debugging Global View (GV) to display debugging paths. Using the SDI, we conducted a large study with professional developers to understand how developers set breakpoints. Using the GV, we also analyzed professional developers in two studies and collected data about their debugging sessions. Our observations and the answers to our research questions suggest that sharing and visualizing debugging data can support debugging activities.

Submitted 9 February, 2019; originally announced February 2019.

arXiv:1811.01638 [pdf]

Identifying influential patents in citation networks using enhanced VoteRank centrality

Authors: João C. S. Freitas, Rafael Barbastefano, Diego Carvalho

Abstract: This study proposes the usage of a method called VoteRank, created by Zhang et al. (2016), to identify influential nodes on patent citation networks. In addition, it proposes enhanced VoteRank algorithms, extending the Zhang et al. work. These novel algorithms comprise a reduction in the voting ability of the nodes affected by a chosen spreader if the nodes are distant from the spreader. One method uses a reduction factor that is linear in the distance from the spreader, which we called VoteRank-LRed. The other method uses a reduction factor that is exponential in the distance from the spreader, which we called VoteRank-XRed. By applying the methods to a citation network, we were able to demonstrate that VoteRank-LRed chose influence spreaders more efficiently than the original VoteRank on the tested citation network.

Submitted 5 November, 2018; originally announced November 2018.

Comments: 10 pages, 3 figures

arXiv:1803.05084 [pdf, other]

Local Partition in Rich Graphs

Authors: Scott Freitas, Hanghang Tong, Nan Cao, Yinglong Xia

Abstract: Local graph partitioning is a key graph mining tool that allows researchers to identify small groups of interrelated nodes (e.g. people) and their connective edges (e.g. interactions). Because local graph partitioning is primarily focused on the network structure of the graph (vertices and edges), it often fails to consider the additional information contained in the attributes. In this paper we propose (i) a scalable algorithm to improve local graph partitioning by taking into account both the network structure of the graph and the attribute data and (ii) an application of the proposed local graph partitioning algorithm (AttriPart) to predict the evolution of local communities (LocalForecasting). Experimental results show that our proposed AttriPart algorithm finds up to 1.6x denser local partitions, while running approximately 43x faster than traditional local partitioning techniques (PageRank-Nibble). In addition, our LocalForecasting algorithm shows a significant improvement in the number of nodes and edges correctly predicted over baseline methods.

Submitted 13 March, 2018; originally announced March 2018.

Comments: Under KDD 2018 review

arXiv:1304.1903 [pdf, other]

doi 10.1140/epjst/e2012-01689-8

Towards a living earth simulator

Authors: M. Paolucci, D. Kossman, R. Conte, P. Lukowicz, P. Argyrakis, A. Blandford, G. Bonelli, S. Anderson, S. de Freitas, B. Edmonds, N. Gilbert, M. Gross, J. Kohlhammer, P. Koumoutsakos, A. Krause, B. -O. Linnér, P. Slusallek, O. Sorkine, R. W. Sumner, D. Helbing

Abstract: The Living Earth Simulator (LES) is one of the core components of the FuturICT architecture. It will work as a federation of methods, tools, techniques and facilities supporting all of the FuturICT simulation-related activities to allow and encourage interactive exploration and understanding of societal issues. Society-relevant problems will be targeted by leaning on approaches based on complex systems theories and data science in tight interaction with the other components of FuturICT. The LES will evaluate and provide answers to real-world questions by taking into account multiple scenarios. It will build on present approaches such as agent-based simulation and modeling, multiscale modelling, statistical inference, and data mining, moving beyond disciplinary borders to achieve a new perspective on complex social systems.

Submitted 6 April, 2013; originally announced April 2013.

Journal ref: Eur. Phys. J. Special Topics vol. 214, pp. 77-108 (2012)

Showing 1–22 of 22 results for author: Freitas, S