Search | arXiv e-print repository

Robust Decentralized Learning with Local Updates and Gradient Tracking

Authors: Sajjad Ghiasvand, Amirhossein Reisizadeh, Mahnoosh Alizadeh, Ramtin Pedarsani

Abstract: As distributed learning applications such as Federated Learning, the Internet of Things (IoT), and Edge Computing grow, it is critical to address the shortcomings of such technologies from a theoretical perspective. As an abstraction, we consider decentralized learning over a network of communicating clients or nodes and tackle two major challenges: data heterogeneity and adversarial robustness. We propose a decentralized minimax optimization method that employs two important modules: local updates and gradient tracking. Minimax optimization is the key tool to enable adversarial training for ensuring robustness. Having local updates is essential in Federated Learning (FL) applications to mitigate the communication bottleneck, and utilizing gradient tracking is essential to proving convergence in the case of data heterogeneity. We analyze the performance of the proposed algorithm, Dec-FedTrack, in the case of nonconvex-strongly concave minimax optimization, and prove that it converges to a stationary point. We also conduct numerical experiments to support our theoretical findings.

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2402.05114 [pdf, ps, other]

A Light-weight and Unsupervised Method for Near Real-time Behavioral Analysis using Operational Data Measurement

Authors: Tom Richard Vargis, Siavash Ghiasvand

Abstract: Monitoring the status of large computing systems is essential to identify unexpected behavior and improve their performance and uptime. However, due to the large-scale and distributed design of such computing systems as well as a large number of monitoring parameters, automated monitoring methods should be applied. Such automatic monitoring methods should also have the ability to adapt themselves to the continuous changes in the computing system. In addition, they should be able to identify behavioral anomalies in useful time, to perform appropriate reactions. This work proposes a general lightweight and unsupervised method for near real-time anomaly detection using operational data measurement on large computing systems. The proposed model requires as little as 4 hours of data and 50 epochs for each training process to accurately resemble the behavioral pattern of computing systems.

Submitted 10 January, 2024; originally announced February 2024.

arXiv:2401.05049 [pdf, ps, other]

Content-Aware Depth-Adaptive Image Restoration

Authors: Tom Richard Vargis, Siavash Ghiasvand

Abstract: This work prioritizes building a modular pipeline that utilizes existing models to systematically restore images, rather than creating new restoration models from scratch. Restoration is carried out at an object-specific level, with each object regenerated using its corresponding class label information. The approach stands out by providing complete user control over the entire restoration process. Users can select models for specialized restoration steps, customize the sequence of steps to meet their needs, and refine the resulting regenerated image with depth awareness. The research provides two distinct pathways for implementing image regeneration, allowing for a comparison of their respective strengths and limitations. The most compelling aspect of this versatile system is its adaptability. This adaptability enables users to target particular object categories, including medical images, by providing models that are trained on those object classes.

Submitted 10 January, 2024; originally announced January 2024.

arXiv:2212.01101 [pdf, other]

Assessing Anonymized System Logs Usefulness for Behavioral Analysis in RNN Models

Authors: Tom Richard Vargis, Siavash Ghiasvand

Abstract: System logs are a common source of monitoring data for analyzing computing systems' behavior. Due to the complexity of modern computing systems and the large size of collected monitoring data, automated analysis mechanisms are required. Numerous machine learning and deep learning methods have been proposed to address this challenge. However, due to the existence of sensitive data in system logs, their analysis and storage raise serious privacy concerns. Anonymization methods could be used to clean the monitoring data before analysis. However, anonymized system logs, in general, do not provide adequate usefulness for the majority of behavioral analysis. Content-aware anonymization mechanisms such as PaRS preserve the correlation of system logs even after anonymization. This work evaluates the usefulness of anonymized system logs taken from the Taurus HPC cluster anonymized using PaRS, for behavioral analysis via recurrent neural network models.

Submitted 2 December, 2022; originally announced December 2022.

Comments: 12 pages, 7 main figures, 2 tables, Conference: International Workshop on Data-driven Resilience Research 2022

Journal ref: International Workshop on Data-driven Resilience Research 2022, https://2022.dataweek.de/d2r2-22/

arXiv:1906.04550 [pdf, other]

doi 10.1109/ispdc.2019.00024

Anomaly Detection in High Performance Computers: A Vicinity Perspective

Authors: Siavash Ghiasvand, Florina M. Ciorba

Abstract: In response to the demand for higher computational power, the number of computing nodes in high performance computers (HPC) increases rapidly. Exascale HPC systems are expected to arrive by 2020. With the drastic increase in the number of HPC system components, a sudden increase in the number of failures is expected, which, consequently, poses a threat to the continuous operation of the HPC systems. Detecting failures as early as possible and, ideally, predicting them, is a necessary step to avoid interruptions in HPC systems operation. Anomaly detection is a well-known general purpose approach for failure detection in computing systems. The majority of existing methods are designed for specific architectures, require adjustments on the computing systems hardware and software, need excessive information, or pose a threat to users' and systems' privacy. This work proposes a node failure detection mechanism based on a vicinity-based statistical anomaly detection approach using passively collected and anonymized system log entries. Application of the proposed approach on system logs collected over 8 months indicates an anomaly detection precision between 62% and 81%.

Submitted 11 June, 2019; originally announced June 2019.

Comments: 9 pages, Submitted to the 18th IEEE International Symposium on Parallel and Distributed Computing

MSC Class: 97R99

arXiv:1901.06918

Turning Privacy Constraints into Syslog Analysis Advantage

Authors: Siavash Ghiasvand, Florina M. Ciorba, Wolfgang E. Nagel

Abstract: The mean time between failures (MTBF) of HPC systems is rapidly decreasing, and current failure recovery mechanisms, e.g., checkpoint-restart, will no longer be able to recover the systems from failures. Early failure detection is a new class of failure recovery methods that can be beneficial for HPC systems with short MTBF. System logs (syslogs) are an invaluable source of information that gives us deep insight into system behavior and makes early failure detection possible. Besides normal information, syslogs contain sensitive data which might endanger users' privacy. Even though analyzing various syslogs is necessary for creating a general failure detection/prediction method, privacy concerns discourage system administrators from publishing syslogs. Herein, we ensure user privacy via de-identifying syslogs, and then turn the constraint applied for addressing users' privacy into an advantage for system behavior analysis. Results indicate a significant reduction in required storage space and 3 times shorter processing time.

Submitted 14 March, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

Comments: This document is mistakenly submitted to arXiv

Journal ref: 29th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2016)

arXiv:1805.01790 [pdf, other]

doi 10.1109/ISPDC2018.2018.00031

Assessing Data Usefulness for Failure Analysis in Anonymized System Logs

Authors: Siavash Ghiasvand, Florina M. Ciorba

Abstract: System logs are a valuable source of information for the analysis and understanding of systems behavior for the purpose of improving their performance. Such logs contain various types of information, including sensitive information. Information deemed sensitive can either directly be extracted from system log entries by correlation of several log entries, or can be inferred from the combination of the (non-sensitive) information contained within system logs with other logs and/or additional datasets. The analysis of system logs containing sensitive information compromises data privacy. Therefore, various anonymization techniques, such as generalization and suppression have been employed, over the years, by data and computing centers to protect the privacy of their users, their data, and the system as a whole. Privacy-preserving data resulting from anonymization via generalization and suppression may lead to significantly decreased data usefulness, thus, hindering the intended analysis for understanding the system behavior. Maintaining a balance between data usefulness and privacy preservation, therefore, remains an open and important challenge. Irreversible encoding of system logs using collision-resistant hashing algorithms, such as SHAKE-128, is a novel approach previously introduced by the authors to mitigate data privacy concerns. The present work describes a study of the applicability of the encoding approach from earlier work on the system logs of a production high performance computing system. Moreover, a metric is introduced to assess the data usefulness of the anonymized system logs to detect and identify the failures encountered in the system.

Submitted 4 May, 2018; originally announced May 2018.

Comments: 11 pages, 3 figures, submitted to 17th IEEE International Symposium on Parallel and Distributed Computing

arXiv:1706.04345 [pdf, ps, other]

Towards Adaptive Resilience in High Performance Computing

Authors: Siavash Ghiasvand, Florina M. Ciorba

Abstract: Failure rates in high performance computers rapidly increase due to the growth in system size and complexity. Hence, failures became the norm rather than the exception. Different approaches on high performance computing (HPC) systems have been introduced, to prevent failures (e.g., redundancy) or at least minimize their impacts (e.g., checkpoint and restart). In most cases, when these approaches are employed to increase the resilience of certain parts of a system, energy consumption rapidly increases, or performance significantly degrades. To address this challenge, we propose on-demand resilience as an approach to achieve adaptive resilience in HPC systems. In this work, the HPC system is considered in its entirety and resilience mechanisms such as checkpointing, isolation, and migration, are activated on-demand. Using the proposed approach, the unavoidable increase in total energy consumption and system performance degradation is decreased compared to the typical checkpoint/restart and redundant resilience mechanisms. Our work aims to mitigate a large number of failures occurring at various layers in the system, to prevent their propagation, and to minimize their impact, all of this in an energy-saving manner. In the case of failures that are estimated to occur but cannot be mitigated using the proposed on-demand resilience approach, the system administrators will be notified in view of performing further investigations into the causes of these failures and their impacts.

Submitted 14 June, 2017; originally announced June 2017.

Comments: 2 pages, to be published in Proceedings of the Work in Progress Session held in connection with the 25th EUROMICRO International Conference on Parallel, Distributed and Network-based Processing, PDP 2017

ACM Class: C.1.4; C.2.4; C.4

arXiv:1706.04337 [pdf, other]

doi 10.1007/978-3-030-03405-4_11

Anonymization of System Logs for Privacy and Storage Benefits

Authors: Siavash Ghiasvand, Florina M. Ciorba

Abstract: System logs constitute valuable information for analysis and diagnosis of system behavior. The size of parallel computing systems and the number of their components steadily increase. The volume of logs generated by the system is in proportion to this increase. Hence, long-term collection and storage of system logs is challenging. The analysis of system logs requires advanced text processing techniques. For very large volumes of logs, the analysis is highly time-consuming and requires a high level of expertise. For many parallel computing centers, outsourcing the analysis of system logs to third parties is the only affordable option. The existence of sensitive data within system log entries obstructs, however, the transmission of system logs to third parties. Moreover, the analytical tools for processing system logs and the solutions provided by such tools are highly system specific. Achieving a more general solution is only possible through access to and analysis of the system logs of multiple computing systems. The privacy concerns impede, however, the sharing of system logs across institutions as well as in the public domain. This work proposes a new method for the anonymization of the information within system logs that employs de-identification and encoding to provide sharable system logs, with the highest possible data quality and of reduced size. The results presented in this work indicate that apart from eliminating the sensitive data within system logs and converting them into shareable data, the proposed anonymization method provides 25% performance improvement in post-processing of the anonymized system logs, and more than 50% reduction in their required storage space.

Submitted 14 June, 2017; originally announced June 2017.

Comments: 8 pages, 5 figures, for demonstration see https://www.ghiasvand.net/u/hpcmaspa17

ACM Class: K.4.1; G.3; H.3.4; H.3.5

Showing 1–9 of 9 results for author: Ghiasvand, S