Search | arXiv e-print repository

Distributed Training of Large Graph Neural Networks with Variable Communication Rates

Authors: Juan Cervino, Md Asadullah Turja, Hesham Mostafa, Nageen Himayat, Alejandro Ribeiro

Abstract: Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements. Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs. However, as the graph cannot generally be decomposed into small non-interacting components, data communication between the training machines quickly limits training speeds. Compressing the communicated node activations by a fixed amount improves the training speeds, but lowers the accuracy of the trained GNN. In this paper, we introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model. Based on our theoretical analysis, we derive a variable compression method that converges to a solution equivalent to the full communication case, for all graph partitioning schemes. Our empirical results show that our method attains a comparable performance to the one obtained with full communication. We outperform full communication at any fixed compression ratio for any communication budget.

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2405.20445 [pdf, other]

GraphAny: A Foundation Model for Node Classification on Any Graph

Authors: Jianan Zhao, Hesham Mostafa, Mikhail Galkin, Michael Bronstein, Zhaocheng Zhu, Jian Tang

Abstract: Foundation models that can perform inference on any new task without requiring specific training have revolutionized machine learning in vision and language applications. However, applications involving graph-structured data remain a tough nut for foundation models, due to challenges in the unique feature and label spaces associated with each graph. Traditional graph ML models such as graph neural networks (GNNs) trained on graphs cannot perform inference on a new graph with feature and label spaces different from the training ones. Furthermore, existing models learn functions specific to the training graph and cannot generalize to new graphs. In this work, we tackle these two challenges with a new foundational architecture for inductive node classification named GraphAny. GraphAny models inference on a new graph as an analytical solution to a LinearGNN, thereby solving the first challenge. To solve the second challenge, we learn attention scores for each node to fuse the predictions of multiple LinearGNNs. Specifically, the attention module is carefully parameterized as a function of the entropy-normalized distance-features between multiple LinearGNN predictions to ensure generalization to new graphs. Empirically, GraphAny trained on the Wisconsin dataset with only 120 labeled nodes can effectively generalize to 30 new graphs with an average accuracy of 67.26% in an inductive manner, surpassing GCN and GAT trained in the supervised regime, as well as other inductive baselines.

Submitted 2 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: Preprint. Work in progress

arXiv:2405.05495 [pdf, other]

PARSAC: Fast, Human-quality Floorplanning for Modern SoCs with Complex Design Constraints

Authors: Hesham Mostafa, Uday Mallappa, Mikhail Galkin, Mariano Phielipp, Somdeb Majumdar

Abstract: The floorplanning of Systems-on-a-Chip (SoCs) and of chip sub-systems is a crucial step in the physical design flow as it determines the optimal shapes and locations of the blocks that make up the system. Simulated Annealing (SA) has been the method of choice for tackling classical floorplanning problems where the objective is to minimize wire-length and the total placement area. The goal in industry-relevant floorplanning problems, however, is not only to minimize area and wire-length, but to do that while respecting hard placement constraints that specify the general area and/or the specific locations for the placement of some blocks. We show that simply incorporating these constraints into the SA objective function leads to sub-optimal, and often illegal, solutions. We propose the Constraints-Aware Simulated Annealing (CA-SA) method and show that it strongly outperforms vanilla SA in floorplanning problems with hard placement constraints. We developed a new floorplanning tool on top of CA-SA: PARSAC (Parallel Simulated Annealing with Constraints). PARSAC is an efficient, easy-to-use, and massively parallel floorplanner. Unlike current SA-based or learning-based floorplanning tools that cannot effectively incorporate hard placement-constraints, PARSAC can quickly construct the Pareto-optimal legal solutions front for constrained floorplanning problems. PARSAC also outperforms traditional SA on legacy floorplanning benchmarks. PARSAC is available as an open-source repository for researchers to replicate and build on our result.

Submitted 8 May, 2024; originally announced May 2024.

Comments: 9 pages, 7 figures

arXiv:2405.05480 [pdf, other]

FloorSet -- a VLSI Floorplanning Dataset with Design Constraints of Real-World SoCs

Authors: Uday Mallappa, Hesham Mostafa, Mikhail Galkin, Mariano Phielipp, Somdeb Majumdar

Abstract: Floorplanning for systems-on-a-chip (SoCs) and their sub-systems is a crucial and non-trivial step of the physical design flow. It represents a difficult combinatorial optimization problem. A typical large scale SoC with 120 partitions generates a search-space of nearly 10E250. As novel machine learning (ML) approaches emerge to tackle such problems, there is a growing need for a modern benchmark that comprises a large training dataset and performance metrics that better reflect real-world constraints and objectives compared to existing benchmarks. To address this need, we present FloorSet -- two comprehensive datasets of synthetic fixed-outline floorplan layouts that reflect the distribution of real SoCs. Each dataset has 1M training samples and 100 test samples where each sample is a synthetic floor-plan. FloorSet-Prime comprises fully-abutted rectilinear partitions and near-optimal wire-length. A simplified dataset that reflects early design phases, FloorSet-Lite comprises rectangular partitions, with under 5 percent white-space and near-optimal wire-length. Both datasets define hard constraints seen in modern design flows such as shape constraints, edge-affinity, grouping constraints, and pre-placement constraints. FloorSet is intended to spur fundamental research on large-scale constrained optimization problems. Crucially, FloorSet alleviates the core issue of reproducibility in modern ML driven solutions to such problems. FloorSet is available as an open-source repository for the research community.

Submitted 27 June, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: 10 pages, 11 figures

arXiv:2311.17847 [pdf, other]

FastSample: Accelerating Distributed Graph Neural Network Training for Billion-Scale Graphs

Authors: Hesham Mostafa, Adam Grabowski, Md Asadullah Turja, Juan Cervino, Alejandro Ribeiro, Nageen Himayat

Abstract: Training Graph Neural Networks (GNNs) on a large monolithic graph presents unique challenges as the graph cannot fit within a single machine and it cannot be decomposed into smaller disconnected components. Distributed sampling-based training distributes the graph across multiple machines and trains the GNN on small parts of the graph that are randomly sampled every training iteration. We show that in a distributed environment, the sampling overhead is a significant component of the training time for large-scale graphs. We propose FastSample, which is composed of two synergistic techniques that greatly reduce the distributed sampling time: 1) a new graph partitioning method that eliminates most of the communication rounds in distributed sampling, and 2) a novel highly optimized sampling kernel that reduces memory movement during sampling. We test FastSample on large-scale graph benchmarks and show that FastSample speeds up distributed sampling-based GNN training by up to 2x with no loss in accuracy.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2310.04562 [pdf, other]

Towards Foundation Models for Knowledge Graph Reasoning

Authors: Mikhail Galkin, Xinyu Yuan, Hesham Mostafa, Jian Tang, Zhaocheng Zhu

Abstract: Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to the transferable representations such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap. The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies. In this work, we make a step towards such foundation models and present ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions. Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph. Conducting link prediction experiments on 57 different KGs, we find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par with or better than strong baselines trained on specific graphs. Fine-tuning further boosts the performance.

Submitted 9 April, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2302.13285 [pdf, other]

Ultra-Reliable Device-Centric Uplink Communications in Airborne Networks: A Spatiotemporal Analysis

Authors: Yasser Nabil, Hesham ElSawy, Suhail Al-Dharrab, Hussein Attia, Hassan Mostafa

Abstract: This paper proposes an ultra-reliable device-centric uplink (URDC-UL) communication scheme for airborne networks. In particular, base stations (BSs) are mounted on unmanned aerial vehicles (UAVs) that travel to schedule UL transmissions and collect data from devices. To attain an ultra-reliable unified device-centric performance, the UL connection is established when the UAV-BS is hovering at the nearest possible distance from the scheduled device. The performance of the proposed URDC-UL scheme is benchmarked against a stationary UAV-centric uplink (SUC-UL) scheme where the devices are scheduled to communicate to UAV-BSs that are continuously hovering at static locations. Utilizing stochastic geometry and queueing theory, novel spatiotemporal mathematical models are developed, which account for the UAV-BS spatial densities, mobility, altitude, antenna directivity, ground-to-air channel, and temporal traffic, among other factors. The results demonstrate the sensitivity of the URDC-UL scheme to the ratio between hovering and traveling time. In particular, the hovering to traveling time ratio should be carefully adjusted to maximize the harvested performance gains for the URDC-UL scheme in terms of link reliability, transmission rate, energy efficiency, and delay. Exploiting the URDC-UL scheme allows IoT devices to minimize transmission power while maintaining unified reliable transmission. This preserves the device's battery and addresses a critical IoT design challenge.

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2112.09828 [pdf, other]

Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

Authors: Shengyu Feng, Subarna Tripathi, Hesham Mostafa, Marcel Nassar, Somdeb Majumdar

Abstract: Dynamic scene graph generation from a video is challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. We hypothesize that capturing long-term temporal dependencies is the key to effective generation of dynamic scene graphs. We propose to learn the long-term dependencies in a video by capturing the object-level consistency and inter-object relationship dynamics over object-level long-term tracklets using transformers. Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. Our ablation studies validate the effectiveness of each component of the proposed approach. The source code is available at https://github.com/Shengyu-Feng/DSG-DETR.

Submitted 19 October, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: WACV 2023

arXiv:2111.06483 [pdf, other]

Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs

Authors: Hesham Mostafa

Abstract: We present the Sequential Aggregation and Rematerialization (SAR) scheme for distributed full-batch training of Graph Neural Networks (GNNs) on large graphs. Large-scale training of GNNs has recently been dominated by sampling-based methods and methods based on non-learnable message passing. SAR on the other hand is a distributed technique that can train any GNN type directly on an entire large graph. The key innovation in SAR is the distributed sequential rematerialization scheme which sequentially re-constructs then frees pieces of the prohibitively large GNN computational graph during the backward pass. This results in excellent memory scaling behavior where the memory consumption per worker goes down linearly with the number of workers, even for densely connected graphs. Using SAR, we report the largest applications of full-batch GNN training to date, and demonstrate large memory savings as the number of workers increases. We also present a general technique based on kernel fusion and attention-matrix rematerialization to optimize both the runtime and memory efficiency of attention-based models. We show that, coupled with SAR, our optimized attention kernels lead to significant speedups and memory savings in attention-based GNNs. We made the SAR GNN training library publicly available: https://github.com/IntelLabs/SAR.

Submitted 15 April, 2022; v1 submitted 11 November, 2021; originally announced November 2021.

arXiv:2111.06312 [pdf, other]

Implicit SVD for Graph Representation Learning

Authors: Sami Abu-El-Haija, Hesham Mostafa, Marcel Nassar, Valentino Crespi, Greg Ver Steeg, Aram Galstyan

Abstract: Recent improvements in the performance of state-of-the-art (SOTA) methods for Graph Representational Learning (GRL) have come at the cost of significant computational resource requirements for training, e.g., for calculating gradients via backprop over many data epochs. Meanwhile, Singular Value Decomposition (SVD) can find closed-form solutions to convex problems, using merely a handful of epochs. In this paper, we make GRL more computationally tractable for those with modest hardware. We design a framework that computes the SVD of implicitly defined matrices, and apply this framework to several GRL tasks. For each task, we derive a linear approximation of a SOTA model, where we design an (expensive-to-store) matrix $\mathbf{M}$ and train the model, in closed-form, via SVD of $\mathbf{M}$, without calculating entries of $\mathbf{M}$. By converging to a unique point in one step, and without calculating gradients, our models show competitive empirical test performance over various graphs such as article citation and biological interaction networks. More importantly, SVD can initialize a deeper model, that is architected to be non-linear almost everywhere, though behaves linearly when its parameters reside on a hyperplane, onto which SVD initializes. The deeper model can then be fine-tuned within only a few epochs. Overall, our procedure trains hundreds of times faster than state-of-the-art methods, while competing on empirical test performance. We open-source our implementation at: https://github.com/samihaija/isvd

Submitted 11 November, 2021; originally announced November 2021.

Journal ref: Advances in Neural Information Processing Systems (NeurIPS) 2021

arXiv:2109.03563 [pdf, other]

Data Aggregation in Synchronous Large-scale IoT Networks: Granularity, Reliability, and Delay Tradeoffs

Authors: Yasser Nabil, Hesham ElSawy, Suhail Al-Dharrab, Hassan Mostafa, Hussein Attia

Abstract: This paper studies data aggregation in large-scale regularly deployed Internet of Things (IoT) networks, where devices generate synchronized time-triggered traffic (e.g., measurements or updates). The data granularity, in terms of information content and temporal resolution, is parameterized by the sizes of the generated packets and the duty cycle of packet generation. The generated data packets at the devices are aggregated through static terrestrial gateways. Universal frequency reuse is adopted across all gateways and randomized scheduling is utilized for the IoT devices associated with each gateway. Such a network model finds applications in environmental sensing, precision agriculture, and geological seismic sensing, to name a few. To this end, we develop a novel spatiotemporal mathematical model to characterize the interplay between data granularity, transmission reliability, and delay. The developed model accounts for several IoT design parameters, which include packet sizes, generation duty cycle, devices and gateways spatial densities, transmission rate adaptation, power control, and antenna directivity. For tractable analysis, we propose two accurate approximations, based on the Poisson point process, to characterize the signal-to-interference-plus-noise-ratio (SINR) based transmission reliability. For the delay analysis, we propose a phase-type arrival/departure (PH/PH/1) queueing model that accounts for packet generation, transmission scheduling, and rate-sensitive SINR-based packet departure. The developed model is utilized to obtain the optimal transmission rate for the IoT devices that minimizes delay. The numerical results delineate the joint feasibility range of packet sizes and inter-arrival times for data aggregation and reveal significant gains when deploying directional antennas.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2106.03213 [pdf, other]

On Local Aggregation in Heterophilic Graphs

Authors: Hesham Mostafa, Marcel Nassar, Somdeb Majumdar

Abstract: Many recent works have studied the performance of Graph Neural Networks (GNNs) in the context of graph homophily - a label-dependent measure of connectivity. Traditional GNNs generate node embeddings by aggregating information from a node's neighbors in the graph. Recent results in node classification tasks show that this local aggregation approach performs poorly in graphs with low homophily (heterophilic graphs). Several mechanisms have been proposed to improve the accuracy of GNNs on such graphs by increasing the aggregation range of a GNN layer, either through multi-hop aggregation, or through long-range aggregation from distant nodes. In this paper, we show that properly tuned classical GNNs and multi-layer perceptrons match or exceed the accuracy of recent long-range aggregation methods on heterophilic graphs. Thus, our results highlight the need for alternative datasets to benchmark long-range GNN aggregation mechanisms. We also show that homophily is a poor measure of the information in a node's local neighborhood and propose the Neighborhood Information Content (NIC) metric, which is a novel information-theoretic graph metric. We argue that NIC is more relevant for local aggregation methods as used by GNNs. We show that, empirically, it correlates better with GNN accuracy in node classification tasks than homophily.

Submitted 6 June, 2021; originally announced June 2021.

arXiv:2012.09904 [pdf, other]

Attention-based Image Upsampling

Authors: Souvik Kundu, Hesham Mostafa, Sharath Nittur Sridhar, Sairam Sundaresan

Abstract: Convolutional layers are an integral part of many deep neural network solutions in computer vision. Recent work shows that replacing the standard convolution operation with mechanisms based on self-attention leads to improved performance on image classification and object detection tasks. In this work, we show how attention mechanisms can be used to replace another canonical operation: strided transposed convolution. We term our novel attention-based operation attention-based upsampling since it increases/upsamples the spatial dimensions of the feature maps. Through experiments on single image super-resolution and joint-image upsampling tasks, we show that attention-based upsampling consistently outperforms traditional upsampling methods based on strided transposed convolution or based on adaptive filters while using fewer parameters. We show that the inherent flexibility of the attention mechanism, which allows it to use separate sources for calculating the attention coefficients and the attention targets, makes attention-based upsampling a natural choice when fusing information from multiple image modalities.

Submitted 17 December, 2020; originally announced December 2020.

arXiv:2003.00635 [pdf, other]

Permutohedral-GCN: Graph Convolutional Networks with Global Attention

Authors: Hesham Mostafa, Marcel Nassar

Abstract: Graph convolutional networks (GCNs) update a node's feature vector by aggregating features from its neighbors in the graph. This ignores potentially useful contributions from distant nodes. Identifying such useful distant contributions is challenging due to scalability issues (too many nodes can potentially contribute) and oversmoothing (aggregating features from too many nodes risks swamping out relevant information and may result in nodes having different labels but indistinguishable features). We introduce a global attention mechanism where a node can selectively attend to, and aggregate features from, any other node in the graph. The attention coefficients depend on the Euclidean distance between learnable node embeddings, and we show that the resulting attention-based global aggregation scheme is analogous to high-dimensional Gaussian filtering. This makes it possible to use efficient approximate Gaussian filtering techniques to implement our attention-based global aggregation scheme. By employing an approximate filtering method based on the permutohedral lattice, the time complexity of our proposed global aggregation scheme only grows linearly with the number of nodes. The resulting GCNs, which we term permutohedral-GCNs, are differentiable and trained end-to-end, and they achieve state-of-the-art performance on several node classification benchmarks.

Submitted 1 March, 2020; originally announced March 2020.

arXiv:1912.13075 [pdf, other]

Robust Federated Learning Through Representation Matching and Adaptive Hyper-parameters

Authors: Hesham Mostafa

Abstract: Federated learning is a distributed, privacy-aware learning scenario which trains a single model on data belonging to several clients. Each client trains a local model on its data and the local models are then aggregated by a central party. Current federated learning methods struggle in cases with heterogeneous client-side data distributions which can quickly lead to divergent local models and a collapse in performance. Careful hyper-parameter tuning is particularly important in these cases but traditional automated hyper-parameter tuning methods would require several training trials which is often impractical in a federated learning setting. We describe a two-pronged solution to the issues of robustness and hyper-parameter tuning in federated learning settings. We propose a novel representation matching scheme that reduces the divergence of local models by ensuring the feature representations in the global (aggregate) model can be derived from the locally learned representations. We also propose an online hyper-parameter tuning scheme which uses an online version of the REINFORCE algorithm to find a hyper-parameter distribution that maximizes the expected improvements in training loss. We show on several benchmarks that our two-part scheme of local representation matching and global adaptive hyper-parameters significantly improves performance and training robustness.

Submitted 30 December, 2019; originally announced December 2019.

arXiv:1907.06916 [pdf, other]

Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems

Authors: Mark D. McDonnell, Hesham Mostafa, Runchun Wang, Andre van Schaik

Abstract: Batch-normalization (BN) layers are thought to be an integrally important layer type in today's state-of-the-art deep convolutional neural networks for computer vision tasks such as classification and detection. However, BN layers introduce complexity and computational overheads that are highly undesirable for training and/or inference on low-power custom hardware implementations of real-time embedded vision systems such as UAVs, robots and Internet of Things (IoT) devices. They are also problematic when batch sizes need to be very small during training, and innovations such as residual connections introduced more recently than BN layers could potentially have lessened their impact. In this paper we aim to quantify the benefits BN layers offer in image classification networks, in comparison with alternative choices. In particular, we study networks that use shifted-ReLU layers instead of BN layers. We found, following experiments with wide residual networks applied to the ImageNet, CIFAR 10 and CIFAR 100 image classification datasets, that BN layers do not consistently offer a significant advantage. We found that the accuracy margin offered by BN layers depends on the data set, the network size, and the bit-depth of weights. We conclude that, in situations where BN layers are undesirable due to speed, memory or complexity costs, using shifted-ReLU layers instead should be considered; we found they can offer advantages in all these areas, and often do not impose a significant accuracy cost.

Submitted 22 July, 2019; v1 submitted 16 July, 2019; originally announced July 2019.

Comments: 8 pages, published IEEE conference paper

arXiv:1902.05967 [pdf, ps, other]

Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization

Authors: Hesham Mostafa, Xin Wang

Abstract: Modern deep neural networks are typically highly overparameterized. Pruning techniques are able to remove a significant fraction of network parameters with little loss in accuracy. Recently, techniques based on dynamic reallocation of non-zero parameters have emerged, allowing direct training of sparse networks without having to pre-train a large dense model. Here we present a novel dynamic sparse reparameterization method that addresses the limitations of previous techniques such as high computational cost and the need for manual configuration of the number of free parameters allocated to each layer. We evaluate the performance of dynamic reallocation methods in training deep convolutional networks and show that our method outperforms previous static and dynamic reparameterization methods, yielding the best accuracy for a fixed parameter budget, on par with accuracies obtained by iteratively pruning a pre-trained dense model. We further investigated the mechanisms underlying the superior generalization performance of the resultant sparse networks. We found that neither the structure, nor the initialization of the non-zero parameters were sufficient to explain the superior performance. Rather, effective learning crucially depended on the continuous exploration of the sparse network structure space during training. Our work suggests that exploring structural degrees of freedom during training is more effective than adding extra parameters to the network.

Submitted 12 May, 2019; v1 submitted 15 February, 2019; originally announced February 2019.

Comments: Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019

arXiv:1901.09948 [pdf, other]

Surrogate Gradient Learning in Spiking Neural Networks

Authors: Emre O. Neftci, Hesham Mostafa, Friedemann Zenke

Abstract: Spiking neural networks are nature's versatile solution to fault-tolerant and energy efficient signal processing. To translate these benefits into hardware, a growing number of neuromorphic spiking neural network processors attempt to emulate biological neural networks. These developments have created an imminent need for methods and tools to enable such systems to solve real-world signal processing problems. Like conventional neural networks, spiking neural networks can be trained on real, domain specific data. However, their training requires overcoming a number of challenges linked to their binary and dynamical nature. This article elucidates step-by-step the problems typically encountered when training spiking neural networks, and guides the reader through the key concepts of synaptic plasticity and data-driven learning in the spiking setting. To that end, it gives an overview of existing approaches and provides an introduction to surrogate gradient methods, specifically, as a particularly flexible and efficient method to overcome the aforementioned challenges.

Submitted 3 May, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

arXiv:1811.10766 [pdf, other]

doi 10.3389/fnins.2020.00424

Synaptic Plasticity Dynamics for Deep Continuous Local Learning (DECOLLE)

Authors: Jacques Kaiser, Hesham Mostafa, Emre Neftci

Abstract: A growing body of work underlines striking similarities between biological neural networks and recurrent, binary neural networks. A relatively smaller body of work, however, discusses similarities between learning dynamics employed in deep artificial neural networks and synaptic plasticity in spiking neural networks. The challenge preventing this is largely caused by the discrepancy between the dynamical properties of synaptic plasticity and the requirements for gradient backpropagation. Learning algorithms that approximate gradient backpropagation using locally synthesized gradients can overcome this challenge. Here, we show that synthetic gradients enable the derivation of Deep Continuous Local Learning (DECOLLE) in spiking neural networks. DECOLLE is capable of learning deep spatio-temporal representations from spikes relying solely on local information. Synaptic plasticity rules are derived systematically from user-defined cost functions and neural dynamics by leveraging existing autodifferentiation methods of machine learning frameworks. We benchmark our approach on the MNIST and the event-based neuromorphic DvsGesture dataset, on which DECOLLE performs comparably to the state-of-the-art. DECOLLE networks provide continuously learning machines that are relevant to biology and supportive of event-based, low-power computer vision architectures matching the accuracies of conventional computers on tasks where temporal precision and speed are essential.

Submitted 20 May, 2020; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: Published in Frontiers in Neuroscience - Neuromorphic Engineering

Journal ref: Frontiers in Neuroscience, 2020

arXiv:1711.06756 [pdf, other]

Deep supervised learning using local errors

Authors: Hesham Mostafa, Vishwajith Ramesh, Gert Cauwenberghs

Abstract: Error backpropagation is a highly effective mechanism for learning high-quality hierarchical features in deep networks. Updating the features or weights in one layer, however, requires waiting for the propagation of error signals from higher layers. Learning using delayed and non-local errors makes it hard to reconcile backpropagation with the learning mechanisms observed in biological neural networks as it requires the neurons to maintain a memory of the input long enough until the higher-layer errors arrive. In this paper, we propose an alternative learning mechanism where errors are generated locally in each layer using fixed, random auxiliary classifiers. Lower layers could thus be trained independently of higher layers and training could either proceed layer by layer, or simultaneously in all layers using local error information. We address biological plausibility concerns such as weight symmetry requirements and show that the proposed learning mechanism based on fixed, broad, and random tuning of each neuron to the classification categories outperforms the biologically-motivated feedback alignment learning technique on the MNIST, CIFAR10, and SVHN datasets, approaching the performance of standard backpropagation. Our approach highlights a potential biological mechanism for the supervised, or task-dependent, learning of feature hierarchies. In addition, we show that it is well suited for learning deep networks in custom hardware where it can drastically reduce memory traffic and data communication overheads.

Submitted 17 November, 2017; originally announced November 2017.

arXiv:1708.04251 [pdf, other]

A learning framework for winner-take-all networks with stochastic synapses

Authors: Hesham Mostafa, Gert Cauwenberghs

Abstract: Many recent generative models make use of neural networks to transform the probability distribution of a simple low-dimensional noise process into the complex distribution of the data. This raises the question of whether biological networks operate along similar principles to implement a probabilistic model of the environment through transformations of intrinsic noise processes. The intrinsic neural and synaptic noise processes in biological networks, however, are quite different from the noise processes used in current abstract generative networks. This, together with the discrete nature of spikes and local circuit interactions among the neurons, raises several difficulties when using recent generative modeling frameworks to train biologically motivated models. In this paper, we show that a biologically motivated model based on multi-layer winner-take-all (WTA) circuits and stochastic synapses admits an approximate analytical description. This allows us to use the proposed networks in a variational learning setting where stochastic backpropagation is used to optimize a lower bound on the data log likelihood, thereby learning a generative model of the data. We illustrate the generality of the proposed networks and learning technique by using them in a structured output prediction task, and in a semi-supervised learning task. Our results extend the domain of application of modern stochastic network architectures to networks where synaptic transmission failure is the principal noise mechanism.

Submitted 5 February, 2018; v1 submitted 14 August, 2017; originally announced August 2017.

arXiv:1707.03049 [pdf, other]

Hardware-efficient on-line learning through pipelined truncated-error backpropagation in binary-state networks

Authors: Hesham Mostafa, Bruno Pedroni, Sadique Sheik, Gert Cauwenberghs

Abstract: Artificial neural networks (ANNs) trained using backpropagation are powerful learning architectures that have achieved state-of-the-art performance in various benchmarks. Significant effort has been devoted to developing custom silicon devices to accelerate inference in ANNs. Accelerating the training phase, however, has attracted relatively little attention. In this paper, we describe a hardware-efficient on-line learning technique for feedforward multi-layer ANNs that is based on pipelined backpropagation. Learning is performed in parallel with inference in the forward pass, removing the need for an explicit backward pass and requiring no extra weight lookup. By using binary state variables in the feedforward network and ternary errors in truncated-error backpropagation, the need for any multiplications in the forward and backward passes is removed, and memory requirements for the pipelining are drastically reduced. Further reduction in addition operations owing to the sparsity in the forward neural and backpropagating error signal paths contributes to highly efficient hardware implementation. For proof-of-concept validation, we demonstrate on-line learning of MNIST handwritten digit classification on a Spartan 6 FPGA interfacing with an external 1Gb DDR2 DRAM, that shows small degradation in test error performance compared to an equivalently sized binary ANN trained off-line using standard back-propagation and exact errors. Our results highlight an attractive synergy between pipelined backpropagation and binary-state networks in substantially reducing computation and memory requirements, making pipelined on-line learning practical in deep networks.

Submitted 16 August, 2017; v1 submitted 15 June, 2017; originally announced July 2017.

Comments: Now also consider 0/1 binary activations. Memory access statistics reported

arXiv:1706.01406 [pdf, other]

doi 10.1109/TNNLS.2018.2852335

NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps

Authors: Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck

Abstract: Convolutional neural networks (CNNs) have become the dominant neural network architecture for solving many state-of-the-art (SOA) visual processing tasks. Even though Graphical Processing Units (GPUs) are most often used in training and deploying CNNs, their power efficiency is less than 10 GOp/s/W for single-frame runtime inference. We propose a flexible and efficient CNN accelerator architecture called NullHop that implements SOA CNNs useful for low-power and low-latency application scenarios. NullHop exploits the sparsity of neuron activations in CNNs to accelerate the computation and reduce memory requirements. The flexible architecture allows high utilization of available computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can process up to 128 input and 128 output feature maps per layer in a single pass. We implemented the proposed architecture on a Xilinx Zynq FPGA platform and present results showing how our implementation reduces external memory transfers and compute time in five different CNNs ranging from small ones up to the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop achieves an efficiency of 368%, maintains over 98% utilization of the MAC units, and achieves a power efficiency of over 3TOp/s/W in a core area of 6.3mm$^2$. As further proof of NullHop's usability, we interfaced its FPGA implementation with a neuromorphic event camera for real time interactive demonstrations.

Submitted 6 March, 2018; v1 submitted 5 June, 2017; originally announced June 2017.

arXiv:1606.08165 [pdf, other]

Supervised learning based on temporal coding in spiking neural networks

Authors: Hesham Mostafa

Abstract: Gradient descent training techniques are remarkably successful in training analog-valued artificial neural networks (ANNs). Such training techniques, however, do not transfer easily to spiking networks due to the spike generation hard non-linearity and the discrete nature of spike communication. We show that in a feedforward spiking network that uses a temporal coding scheme where information is encoded in spike times instead of spike rates, the network input-output relation is differentiable almost everywhere. Moreover, this relation is piece-wise linear after a transformation of variables. Methods for training ANNs thus carry directly to the training of such spiking networks as we show when training on the permutation invariant MNIST task. In contrast to rate-based spiking networks that are often used to approximate the behavior of ANNs, the networks we present spike much more sparsely and their behavior can not be directly approximated by conventional ANNs. Our results highlight a new approach for controlling the behavior of spiking networks with realistic temporal dynamics, opening up the potential for using these networks to process spike patterns with complex temporal information.

Submitted 16 August, 2017; v1 submitted 27 June, 2016; originally announced June 2016.

Comments: Extended the discussion and introduction. Clarified the training parameters

arXiv:1512.02930 [pdf, other]

Stochastic Interpretation of Quasi-periodic Event-based Systems

Authors: Hesham Mostafa, Giacomo Indiveri

Abstract: Many networks used in machine learning and as models of biological neural networks make use of stochastic neurons or neuron-like units. We show that stochastic artificial neurons can be realized on silicon chips by exploiting the quasi-periodic behavior of mismatched analog oscillators to approximate the neuron's stochastic activation function. We represent neurons by finite state machines (FSMs) that communicate using digital events and whose transitions are event-triggered. The event generation times of each neuron are controlled by an analog oscillator internal to that neuron/FSM and the frequencies of the oscillators in different FSMs are incommensurable. We show that within this quasi-periodic system, the transition graph of a FSM can be interpreted as the transition graph of a Markov chain and we show that by using different FSMs, we can obtain approximations of different stochastic activation functions. We investigate the quality of the stochastic interpretation of such a deterministic system and we use the system to realize and sample from a restricted Boltzmann machine. We implemented the quasi-periodic event-based system on a custom silicon chip and we show that the chip behavior can be used to closely approximate a stochastic sampling task.

Submitted 9 December, 2015; originally announced December 2015.

arXiv:1505.01139 [pdf, other]

doi 10.1038/ncomms9941

An event-based architecture for solving constraint satisfaction problems

Authors: Hesham Mostafa, Lorenz K. Müller, Giacomo Indiveri

Abstract: Constraint satisfaction problems (CSPs) are typically solved using conventional von Neumann computing architectures. However, these architectures do not reflect the distributed nature of many of these problems and are thus ill-suited to solving them. In this paper we present a hybrid analog/digital hardware architecture specifically designed to solve such problems. We cast CSPs as networks of stereotyped multi-stable oscillatory elements that communicate using digital pulses, or events. The oscillatory elements are implemented using analog non-stochastic circuits. The non-repeating phase relations among the oscillatory elements drive the exploration of the solution space. We show that this hardware architecture can yield state-of-the-art performance on a number of CSPs under reasonable assumptions on the implementation. We present measurements from a prototype electronic chip to demonstrate that a physical implementation of the proposed architecture is robust to practical non-idealities and to validate the theory proposed.

Submitted 4 May, 2015; originally announced May 2015.

Comments: First two authors contributed equally to this work

Journal ref: Nature Communications 6, Article number: 8941 (2015), pg. 1-10

arXiv:1202.3749 [pdf]

Compact Mathematical Programs For DEC-MDPs With Structured Agent Interactions

Authors: Hala Mostafa, Victor Lesser

Abstract: To deal with the prohibitive complexity of calculating policies in Decentralized MDPs, researchers have proposed models that exploit structured agent interactions. Settings where most agent actions are independent except for a few actions that affect the transitions and/or rewards of other agents can be modeled using Event-Driven Interactions with Complex Rewards (EDI-CR). Finding the optimal joint policy can be formulated as an optimization problem. However, existing formulations are too verbose and/or lack optimality guarantees. We propose a compact Mixed Integer Linear Program formulation of EDI-CR instances. The key insight is that most action sequences of a group of agents have the same effect on a given agent. This allows us to treat these sequences similarly and use fewer variables. Experiments show that our formulation is more compact and leads to faster solution times and better solutions than existing formulations.

Submitted 14 February, 2012; originally announced February 2012.

Report number: UAI-P-2011-PG-523-530

Showing 1–27 of 27 results for author: Mostafa, H