-
MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
Authors:
Kaan Ozkara,
Can Karakus,
Parameswaran Raman,
Mingyi Hong,
Shoham Sabach,
Branislav Kveton,
Volkan Cevher
Abstract:
Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. MADA achieves a greater validation performance improvement over Adam compared to other popular optimizers during GPT-2 training and fine-tuning. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization. Finally, we provide a convergence analysis to show that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
Submitted 17 June, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
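The AVGrad idea mentioned in the abstract above (replacing AMSGrad's maximum operator with averaging) can be illustrated with a minimal scalar sketch. The function names, the default hyper-parameters, and the choice of a uniform running average over past second-moment estimates are assumptions made for illustration; the paper itself defines the exact update.

```python
import math


def combine_second_moment(v_hat, v, t, mode):
    """Fold the new second-moment estimate v (step t, 1-indexed) into v_hat."""
    if mode == "amsgrad":
        # AMSGrad keeps the largest second-moment estimate seen so far.
        return max(v_hat, v)
    if mode == "avgrad":
        # Assumed AVGrad-style variant: uniform running average of the
        # estimates from steps 1..t, which is smooth in its inputs.
        return v_hat + (v - v_hat) / t
    raise ValueError(f"unknown mode: {mode}")


def adaptive_step(x, g, state, t, mode, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-like update on a scalar parameter x with gradient g.

    Bias correction is omitted to keep the sketch short.
    """
    m = beta1 * state["m"] + (1 - beta1) * g          # first moment
    v = beta2 * state["v"] + (1 - beta2) * g * g      # second moment
    v_hat = combine_second_moment(state["v_hat"], v, t, mode)
    new_state = {"m": m, "v": v, "v_hat": v_hat}
    return x - lr * m / (math.sqrt(v_hat) + eps), new_state
```

The two modes coincide at the first step and differ afterwards: the maximum never decreases, while the average can track a shrinking gradient scale.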
-
Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
Authors:
Can Karakus,
Rahul Huilgol,
Fei Wu,
Anirudh Subramanian,
Cade Daniel,
Derya Cavdar,
Teng Xu,
Haohan Chen,
Arash Rahnama,
Luis Quintela
Abstract:
With deep learning models rapidly growing in size, systems-level solutions for large-model training are required. We present Amazon SageMaker model parallelism, a software library that integrates with PyTorch, and enables easy training of large models using model parallelism and other memory-saving features. In contrast to existing solutions, the implementation of the SageMaker library is much more generic and flexible, in that it can automatically partition and run pipeline parallelism over arbitrary model architectures with minimal code change, and also offers a general and extensible framework for tensor parallelism, which supports a wider range of use cases, and is modular enough to be easily applied to new training scripts. The library also preserves the native PyTorch user experience to a much larger degree, supporting module re-use and dynamic graphs, while giving the user full control over the details of the training step. We evaluate performance over GPT-3, RoBERTa, BERT, and neural collaborative filtering, and demonstrate competitive performance over existing solutions.
Submitted 10 November, 2021;
originally announced November 2021.
-
Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations
Authors:
Debraj Basu,
Deepesh Data,
Can Karakus,
Suhas Diggavi
Abstract:
Communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression or computing local models and mixing them iteratively. In this paper, we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD in the distributed setting for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet and show that it results in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy.
Submitted 2 November, 2019; v1 submitted 5 June, 2019;
originally announced June 2019.
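The error-compensation mechanism described in the abstract above (tracking the difference between the true and compressed gradients) can be sketched as follows. This is a simplified illustration covering only top-k sparsification with a memory term; quantization and local steps are left out, and the helper names `top_k` and `compress_with_memory` are hypothetical rather than taken from the paper.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def compress_with_memory(update, memory, k):
    """Sparsify an update after adding back the error left from earlier rounds.

    Returns the vector that would be transmitted and the new error memory,
    i.e. the part of the corrected update that was not transmitted.
    """
    corrected = [u + e for u, e in zip(update, memory)]
    sent = top_k(corrected, k)
    new_memory = [c - s for c, s in zip(corrected, sent)]
    return sent, new_memory
```

Coordinates that are dropped in one round accumulate in the memory and are transmitted later once they become large enough, so no part of the update is discarded permanently.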
-
Densifying Assumed-sparse Tensors: Improving Memory Efficiency and MPI Collective Performance during Tensor Accumulation for Parallelized Training of Neural Machine Translation Models
Authors:
Derya Cavdar,
Valeriu Codreanu,
Can Karakus,
John A. Lockman III,
Damian Podareanu,
Vikram Saletore,
Alexander Sergeev,
Don D. Smith II,
Victor Suthichai,
Quy Ta,
Srinivas Varadharajan,
Lucas A. Wilson,
Rengan Xu,
Pei Yang
Abstract:
Neural machine translation - using neural networks to translate human language - is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpuses and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow "Transformer" model across multiple nodes have hit roadblocks due to excessive memory use and resulting out of memory errors when performing MPI collectives. This paper describes modifications made to the Horovod MPI-based distributed training framework to reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient reduction. The result is a dramatic increase in scale-out capability, with CPU-only scaling tests achieving 91% weak scaling efficiency up to 1200 MPI processes (300 nodes), and up to 65% strong scaling efficiency up to 400 MPI processes (200 nodes) using the Stampede2 supercomputer.
Submitted 10 May, 2019;
originally announced May 2019.
-
Differentially Private Consensus-Based Distributed Optimization
Authors:
Mehrdad Showkatbakhsh,
Can Karakus,
Suhas Diggavi
Abstract:
Data privacy is an important concern in learning, when datasets contain sensitive information about individuals. This paper considers consensus-based distributed optimization under data privacy constraints. Consensus-based optimization consists of a set of computational nodes arranged in a graph, each having a local objective that depends on their local data, where in every step nodes take a linear combination of their neighbors' messages, as well as taking a new gradient step. Since the algorithm requires exchanging messages that depend on local data, private information gets leaked at every step. Taking $(ε, δ)$-differential privacy (DP) as our criterion, we consider the strategy where the nodes add random noise to their messages before broadcasting them, and show that the method achieves convergence with a bounded mean-squared error, while satisfying $(ε, δ)$-DP. By relaxing the more stringent $ε$-DP requirement in previous work, we strengthen a known convergence result in the literature. We conclude the paper with numerical results demonstrating the effectiveness of our methods for mean estimation.
Submitted 18 March, 2019;
originally announced March 2019.
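One round of the noisy consensus scheme described in the abstract above might look like the sketch below for scalar node states. The function name, the use of Gaussian noise with a fixed `noise_std`, and the choice to mix every node's noisy message (including its own) are illustrative assumptions; the calibration of the noise needed to satisfy $(ε, δ)$-DP is not shown.

```python
import random


def private_consensus_round(states, weights, grads, step_size, noise_std, rng):
    """One consensus-plus-gradient round with noisy broadcasts.

    states:  current scalar state of each node
    weights: mixing matrix, weights[i][j] is the weight node i gives node j
             (zero when j is not a neighbor of i)
    grads:   per-node gradient functions of the local objectives
    """
    n = len(states)
    # Each node perturbs its state before broadcasting it.
    messages = [x + rng.gauss(0.0, noise_std) for x in states]
    new_states = []
    for i in range(n):
        # Linear combination of received messages, then a local gradient step.
        mixed = sum(weights[i][j] * messages[j] for j in range(n))
        new_states.append(mixed - step_size * grads[i](states[i]))
    return new_states


# Usage: two nodes estimating the mean of their local data points.
data = [0.0, 2.0]
grads = [lambda x, d=d: x - d for d in data]  # gradient of 0.5 * (x - d)^2
weights = [[0.5, 0.5], [0.5, 0.5]]
states = private_consensus_round(data, weights, grads, 0.5, 0.1, random.Random(0))
```

Larger `noise_std` hides more of each node's local data but leaves a larger residual error in the consensus value, which is the trade-off the abstract's bounded mean-squared error result refers to.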
-
Privacy-Utility Trade-off of Linear Regression under Random Projections and Additive Noise
Authors:
Mehrdad Showkatbakhsh,
Can Karakus,
Suhas Diggavi
Abstract:
Data privacy is an important concern in machine learning, and is fundamentally at odds with the task of training useful learning models, which typically require the acquisition of large amounts of private user data. One possible way of fulfilling the machine learning task while preserving user privacy is to train the model on a transformed, noisy version of the data, which does not reveal the data itself directly to the training procedure. In this work, we analyze the privacy-utility trade-off of two such schemes for the problem of linear regression: additive noise, and random projections. In contrast to previous work, we consider a recently proposed notion of differential privacy that is based on conditional mutual information (MI-DP), which is stronger than the conventional $(ε, δ)$-differential privacy, and use relative objective error as the utility metric. We find that projecting the data to a lower-dimensional subspace before adding noise attains a better trade-off in general. We also make a connection between the privacy problem and (non-coherent) SIMO, which has been extensively studied in wireless communication, and use tools from there for the analysis. We present numerical results demonstrating the performance of the schemes.
Submitted 12 February, 2019;
originally announced February 2019.
-
Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
Authors:
Can Karakus,
Yifan Sun,
Suhas Diggavi,
Wotao Yin
Abstract:
Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is "encoded" to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at every iteration, whose loss is compensated by the embedded redundancy. We show that oblivious application of several popular optimization algorithms on encoded data, including gradient descent, L-BFGS, proximal gradient under data parallelism, and coordinate descent under model parallelism, converges to either approximate or exact solutions of the original problem when stragglers are treated as erasures. These convergence results are deterministic, i.e., they establish sample path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, and demonstrate its performance over several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.
Submitted 14 March, 2018;
originally announced March 2018.
-
Straggler Mitigation in Distributed Optimization Through Data Encoding
Authors:
Can Karakus,
Yifan Sun,
Suhas Diggavi,
Wotao Yin
Abstract:
Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternate approach where we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data replication strategies.
Submitted 22 January, 2018; v1 submitted 14 November, 2017;
originally announced November 2017.
-
Approximate Capacity of Fast Fading Interference Channels with No Instantaneous CSIT
Authors:
Joyson Sebastian,
Can Karakus,
Suhas Diggavi
Abstract:
We develop a characterization of fading models, which assigns a number called logarithmic Jensen's gap to a given fading model. We show that as a consequence of a finite logarithmic Jensen's gap, approximate capacity region can be obtained for fast fading interference channels (FF-IC) for several scenarios. We illustrate three instances where a constant capacity gap can be obtained as a function of the logarithmic Jensen's gap. Firstly for an FF-IC with neither feedback nor instantaneous channel state information at transmitter (CSIT), if the fading distribution has finite logarithmic Jensen's gap, we show that a rate-splitting scheme based on average interference-to-noise ratio (inr) can achieve its approximate capacity. Secondly we show that a similar scheme can achieve the approximate capacity of FF-IC with feedback and delayed CSIT, if the fading distribution has finite logarithmic Jensen's gap. Thirdly, when this condition holds, we show that point-to-point codes can achieve approximate capacity for a class of FF-IC with feedback. We prove that the logarithmic Jensen's gap is finite for common fading models, including Rayleigh and Nakagami fading, thereby obtaining the approximate capacity region of FF-IC with these fading models. For Rayleigh fading the capacity gap is obtained as 1.83 bits per channel use for non-feedback case and 2.83 bits per channel use for feedback case. Our analysis also yields approximate capacity results for fading 2-tap ISI channel and fading interference multiple access channel as corollaries.
Submitted 3 June, 2018; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Enhancing Multiuser MIMO Through Opportunistic D2D Cooperation
Authors:
Can Karakus,
Suhas Diggavi
Abstract:
We propose a cellular architecture that combines multiuser MIMO (MU-MIMO) downlink with opportunistic use of unlicensed ISM bands to establish device-to-device (D2D) cooperation. The architecture consists of a physical-layer cooperation scheme based on forming downlink virtual MIMO channels through D2D relaying, and a novel resource allocation strategy for such D2D-enabled networks. We prove the approximate optimality of the physical-layer scheme, and demonstrate that such cooperation boosts the effective SNR of the weakest user in the system, especially in the many-user regime, due to multiuser diversity. To harness this physical-layer scheme, we formulate the cooperative user scheduling and relay selection problem using the network utility maximization framework. For such a cooperative network, we propose a novel utility metric that jointly captures fairness in throughput and the cost of relaying in the system. We propose a joint user scheduling and relay selection algorithm, which we prove to be asymptotically optimal. We study the architecture through system-level simulations over a wide range of scenarios. The highlight of these simulations is an approximately $6$x improvement in data rate for cell-edge (bottom fifth-percentile) users (over the state-of-the-art SU-MIMO) while still improving the overall throughput, and taking into account various system constraints.
Submitted 6 March, 2017; v1 submitted 20 April, 2016;
originally announced April 2016.
-
Opportunistic Scheduling for Full-Duplex Uplink-Downlink Networks
Authors:
Can Karakus,
Suhas Diggavi
Abstract:
We study opportunistic scheduling and the sum capacity of cellular networks with a full-duplex multi-antenna base station and a large number of single-antenna half-duplex users. Simultaneous uplink and downlink over the same band results in uplink-to-downlink interference, degrading performance. We present a simple opportunistic joint uplink-downlink scheduling algorithm that exploits multiuser diversity and treats interference as noise. We show that in homogeneous networks, our algorithm achieves the same sum capacity as what would have been achieved if there was no uplink-to-downlink interference, asymptotically in the number of users. The algorithm does not require interference CSI at the base station or uplink users. It is also shown that for a simple class of heterogeneous networks without sufficient channel diversity, it is not possible to achieve the corresponding interference-free system capacity. We discuss the potential for using device-to-device side-channels to overcome this limitation in heterogeneous networks.
Submitted 22 April, 2015;
originally announced April 2015.
-
Gaussian Interference Channel with Intermittent Feedback
Authors:
Can Karakus,
I-Hsiang Wang,
Suhas Diggavi
Abstract:
We investigate how to exploit intermittent feedback for interference management by studying the two-user Gaussian interference channel (IC). We approximately characterize (within a universal constant) the capacity region for the Gaussian IC with intermittent feedback. We exactly characterize the capacity region of the linear deterministic version of the problem, which gives us insight into the Gaussian problem. We find that the characterization only depends on the forward channel parameters and the marginal probability distribution of each feedback link. The result shows that passive and unreliable feedback can be harnessed to provide multiplicative capacity gain in Gaussian interference channels. We find that when the feedback links are active with sufficiently large probabilities, the perfect feedback sum-capacity is achieved to within a constant gap. In contrast to other schemes developed for interference channel with feedback, our achievable scheme makes use of quantize-map-and-forward to relay the information obtained through feedback, performs forward decoding, and does not use structured codes. We also develop new outer bounds enabling us to obtain the (approximate) characterization of the capacity region.
Submitted 30 August, 2015; v1 submitted 20 August, 2014;
originally announced August 2014.
-
Interference Channel with Intermittent Feedback
Authors:
Can Karakus,
I-Hsiang Wang,
Suhas Diggavi
Abstract:
We investigate how to exploit intermittent feedback for interference management. Focusing on the two-user linear deterministic interference channel, we completely characterize the capacity region. We find that the characterization only depends on the forward channel parameters and the marginal probability distribution of each feedback link. The scheme we propose makes use of block Markov encoding and quantize-map-and-forward at the transmitters, and backward decoding at the receivers. Matching outer bounds are derived based on novel genie-aided techniques. As a consequence, the perfect-feedback capacity can be achieved once the two feedback links are active with large enough probabilities.
Submitted 17 May, 2013; v1 submitted 14 May, 2013;
originally announced May 2013.