-
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Authors:
Ashwinee Panda,
Berivan Isik,
Xiangyu Qi,
Sanmi Koyejo,
Tsachy Weissman,
Prateek Mittal
Abstract:
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ti…
▽ More
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse task vectors), LoTA also enables model merging over highly dissimilar tasks. Our code is made publicly available at https://github.com/kiddyboots216/lottery-ticket-adaptation.
△ Less
Submitted 25 June, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations
Authors:
Rylan Schaeffer,
Victor Lecomte,
Dhruv Bhandarkar Pai,
Andres Carranza,
Berivan Isik,
Alyssa Unell,
Mikail Khona,
Thomas Yerxa,
Yann LeCun,
SueYeon Chung,
Andrey Gromov,
Ravid Shwartz-Ziv,
Sanmi Koyejo
Abstract:
Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to impro…
▽ More
Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradients steps, batch size, embedding dimension and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
On Fairness of Low-Rank Adaptation of Large Models
Authors:
Zhoujie Ding,
Ken Ziyu Liu,
Pura Peetathawatchai,
Berivan Isik,
Sanmi Koyejo
Abstract:
Low-rank adaptation of large models, particularly LoRA, has gained traction due to its computational efficiency. This efficiency, contrasted with the prohibitive costs of full-model fine-tuning, means that practitioners often turn to LoRA and sometimes without a complete understanding of its ramifications. In this study, we focus on fairness and ask whether LoRA has an unexamined impact on utility…
▽ More
Low-rank adaptation of large models, particularly LoRA, has gained traction due to its computational efficiency. This efficiency, contrasted with the prohibitive costs of full-model fine-tuning, means that practitioners often turn to LoRA and sometimes without a complete understanding of its ramifications. In this study, we focus on fairness and ask whether LoRA has an unexamined impact on utility, calibration, and resistance to membership inference across different subgroups (e.g., genders, races, religions) compared to a full-model fine-tuning baseline. We present extensive experiments across vision and language domains and across classification and generation tasks using ViT-Base, Swin-v2-Large, Llama-2 7B, and Mistral 7B. Intriguingly, experiments suggest that while one can isolate cases where LoRA exacerbates model bias across subgroups, the pattern is inconsistent -- in many cases, LoRA has equivalent or even improved fairness compared to the base model or its full fine-tuning baseline. We also examine the complications of evaluating fine-tuning fairness relating to task design and model token bias, calling for more careful fairness evaluations in future work.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Improved Communication-Privacy Trade-offs in $L_2$ Mean Estimation under Streaming Differential Privacy
Authors:
Wei-Ning Chen,
Berivan Isik,
Peter Kairouz,
Albert No,
Sewoong Oh,
Zheng Xu
Abstract:
We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square…
▽ More
We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square errors (MSEs); secondly, schemes achieving order-optimal communication-privacy trade-offs do not extend seamlessly to streaming differential privacy (DP) settings (e.g., tree aggregation or matrix factorization), rendering them incompatible with DP-FTRL type optimizers.
In this work, we tackle these issues by introducing a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP noise. Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that fast converge to those of the uncompressed Gaussian mechanism. Additionally, we extend the sparsification scheme to the matrix factorization framework under streaming DP and provide a precise accountant tailored for DP-FTRL type optimizers. Empirically, our method demonstrates at least a 100x improvement of compression for DP-SGD across various FL tasks.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
Authors:
Onur G. Guleryuz,
Philip A. Chou,
Berivan Isik,
Hugues Hoppe,
Danhang Tang,
Ruofei Du,
Jonathan Taylor,
Philip Davidson,
Sean Fanello
Abstract:
We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec's performance on its intended content, it can effectively adapt the codec to other types of image/video content and to…
▽ More
We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec's performance on its intended content, it can effectively adapt the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit ``neural code images'' that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec's design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 30\% bitrate reductions) compared to alternative adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potentials in image/video compression and streaming.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Scaling Laws for Downstream Task Performance of Large Language Models
Authors:
Berivan Isik,
Natalia Ponomareva,
Hussein Hazimeh,
Dimitris Paparas,
Sergei Vassilvitskii,
Sanmi Koyejo
Abstract:
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we…
▽ More
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream BLEU score with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these observations, we provide new practical insights for choosing appropriate pretraining data.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Adaptive Compression in Federated Learning via Side Information
Authors:
Berivan Isik,
Francesco Pase,
Deniz Gunduz,
Sanmi Koyejo,
Tsachy Weissman,
Michele Zorzi
Abstract:
The high communication cost of sending model updates from the clients to the server is a significant bottleneck for scalable federated learning (FL). Among existing approaches, state-of-the-art bitrate-accuracy tradeoffs have been achieved using stochastic compression methods -- in which the client $n$ sends a sample from a client-only probability distribution $q_{φ^{(n)}}$, and the server estimat…
▽ More
The high communication cost of sending model updates from the clients to the server is a significant bottleneck for scalable federated learning (FL). Among existing approaches, state-of-the-art bitrate-accuracy tradeoffs have been achieved using stochastic compression methods -- in which the client $n$ sends a sample from a client-only probability distribution $q_{φ^{(n)}}$, and the server estimates the mean of the clients' distributions using these samples. However, such methods do not take full advantage of the FL setup where the server, throughout the training process, has side information in the form of a global distribution $p_θ$ that is close to the clients' distribution $q_{φ^{(n)}}$ in Kullback-Leibler (KL) divergence. In this work, we exploit this closeness between the clients' distributions $q_{φ^{(n)}}$'s and the side information $p_θ$ at the server, and propose a framework that requires approximately $D_{KL}(q_{φ^{(n)}}|| p_θ)$ bits of communication. We show that our method can be integrated into many existing stochastic compression frameworks to attain the same (and often higher) test accuracy with up to $82$ times smaller bitrate than the prior work -- corresponding to 2,650 times overall compression.
△ Less
Submitted 21 April, 2024; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation
Authors:
Berivan Isik,
Wei-Ning Chen,
Ayfer Ozgur,
Tsachy Weissman,
Albert No
Abstract:
We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed \emph{order}-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), \emph{exact} optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the \emph{exact}-optim…
▽ More
We study the mean estimation problem under communication and local differential privacy constraints. While previous work has proposed \emph{order}-optimal algorithms for the same problem (i.e., asymptotically optimal as we spend more bits), \emph{exact} optimality (in the non-asymptotic setting) still has not been achieved. In this work, we take a step towards characterizing the \emph{exact}-optimal approach in the presence of shared randomness (a random variable shared between the server and the user) and identify several conditions for \emph{exact} optimality. We prove that one of the conditions is to utilize a rotationally symmetric shared random codebook. Based on this, we propose a randomization mechanism where the codebook is a randomly rotated simplex -- satisfying the properties of the \emph{exact}-optimal codebook. The proposed mechanism is based on a $k$-closest encoding which we prove to be \emph{exact}-optimal for the randomly rotated simplex codebook.
△ Less
Submitted 28 October, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Sandwiched Video Compression: Efficiently Extending the Reach of Standard Codecs with Neural Wrappers
Authors:
Berivan Isik,
Onur G. Guleryuz,
Danhang Tang,
Jonathan Taylor,
Philip A. Chou
Abstract:
We propose sandwiched video compression -- a video compression system that wraps neural networks around a standard video codec. The sandwich framework consists of a neural pre- and post-processor with a standard video codec between them. The networks are trained jointly to optimize a rate-distortion loss function with the goal of significantly improving over the standard codec in various compressi…
▽ More
We propose sandwiched video compression -- a video compression system that wraps neural networks around a standard video codec. The sandwich framework consists of a neural pre- and post-processor with a standard video codec between them. The networks are trained jointly to optimize a rate-distortion loss function with the goal of significantly improving over the standard codec in various compression scenarios. End-to-end training in this setting requires a differentiable proxy for the standard video codec, which incorporates temporal processing with motion compensation, inter/intra mode decisions, and in-loop filtering. We propose differentiable approximations to key video codec components and demonstrate that, in addition to providing meaningful compression improvements over the standard codec, the neural codes of the sandwich lead to significantly better rate-distortion performance in two important scenarios.When transporting high-resolution video via low-resolution HEVC, the sandwich system obtains 6.5 dB improvements over standard HEVC. More importantly, using the well-known perceptual similarity metric, LPIPS, we observe 30% improvements in rate at the same quality over HEVC. Last but not least, we show that pre- and post-processors formed by very modestly-parameterized, light-weight networks can closely approximate these results.
△ Less
Submitted 5 July, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
Upper bounds on the Rate of Uniformly-Random Codes for the Deletion Channel
Authors:
Berivan Isik,
Francisco Pernice,
Tsachy Weissman
Abstract:
We consider the maximum coding rate achievable by uniformly-random codes for the deletion channel. We prove an upper bound that's within 0.1 of the best known lower bounds for all values of the deletion probability $d,$ and much closer for small and large $d.$ We give simulation results which suggest that our upper bound is within 0.05 of the exact value for all $d$, and within $0.01$ for…
▽ More
We consider the maximum coding rate achievable by uniformly-random codes for the deletion channel. We prove an upper bound that's within 0.1 of the best known lower bounds for all values of the deletion probability $d,$ and much closer for small and large $d.$ We give simulation results which suggest that our upper bound is within 0.05 of the exact value for all $d$, and within $0.01$ for $d>0.75$. Despite our upper bounds, based on simulations, we conjecture that a positive rate is achievable with uniformly-random codes for all deletion probabilities less than 1. Our results imply impossibility results for the (equivalent) problem of compression of i.i.d. sources correlated via the deletion channel, a relevant model for DNA storage.
△ Less
Submitted 13 October, 2022;
originally announced October 2022.
-
Sparse Random Networks for Communication-Efficient Federated Learning
Authors:
Berivan Isik,
Francesco Pase,
Deniz Gunduz,
Tsachy Weissman,
Michele Zorzi
Abstract:
One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \em…
▽ More
One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime under various system configurations.
△ Less
Submitted 8 February, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Lossy Compression of Noisy Data for Private and Data-Efficient Learning
Authors:
Berivan Isik,
Tsachy Weissman
Abstract:
Storage-efficient privacy-preserving learning is crucial due to increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show…
▽ More
Storage-efficient privacy-preserving learning is crucial due to increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when appropriately matching the lossy compression to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while reducing storage and privacy leakage by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), overall storage of the data is substantially reduced, with no essential loss (and in some cases a slight boost) to the classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
△ Less
Submitted 22 March, 2023; v1 submitted 6 February, 2022;
originally announced February 2022.
-
LVAC: Learned Volumetric Attribute Compression for Point Clouds using Coordinate Based Networks
Authors:
Berivan Isik,
Philip A. Chou,
Sung ** Hwang,
Nick Johnston,
George Toderici
Abstract:
We consider the attributes of a point cloud as samples of a vector-valued volumetric function at discrete positions. To compress the attributes given the positions, we compress the parameters of the volumetric function. We model the volumetric function by tiling space into blocks, and representing the function over each block by shifts of a coordinate-based, or implicit, neural network. Inputs to…
▽ More
We consider the attributes of a point cloud as samples of a vector-valued volumetric function at discrete positions. To compress the attributes given the positions, we compress the parameters of the volumetric function. We model the volumetric function by tiling space into blocks, and representing the function over each block by shifts of a coordinate-based, or implicit, neural network. Inputs to the network include both spatial coordinates and a latent vector per block. We represent the latent vectors using coefficients of the region-adaptive hierarchical transform (RAHT) used in the MPEG geometry-based point cloud codec G-PCC. The coefficients, which are highly compressible, are rate-distortion optimized by back-propagation through a rate-distortion Lagrangian loss in an auto-decoder configuration. The result outperforms RAHT by 2--4 dB. This is the first work to compress volumetric functions represented by local coordinate-based neural networks. As such, we expect it to be applicable beyond point clouds, for example to compression of high-resolution neural radiance fields.
△ Less
Submitted 17 November, 2021;
originally announced November 2021.
-
Neural 3D Scene Compression via Model Compression
Authors:
Berivan Isik
Abstract:
Rendering 3D scenes requires access to arbitrary viewpoints from the scene. Storage of such a 3D scene can be done in two ways; (1) storing 2D images taken from the 3D scene that can reconstruct the scene back through interpolations, or (2) storing a representation of the 3D scene itself that already encodes views from all directions. So far, traditional 3D compression methods have focused on the…
▽ More
Rendering 3D scenes requires access to arbitrary viewpoints from the scene. Storage of such a 3D scene can be done in two ways; (1) storing 2D images taken from the 3D scene that can reconstruct the scene back through interpolations, or (2) storing a representation of the 3D scene itself that already encodes views from all directions. So far, traditional 3D compression methods have focused on the first type of storage and compressed the original 2D images with image compression techniques. With this approach, the user first decodes the stored 2D images and then renders the 3D scene. However, this separated procedure is inefficient since a large amount of 2D images have to be stored. In this work, we take a different approach and compress a functional representation of 3D scenes. In particular, we introduce a method to compress 3D scenes by compressing the neural networks that represent the scenes as neural radiance fields. Our method provides more efficient storage of 3D scenes since it does not store 2D images -- which are redundant when we render the scene from the neural functional representation.
△ Less
Submitted 7 May, 2021;
originally announced May 2021.
-
An Information-Theoretic Justification for Model Pruning
Authors:
Berivan Isik,
Tsachy Weissman,
Albert No
Abstract:
We study the neural network (NN) compression problem, viewing the tension between the compression ratio and NN performance through the lens of rate-distortion theory. We choose a distortion metric that reflects the effect of NN compression on the model output and derive the tradeoff between rate (compression) and distortion. In addition to characterizing theoretical limits of NN compression, this…
▽ More
We study the neural network (NN) compression problem, viewing the tension between the compression ratio and NN performance through the lens of rate-distortion theory. We choose a distortion metric that reflects the effect of NN compression on the model output and derive the tradeoff between rate (compression) and distortion. In addition to characterizing theoretical limits of NN compression, this formulation shows that \emph{pruning}, implicitly or explicitly, must be a part of a good compression algorithm. This observation bridges a gap between parts of the literature pertaining to NN and data compression, respectively, providing insight into the empirical success of model pruning. Finally, we propose a novel pruning strategy derived from our information-theoretic formulation and show that it outperforms the relevant baselines on CIFAR-10 and ImageNet datasets.
△ Less
Submitted 9 February, 2022; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Neural Network Compression for Noisy Storage Devices
Authors:
Berivan Isik,
Kristy Choi,
Xin Zheng,
Tsachy Weissman,
Stefano Ermon,
H. -S. Philip Wong,
Armin Alaghi
Abstract:
Compression and efficient storage of neural network (NN) parameters is critical for applications that run on resource-constrained devices. Despite the significant progress in NN model compression, there has been considerably less investigation in the actual \textit{physical} storage of NN parameters. Conventionally, model compression and physical storage are decoupled, as digital storage media wit…
▽ More
Compression and efficient storage of neural network (NN) parameters is critical for applications that run on resource-constrained devices. Despite the significant progress in NN model compression, there has been considerably less investigation in the actual \textit{physical} storage of NN parameters. Conventionally, model compression and physical storage are decoupled, as digital storage media with error-correcting codes (ECCs) provide robust error-free storage. However, this decoupled approach is inefficient as it ignores the overparameterization present in most NNs and forces the memory device to allocate the same amount of resources to every bit of information regardless of its importance. In this work, we investigate analog memory devices as an alternative to digital media -- one that naturally provides a way to add more protection for significant bits unlike its counterpart, but is noisy and may compromise the stored model's performance if used naively. We develop a variety of robust coding strategies for NN weight storage on analog devices, and propose an approach to jointly optimize model compression and memory resource allocation. We then demonstrate the efficacy of our approach on models trained on MNIST, CIFAR-10 and ImageNet datasets for existing compression techniques. Compared to conventional error-free digital storage, our method reduces the memory footprint by up to one order of magnitude, without significantly compromising the stored model's accuracy.
△ Less
Submitted 13 March, 2023; v1 submitted 15 February, 2021;
originally announced February 2021.
-
rTop-k: A Statistical Estimation Approach to Distributed SGD
Authors:
Leighton Pate Barnes,
Huseyin A. Inan,
Berivan Isik,
Ayfer Ozgur
Abstract:
The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k a…
▽ More
The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.
△ Less
Submitted 2 December, 2020; v1 submitted 21 May, 2020;
originally announced May 2020.
-
Visual GUI testing in practice: An extended industrial case study
Authors:
Vahid Garousi,
Wasif Afzal,
Adem Çağlar,
İhsan Berk Işık,
Berker Baydan,
Seçkin Çaylak,
Ahmet Zeki Boyraz,
Burak Yolaçan,
Kadir Herkiloğlu
Abstract:
Context: Visual GUI testing (VGT) is referred to as the latest generation GUI-based testing. It is a tool-driven technique, which uses image recognition for interacting with and asserting the behavior of the system under test. Motivated by the industrial need of a large Turkish software and systems company providing solutions in the areas of defense and IT sector, an action-research project was re…
▽ More
Context: Visual GUI testing (VGT) is referred to as the latest generation GUI-based testing. It is a tool-driven technique, which uses image recognition for interacting with and asserting the behavior of the system under test. Motivated by the industrial need of a large Turkish software and systems company providing solutions in the areas of defense and IT sector, an action-research project was recently initiated to implement VGT in several teams and projects in the company.
Objective: To address the above needs, we planned and carried out an empirical investigation with the goal of assessing VGT using two tools (Sikuli and JAutomate). The purpose was to determine a suitable approach and tool for VGT of a given project (software product) in the company, increase the know-how in the company's test teams.
Method: Using an action-research case-study design, we investigated the use of VGT in the studied organization. Specifically, using the two selected VGT tools, we conducted a quantitative and a qualitative evaluation of VGT.
Results: By assessing the list of Challenges, Problems and Limitations (CPL), proposed in previous work, in the context of our empirical study, we found that test-tool- and SUT-related CPLs were quite comparable to a previous empirical study, e.g., the synchronization between SUT and test tools were not always robust and there were failures in test tools' image recognition features. When assessing the types of test maintenance activities, when executing the automated test cases on next versions of the SUTs, for the case of the two test tools, we found that about half of the test cases (59.1% and 47.8%) failed in the next version.
Conclusion: By our results, we confirm some of the previously-reported issues when conducting VGT. Further, we highlight some additional challenges in test maintenance when using VGT.
△ Less
Submitted 20 May, 2020; v1 submitted 19 May, 2020;
originally announced May 2020.