Showing 1–50 of 64 results for author: Papailiopoulos, D

  1. arXiv:2406.19292

    cs.LG cs.AI cs.CL

    From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

    Authors: Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B dem…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2403.08058

    cs.LG cs.CL

    CHAI: Clustered Head Attention for Efficient LLM Inference

    Authors: Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

    Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and comput…

    Submitted 27 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  3. arXiv:2403.03183

    cs.LG cs.AI math.OC stat.ML

    How Well Can Transformers Emulate In-context Newton's Method?

    Authors: Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

    Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into its underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher orde…

    Submitted 5 March, 2024; originally announced March 2024.

  4. arXiv:2402.04248

    cs.LG

    Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

    Authors: Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of mo…

    Submitted 25 April, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: Changes in v2: experiments on formal language ICL and explorations of width vs. depth on ICL; code repo available (24 pages, 10 figures)

  5. arXiv:2311.12424

    cs.LG cs.NE

    Looped Transformers are Better at Learning Learning Algorithms

    Authors: Liu Yang, Kangwook Lee, Robert Nowak, Dimitris Papailiopoulos

    Abstract: Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utili…

    Submitted 16 March, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted for publication at ICLR 2024

  6. arXiv:2307.05908

    cs.CL cs.LG

    Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

    Authors: Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Unlike conventional strategies, PPD employs additional compute resources to parallelize the initiation of subsequent token decoding during the current token decoding. This innovative method reduces de…

    Submitted 12 July, 2023; originally announced July 2023.

    Comments: ES-FoMo Workshop at ICML 2023

  7. arXiv:2307.05906

    cs.LG

    Mini-Batch Optimization of Contrastive Loss

    Authors: Jaewoong Cho, Kartik Sreenivasan, Keon Lee, Kyunghoo Mun, Soheun Yi, Jeong-Gwan Lee, Anna Lee, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Contrastive learning has gained significant attention as a method for self-supervised learning. The contrastive loss function ensures that embeddings of positive sample pairs (e.g., different samples from the same class or different views of the same object) are similar, while embeddings of negative pairs are dissimilar. Practical constraints such as large memory requirements make it challenging t…

    Submitted 12 July, 2023; originally announced July 2023.

  8. arXiv:2307.03381

    cs.LG

    Teaching Arithmetic to Small Transformers

    Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as add…

    Submitted 7 July, 2023; originally announced July 2023.

  9. arXiv:2305.18869

    cs.LG cs.AI cs.CL

    Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning

    Authors: Yingcong Li, Kartik Sreenivasan, Angeliki Giannou, Dimitris Papailiopoulos, Samet Oymak

    Abstract: Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositi…

    Submitted 7 November, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted for NeurIPS 2023. Changes in this version: refined title, restructured content, included new out-of-distribution experiments, and code now available

  10. Prompted LLMs as Chatbot Modules for Long Open-domain Conversation

    Authors: Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning. Our method utilizes pre-trained large language models (LLMs) as individual modules for long-term consistency and flexibility, by using techniques such as few-shot prompting, chain-of-thought (CoT), and external memory. Our human evaluation resul…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted to the Findings of ACL2023. The camera-ready version with additional experimental results will be uploaded

  11. arXiv:2305.02538

    cs.LG

    Cuttlefish: Low-Rank Model Training without All the Tuning

    Authors: Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos

    Abstract: Recent research has shown that training low-rank neural networks can effectively reduce the total number of trainable parameters without sacrificing predictive accuracy, resulting in end-to-end speedups. However, low-rank model training necessitates adjusting several additional factorization hyperparameters, such as the rank of the factorization at each layer. In this paper, we tackle this challen…

    Submitted 5 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: Accepted for presentation at MLSys 2023

  12. arXiv:2302.07937

    cs.LG cs.AI stat.ML

    The Expressive Power of Tuning Only the Normalization Layers

    Authors: Angeliki Giannou, Shashank Rajput, Dimitris Papailiopoulos

    Abstract: Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks. Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks. These findings open the questions about the expressive power of tuning the norm…

    Submitted 4 July, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

  13. arXiv:2301.13196

    cs.LG cs.AI

    Looped Transformers as Programmable Computers

    Authors: Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

    Abstract: We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, fu…

    Submitted 30 January, 2023; originally announced January 2023.

  14. arXiv:2301.07067

    cs.LG cs.CL stat.ML

    Transformers as Algorithms: Generalization and Stability in In-context Learning

    Authors: Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak

    Abstract: In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. In this work, we formalize in-context learning as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference-time. We first explore the statistical aspects of this abstraction through t…

    Submitted 6 February, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

    Comments: Revised version significantly improves the stability guarantees and provides new experiments

  15. arXiv:2210.03069

    cs.LG

    PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

    Authors: Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak

    Abstract: Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks w…

    Submitted 5 July, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

  16. arXiv:2206.06565

    cs.LG cs.CL

    LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

    Authors: Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Fine-tuning pretrained language models (LMs) without making any architectural changes has become a norm for learning various language downstream tasks. However, for non-language downstream tasks, a common practice is to employ task-specific designs for input, output layers, and loss functions. For instance, it is possible to fine-tune an LM into an MNIST classifier by replacing the word embedding…

    Submitted 30 October, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

    Comments: Accepted at NeurIPS 2022

  17. arXiv:2205.11616

    cs.CL cs.LG

    Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment

    Authors: Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Timothy Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations b…

    Submitted 7 November, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)

  18. arXiv:2202.12002

    cs.LG cs.AI cs.CV

    Rare Gems: Finding Lottery Tickets at Initialization

    Authors: Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work by Frankle e…

    Submitted 2 June, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

  19. arXiv:2201.02354

    cs.LG

    GenLabel: Mixup Relabeling using Generative Models

    Authors: Jy-yong Sohn, Liang Shang, Hongxu Chen, Jaekyun Moon, Dimitris Papailiopoulos, Kangwook Lee

    Abstract: Mixup is a data augmentation method that generates new data points by mixing a pair of input data. While mixup generally improves the prediction performance, it sometimes degrades the performance. In this paper, we first identify the main causes of this phenomenon by theoretically and empirically analyzing the mixup algorithm. To resolve this, we propose GenLabel, a simple yet effective relabeling…

    Submitted 7 January, 2022; originally announced January 2022.

  20. arXiv:2110.08996

    cs.LG cs.AI

    Finding Everything within Random Binary Networks

    Authors: Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos

    Abstract: A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks. A follow-up line of theoretical work provides justification of these findings by proving that slightly overparameterized neural networks, with commonly used cont…

    Submitted 22 October, 2021; v1 submitted 17 October, 2021; originally announced October 2021.

  21. arXiv:2106.07724

    cs.LG cs.IT stat.ML

    An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

    Authors: Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, Amin Karbasi

    Abstract: It is well known that modern deep neural networks are powerful enough to memorize datasets even when the labels have been randomized. Recently, Vershynin (2020) settled a long standing question by Baum (1988), proving that deep threshold networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and…

    Submitted 14 June, 2021; originally announced June 2021.

  22. arXiv:2103.03936

    cs.LG

    Pufferfish: Communication-efficient Models At No Extra Cost

    Authors: Hongyi Wang, Saurabh Agarwal, Dimitris Papailiopoulos

    Abstract: To mitigate communication overheads in distributed model training, several studies propose the use of compressed stochastic gradients, usually achieved by sparsification or quantization. Such techniques achieve high compression ratios, but in many cases incur either significant computational overheads or some accuracy loss. In this work, we present Pufferfish, a communication and computation effic…

    Submitted 5 March, 2021; originally announced March 2021.

    Comments: Accepted by MLSys 2021

  23. arXiv:2103.00543

    cs.DC cs.LG

    On the Utility of Gradient Compression in Distributed Training Systems

    Authors: Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos

    Abstract: A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD acr…

    Submitted 29 June, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

  24. arXiv:2102.09718

    cs.LG math.OC stat.ML

    Permutation-Based SGD: Is Random Optimal?

    Authors: Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling. However, is random optimal? We show that this depends heavily on what functions we are optimizing, and the convergence gap between optimal and random permutations can vary from exponential to nonexistent. We f…

    Submitted 24 November, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

  25. arXiv:2010.16248

    cs.LG

    Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

    Authors: Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, Dimitris Papailiopoulos

    Abstract: Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes. To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates. The techniques usually require choosing a static compression ratio, often requiring users to balance the trade-off between model acc…

    Submitted 29 October, 2020; originally announced October 2020.

  26. arXiv:2007.05084

    cs.LG cs.CR cs.DC stat.ML

    Attack of the Tails: Yes, You Really Can Backdoor Federated Learning

    Authors: Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training. The goal of a backdoor is to corrupt the performance of the trained model on specific sub-tasks (e.g., by classifying green cars as frogs). A range of FL backdoor attacks have been introduced in the literature, but also methods to defend against them, and it is cur…

    Submitted 9 July, 2020; originally announced July 2020.

  27. arXiv:2006.07990

    cs.LG cs.IT stat.ML

    Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

    Authors: Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

    Abstract: The strong lottery ticket hypothesis (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random…

    Submitted 11 March, 2021; v1 submitted 14 June, 2020; originally announced June 2020.

  28. arXiv:2002.10400

    cs.LG math.OC stat.ML

    Closing the convergence gap of SGD without replacement

    Authors: Shashank Rajput, Anant Gupta, Dimitris Papailiopoulos

    Abstract: Stochastic gradient descent without replacement sampling is widely used in practice for model training. However, the vast majority of SGD analyses assumes data is sampled with replacement, and when the function minimized is strongly convex, an $\mathcal{O}\left(\frac{1}{T}\right)$ rate can be established when SGD is run for $T$ iterations. A recent line of breakthrough works on SGD without replace…

    Submitted 9 July, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

    Comments: Simplified some proofs and fixed typos

  29. arXiv:2002.06440

    cs.LG stat.ML

    Federated Learning with Matched Averaging

    Authors: Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, Yasaman Khazaeni

    Abstract: Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud. We propose Federated matched averaging (FedMA) algorithm designed for federated learning of modern neural network architectures e.g. convolutional neural networks (CNNs) and LSTMs. FedMA c…

    Submitted 15 February, 2020; originally announced February 2020.

    Comments: Accepted by ICLR 2020

  30. arXiv:1907.12205

    cs.LG cs.DC stat.ML

    DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

    Authors: Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

    Abstract: To improve the resilience of distributed training to worst-case, or Byzantine node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and only have limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but…

    Submitted 7 March, 2020; v1 submitted 29 July, 2019; originally announced July 2019.

  31. arXiv:1906.02613

    cs.LG stat.ML

    Bad Global Minima Exist and SGD Can Reach Them

    Authors: Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas

    Abstract: Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image class…

    Submitted 22 February, 2021; v1 submitted 6 June, 2019; originally announced June 2019.

  32. arXiv:1905.09209

    cs.LG math.OC stat.ML

    Convergence and Margin of Adversarial Training on Separable Data

    Authors: Zachary Charles, Shashank Rajput, Stephen Wright, Dimitris Papailiopoulos

    Abstract: Adversarial training is a technique for training robust machine learning models. To encourage robustness, it iteratively computes adversarial examples for the model, and then re-trains on these examples via some update rule. This work analyzes the performance of adversarial training on linearly separable data, and provides bounds on the number of iterations required for large margin. We show that…

    Submitted 22 May, 2019; originally announced May 2019.

  33. arXiv:1905.03177

    cs.LG stat.ML

    Does Data Augmentation Lead to Positive Margin?

    Authors: Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos

    Abstract: Data augmentation (DA) is commonly used during model training, as it significantly improves test error and model robustness. DA artificially expands the training set by applying random noise, rotations, crops, or even adversarial perturbations to the input data. Although DA is widely used, its capacity to provably improve robustness is not fully understood. In this work, we analyze the robustness…

    Submitted 8 May, 2019; originally announced May 2019.

    Comments: ICML 2019

  34. arXiv:1904.03257

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  35. arXiv:1901.09671

    cs.LG cs.DC cs.IT math.OC stat.ML

    ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

    Authors: Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

    Abstract: We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding. Gradient coded distributed GD uses redundancy to exactly recover the gradient at each iteration from a subset of compute nodes. ErasureHead instead uses approximate gradient codes to recover an inexact gradient at each iteration, but with higher delay…

    Submitted 28 January, 2019; originally announced January 2019.

  36. arXiv:1811.03531

    cs.LG stat.ML

    A Geometric Perspective on the Transferability of Adversarial Directions

    Authors: Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos

    Abstract: State-of-the-art machine learning models frequently misclassify inputs that have been perturbed in an adversarial manner. Adversarial perturbations generated for a given input and a specific classifier often seem to be effective on other inputs and even different classifiers. In other words, adversarial perturbations seem to transfer between different inputs, models, and even different neural netw…

    Submitted 8 November, 2018; originally announced November 2018.

  37. arXiv:1806.04090

    stat.ML cs.DC cs.LG

    ATOMO: Communication-efficient Learning via Atomic Sparsification

    Authors: Hongyi Wang, Scott Sievert, Zachary Charles, Shengchao Liu, Stephen Wright, Dimitris Papailiopoulos

    Abstract: Distributed model training suffers from communication overheads due to frequent gradient updates transmitted between compute nodes. To mitigate these overheads, several studies propose the use of sparsified stochastic gradients. We argue that these are facets of a general sparsification method that can operate on any possible atomic decomposition. Notable examples include element-wise, singular va…

    Submitted 8 November, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

  38. arXiv:1806.03791

    stat.ML cs.DC cs.LG math.OC stat.CO

    The Effect of Network Width on the Performance of Large-batch Training

    Authors: Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris

    Abstract: Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards a…

    Submitted 10 June, 2018; originally announced June 2018.

  39. arXiv:1805.10378

    stat.ML cs.DC cs.IT cs.LG stat.CO

    Gradient Coding via the Stochastic Block Model

    Authors: Zachary Charles, Dimitris Papailiopoulos

    Abstract: Gradient descent and its many variants, including mini-batch stochastic gradient descent, form the algorithmic foundation of modern large-scale machine learning. Due to the size and scale of modern data, gradient computations are often distributed across multiple compute nodes. Unfortunately, such distributed implementations can face significant delays caused by straggler nodes, i.e., nodes that a…

    Submitted 25 May, 2018; originally announced May 2018.

  40. arXiv:1803.09877

    stat.ML cs.DC cs.IT cs.LG cs.NE

    DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

    Authors: Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

    Abstract: Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur…

    Submitted 21 June, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: Accepted by ICML 2018

  41. arXiv:1711.06771

    stat.ML cs.DC cs.IT cs.LG stat.CO

    Approximate Gradient Coding via Sparse Random Graphs

    Authors: Zachary Charles, Dimitris Papailiopoulos, Jordan Ellenberg

    Abstract: Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time. Coding-theoretic techniques have been recently proposed to mitigate stragglers via algorithmic redundancy. Prior work in coded computation and gradient coding has mainly focused on exact recovery of the desired output. However, slightly inexact solutions c…

    Submitted 17 November, 2017; originally announced November 2017.

  42. arXiv:1710.08402

    stat.ML cs.IT cs.LG math.OC

    Stability and Generalization of Learning Algorithms that Converge to Global Optima

    Authors: Zachary Charles, Dimitris Papailiopoulos

    Abstract: We establish novel generalization bounds for learning algorithms that converge to global minima. We do so by deriving black-box stability results that only depend on the convergence of a learning algorithm and the geometry around the minimizers of the loss function. The results are shown for nonconvex loss functions satisfying the Polyak-Łojasiewicz (PL) and the quadratic growth (QG) conditions. W…

    Submitted 23 October, 2017; originally announced October 2017.

    Comments: 27 pages, 5 figures

  43. arXiv:1706.05699  [pdf, other]

    cs.LG cs.DC

    Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

    Authors: Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

    Abstract: It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notio…

    Submitted 6 January, 2018; v1 submitted 18 June, 2017; originally announced June 2017.

  44. arXiv:1605.09721  [pdf, other]

    stat.ML cs.DC cs.DS cs.LG math.OC

    CYCLADES: Conflict-free Asynchronous Machine Learning

    Authors: Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, Chris Re, Benjamin Recht

    Abstract: We present CYCLADES, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting. CYCLADES is asynchronous during shared model updates, and requires no memory locking mechanisms, similar to HOGWILD!-type algorithms. Unlike HOGWILD!, CYCLADES introduces no conflicts during the parallel execution, and offers a black-box analysis for provable speedups across a…

    Submitted 31 May, 2016; originally announced May 2016.

  45. arXiv:1603.02782  [pdf, other]

    cs.DS stat.ML

    Bipartite Correlation Clustering -- Maximizing Agreements

    Authors: Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, Alexandros G. Dimakis

    Abstract: In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph $G$ with `+' and `-' edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+' edges within clusters plus all `-' edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the…

    Submitted 9 March, 2016; originally announced March 2016.

    Comments: To appear in AISTATS 2016

  46. arXiv:1512.02673  [pdf, other]

    cs.DC cs.IT cs.LG cs.PF

    Speeding Up Distributed Machine Learning Using Codes

    Authors: Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran

    Abstract: Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes, system failures, or communication bottlenecks -- but there has been little interaction cutting across codes, machine learning, and distributed systems. In this…

    Submitted 28 January, 2018; v1 submitted 8 December, 2015; originally announced December 2015.

    Comments: This work is published in IEEE Transactions on Information Theory and was presented in part at the NIPS 2015 Workshop on Machine Learning Systems and at IEEE ISIT 2016

  47. arXiv:1508.00625  [pdf, ps, other]

    stat.ML cs.DS cs.LG math.OC

    Sparse PCA via Bipartite Matchings

    Authors: Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis

    Abstract: We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. These components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but as we show this greedy procedure is suboptimal…

    Submitted 3 August, 2015; originally announced August 2015.

  48. arXiv:1507.06970  [pdf, ps, other]

    stat.ML cs.DC cs.DS cs.LG math.OC

    Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

    Authors: Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan

    Abstract: We introduce and analyze stochastic optimization methods where the input to each gradient update is perturbed by bounded noise. We show that this framework forms the basis of a unified approach to analyze asynchronous implementations of stochastic optimization algorithms. In this framework, asynchronous stochastic optimization algorithms can be thought of as serial methods operating on noisy inputs…

    Submitted 25 March, 2016; v1 submitted 24 July, 2015; originally announced July 2015.

    Comments: 30 pages

    MSC Class: 65K10; 65Y05; 68W10; 68W20

  49. arXiv:1507.05950  [pdf, ps, other]

    stat.ML cs.CC cs.DS cs.LG

    On the Worst-Case Approximability of Sparse PCA

    Authors: Siu On Chan, Dimitris Papailiopoulos, Aviad Rubinstein

    Abstract: It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard to solve exactly on worst-case instances. What is the complexity of solving Sparse PCA approximately? Our contributions include: 1) a simple and efficient algorithm that achieves an $n^{-1/3}$-approximation; 2) NP-hardness of approximation to within $(1-\varepsilon)$, for some small constant $\varepsilon > 0$; 3) SSE…

    Submitted 21 July, 2015; originally announced July 2015.

    Comments: 20 pages

  50. arXiv:1507.05086  [pdf, other]

    cs.DC cs.DS stat.ML

    Parallel Correlation Clustering on Big Graphs

    Authors: Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan

    Abstract: Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, KwikCluster in practice requires a large number of clustering rounds, a potential bottleneck for large gra…

    Submitted 20 July, 2015; v1 submitted 17 July, 2015; originally announced July 2015.