-
QTIP: Quantization with Trellises and Incoherence Processing
Authors:
Albert Tseng,
Qingyao Sun,
David Hou,
Christopher De Sa
Abstract:
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches have converged on using vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better sha**.…
▽ More
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches have converged on using vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better sha**. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Feedback Efficient Online Fine-Tuning of Diffusion Models
Authors:
Masatoshi Uehara,
Yulai Zhao,
Kevin Black,
Ehsan Hajiramezanali,
Gabriele Scalia,
Nathaniel Lee Diamant,
Alex M Tseng,
Sergey Levine,
Tommaso Biancalani
Abstract:
Diffusion models excel at modeling complex data distributions, including those of images, proteins, and small molecules. However, in many cases, our goal is to model parts of the distribution that maximize certain properties: for example, we may want to generate images with high aesthetic quality, or molecules with high bioactivity. It is natural to frame this as a reinforcement learning (RL) prob…
▽ More
Diffusion models excel at modeling complex data distributions, including those of images, proteins, and small molecules. However, in many cases, our goal is to model parts of the distribution that maximize certain properties: for example, we may want to generate images with high aesthetic quality, or molecules with high bioactivity. It is natural to frame this as a reinforcement learning (RL) problem, in which the objective is to fine-tune a diffusion model to maximize a reward function that corresponds to some property. Even with access to online queries of the ground-truth reward function, efficiently discovering high-reward samples can be challenging: they might have a low probability in the initial distribution, and there might be many infeasible samples that do not even have a well-defined reward (e.g., unnatural images or physically impossible molecules). In this work, we propose a novel reinforcement learning procedure that efficiently explores on the manifold of feasible samples. We present a theoretical analysis providing a regret guarantee, as well as empirical validation across three domains: images, biological sequences, and molecules.
△ Less
Submitted 27 February, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control
Authors:
Masatoshi Uehara,
Yulai Zhao,
Kevin Black,
Ehsan Hajiramezanali,
Gabriele Scalia,
Nathaniel Lee Diamant,
Alex M Tseng,
Tommaso Biancalani,
Sergey Levine
Abstract:
Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal…
▽ More
Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth "genuine" reward, as is the case in many practical applications. These challenges, collectively termed "reward collapse," pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models.
△ Less
Submitted 28 February, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
Authors:
Albert Tseng,
Jerry Chee,
Qingyao Sun,
Volodymyr Kuleshov,
Christopher De Sa
Abstract:
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Had…
▽ More
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
△ Less
Submitted 4 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Complex Preferences for Different Convergent Priors in Discrete Graph Diffusion
Authors:
Alex M. Tseng,
Nathaniel Diamant,
Tommaso Biancalani,
Gabriele Scalia
Abstract:
Diffusion models have achieved state-of-the-art performance in generating many different kinds of data, including images, text, and videos. Despite their success, there has been limited research on how the underlying diffusion process and the final convergent prior can affect generative performance; this research has also been limited to continuous data types and a score-based diffusion framework.…
▽ More
Diffusion models have achieved state-of-the-art performance in generating many different kinds of data, including images, text, and videos. Despite their success, there has been limited research on how the underlying diffusion process and the final convergent prior can affect generative performance; this research has also been limited to continuous data types and a score-based diffusion framework. To fill this gap, we explore how different discrete diffusion kernels (which converge to different prior distributions) affect the performance of diffusion models for graphs. To this end, we developed a novel formulation of a family of discrete diffusion kernels which are easily adjustable to converge to different Bernoulli priors, and we study the effect of these different kernels on generative performance. We show that the quality of generated graphs is sensitive to the prior used, and that the optimal choice cannot be explained by obvious statistics or metrics, which challenges the intuitions which previous works have suggested.
△ Less
Submitted 21 June, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
Coneheads: Hierarchy Aware Attention
Authors:
Albert Tseng,
Tao Yu,
Toni J. B. Liu,
Christopher De Sa
Abstract:
Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. T…
▽ More
Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a hierarchy aware similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.
△ Less
Submitted 3 December, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
RINGER: Rapid Conformer Generation for Macrocycles with Sequence-Conditioned Internal Coordinate Diffusion
Authors:
Colin A. Grambow,
Hayley Weir,
Nathaniel L. Diamant,
Alex M. Tseng,
Tommaso Biancalani,
Gabriele Scalia,
Kangway V. Chuang
Abstract:
Macrocyclic peptides are an emerging therapeutic modality, yet computational approaches for accurately sampling their diverse 3D ensembles remain challenging due to their conformational diversity and geometric constraints. Here, we introduce RINGER, a diffusion-based transformer model for sequence-conditioned generation of macrocycle structures based on internal coordinates. RINGER provides fast b…
▽ More
Macrocyclic peptides are an emerging therapeutic modality, yet computational approaches for accurately sampling their diverse 3D ensembles remain challenging due to their conformational diversity and geometric constraints. Here, we introduce RINGER, a diffusion-based transformer model for sequence-conditioned generation of macrocycle structures based on internal coordinates. RINGER provides fast backbone sampling while respecting key structural invariances of cyclic peptides. Through extensive benchmarking and analysis against gold-standard conformer ensembles of cyclic peptides generated with metadynamics, we demonstrate how RINGER generates both high-quality and diverse geometries at a fraction of the computational cost. Our work lays the foundation for improved sampling of cyclic geometries and the development of geometric learning methods for peptides.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
Shadow Cones: A Generalized Framework for Partial Order Embeddings
Authors:
Tao Yu,
Toni J. B. Liu,
Albert Tseng,
Christopher De Sa
Abstract:
Hyperbolic space has proven to be well-suited for capturing hierarchical relations in data, such as trees and directed acyclic graphs. Prior work introduced the concept of entailment cones, which uses partial orders defined by nested cones in the PoincarĂ© ball to model hierarchies. Here, we introduce the ``shadow cones" framework, a physics-inspired entailment cone construction. Specifically, we m…
▽ More
Hyperbolic space has proven to be well-suited for capturing hierarchical relations in data, such as trees and directed acyclic graphs. Prior work introduced the concept of entailment cones, which uses partial orders defined by nested cones in the Poincaré ball to model hierarchies. Here, we introduce the ``shadow cones" framework, a physics-inspired entailment cone construction. Specifically, we model partial orders as subset relations between shadows formed by a light source and opaque objects in hyperbolic space. The shadow cones framework generalizes entailment cones to a broad class of formulations and hyperbolic space models beyond the Poincaré ball. This results in clear advantages over existing constructions: for example, shadow cones possess better optimization properties over constructions limited to the Poincaré ball. Our experiments on datasets of various sizes and hierarchical structures show that shadow cones consistently and significantly outperform existing entailment cone constructions. These results indicate that shadow cones are an effective way to model partial orders in hyperbolic space, offering physically intuitive and novel insights about the nature of such structures.
△ Less
Submitted 8 April, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
GraphGUIDE: interpretable and controllable conditional graph generation with discrete Bernoulli diffusion
Authors:
Alex M. Tseng,
Nathaniel Diamant,
Tommaso Biancalani,
Gabriele Scalia
Abstract:
Diffusion models achieve state-of-the-art performance in generating realistic objects and have been successfully applied to images, text, and videos. Recent work has shown that diffusion can also be defined on graphs, including graph representations of drug-like molecules. Unfortunately, it remains difficult to perform conditional generation on graphs in a way which is interpretable and controllab…
▽ More
Diffusion models achieve state-of-the-art performance in generating realistic objects and have been successfully applied to images, text, and videos. Recent work has shown that diffusion can also be defined on graphs, including graph representations of drug-like molecules. Unfortunately, it remains difficult to perform conditional generation on graphs in a way which is interpretable and controllable. In this work, we propose GraphGUIDE, a novel framework for graph generation using diffusion models, where edges in the graph are flipped or set at each discrete time step. We demonstrate GraphGUIDE on several graph datasets, and show that it enables full control over the conditional generation of arbitrary structural properties without relying on predefined labels. Our framework for graph diffusion can have a large impact on the interpretable conditional generation of graphs, including the generation of drug-like molecules with desired properties in a way which is informed by experimental evidence.
△ Less
Submitted 7 February, 2023;
originally announced February 2023.
-
Improving Graph Generation by Restricting Graph Bandwidth
Authors:
Nathaniel Diamant,
Alex M. Tseng,
Kangway V. Chuang,
Tommaso Biancalani,
Gabriele Scalia
Abstract:
Deep graph generative modeling has proven capable of learning the distribution of complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that sign…
▽ More
Deep graph generative modeling has proven capable of learning the distribution of complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation is made available.
△ Less
Submitted 30 May, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Hierarchically branched diffusion models leverage dataset structure for class-conditional generation
Authors:
Alex M. Tseng,
Max Shen,
Tommaso Biancalani,
Gabriele Scalia
Abstract:
Class-labeled datasets, particularly those common in scientific domains, are rife with internal structure, yet current class-conditional diffusion models ignore these relationships and implicitly diffuse on all classes in a flat fashion. To leverage this structure, we propose hierarchically branched diffusion models as a novel framework for class-conditional generation. Branched diffusion models r…
▽ More
Class-labeled datasets, particularly those common in scientific domains, are rife with internal structure, yet current class-conditional diffusion models ignore these relationships and implicitly diffuse on all classes in a flat fashion. To leverage this structure, we propose hierarchically branched diffusion models as a novel framework for class-conditional generation. Branched diffusion models rely on the same diffusion process as traditional models, but learn reverse diffusion separately for each branch of a hierarchy. We highlight several advantages of branched diffusion models over the current state-of-the-art methods for class-conditional diffusion, including extension to novel classes in a continual-learning setting, a more sophisticated form of analogy-based conditional generation (i.e. transmutation), and a novel interpretability into the generation process. We extensively evaluate branched diffusion models on several benchmark and large real-world scientific datasets spanning many data modalities.
△ Less
Submitted 1 February, 2024; v1 submitted 21 December, 2022;
originally announced December 2022.
-
Conditional Diffusion with Less Explicit Guidance via Model Predictive Control
Authors:
Max W. Shen,
Ehsan Hajiramezanali,
Gabriele Scalia,
Alex Tseng,
Nathaniel Diamant,
Tommaso Biancalani,
Andreas Loukas
Abstract:
How much explicit guidance is necessary for conditional diffusion? We consider the problem of conditional sampling using an unconditional diffusion model and limited explicit guidance (e.g., a noised classifier, or a conditional diffusion model) that is restricted to a small number of time steps. We explore a model predictive control (MPC)-like approach to approximate guidance by simulating uncond…
▽ More
How much explicit guidance is necessary for conditional diffusion? We consider the problem of conditional sampling using an unconditional diffusion model and limited explicit guidance (e.g., a noised classifier, or a conditional diffusion model) that is restricted to a small number of time steps. We explore a model predictive control (MPC)-like approach to approximate guidance by simulating unconditional diffusion forward, and backpropagating explicit guidance feedback. MPC-approximated guides have high cosine similarity to real guides, even over large simulation distances. Adding MPC steps improves generative quality when explicit guidance is limited to five time steps.
△ Less
Submitted 21 October, 2022;
originally announced October 2022.
-
Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis
Authors:
Albert Tseng,
Jennifer J. Sun,
Yisong Yue
Abstract:
Obtaining annotations for large training sets is expensive, especially in settings where domain knowledge is required, such as behavior analysis. Weak supervision has been studied to reduce annotation costs by using weak labels from task-specific labeling functions (LFs) to augment ground truth labels. However, domain experts still need to hand-craft different LFs for different tasks, limiting sca…
▽ More
Obtaining annotations for large training sets is expensive, especially in settings where domain knowledge is required, such as behavior analysis. Weak supervision has been studied to reduce annotation costs by using weak labels from task-specific labeling functions (LFs) to augment ground truth labels. However, domain experts still need to hand-craft different LFs for different tasks, limiting scalability. To reduce expert effort, we present AutoSWAP: a framework for automatically synthesizing data-efficient task-level LFs. The key to our approach is to efficiently represent expert knowledge in a reusable domain-specific language and more general domain-level LFs, with which we use state-of-the-art program synthesis techniques and a small labeled dataset to generate task-level LFs. Additionally, we propose a novel structural diversity cost that allows for efficient synthesis of diverse sets of LFs, further improving AutoSWAP's performance. We evaluate AutoSWAP in three behavior analysis domains and demonstrate that AutoSWAP outperforms existing approaches using only a fraction of the data. Our results suggest that AutoSWAP is an effective way to automatically generate LFs that can significantly reduce expert effort for behavior analysis.
△ Less
Submitted 11 May, 2022; v1 submitted 30 November, 2021;
originally announced November 2021.
-
An Adversarial Approach for Explaining the Predictions of Deep Neural Networks
Authors:
Arash Rahnama,
Andrew Tseng
Abstract:
Machine learning models have been successfully applied to a wide range of applications including computer vision, natural language processing, and speech recognition. A successful implementation of these models however, usually relies on deep neural networks (DNNs) which are treated as opaque black-box systems due to their incomprehensible complexity and intricate internal mechanism. In this work,…
▽ More
Machine learning models have been successfully applied to a wide range of applications including computer vision, natural language processing, and speech recognition. A successful implementation of these models however, usually relies on deep neural networks (DNNs) which are treated as opaque black-box systems due to their incomprehensible complexity and intricate internal mechanism. In this work, we present a novel algorithm for explaining the predictions of a DNN using adversarial machine learning. Our approach identifies the relative importance of input features in relation to the predictions based on the behavior of an adversarial attack on the DNN. Our algorithm has the advantage of being fast, consistent, and easy to implement and interpret. We present our detailed analysis that demonstrates how the behavior of an adversarial attack, given a DNN and a task, stays consistent for any input test data point proving the generality of our approach. Our analysis enables us to produce consistent and efficient explanations. We illustrate the effectiveness of our approach by conducting experiments using a variety of DNNs, tasks, and datasets. Finally, we compare our work with other well-known techniques in the current literature.
△ Less
Submitted 28 September, 2020; v1 submitted 20 May, 2020;
originally announced May 2020.
-
Learning Calibratable Policies using Programmatic Style-Consistency
Authors:
Eric Zhan,
Albert Tseng,
Yisong Yue,
Adith Swaminathan,
Matthew Hausknecht
Abstract:
We study the problem of controllable generation of long-term sequential behaviors, where the goal is to calibrate to multiple behavior styles simultaneously. In contrast to the well-studied areas of controllable generation of images, text, and speech, there are two questions that pose significant challenges when generating long-term behaviors: how should we specify the factors of variation to cont…
▽ More
We study the problem of controllable generation of long-term sequential behaviors, where the goal is to calibrate to multiple behavior styles simultaneously. In contrast to the well-studied areas of controllable generation of images, text, and speech, there are two questions that pose significant challenges when generating long-term behaviors: how should we specify the factors of variation to control, and how can we ensure that the generated behavior faithfully demonstrates combinatorially many styles? We leverage programmatic labeling functions to specify controllable styles, and derive a formal notion of style-consistency as a learning objective, which can then be solved using conventional policy learning approaches. We evaluate our framework using demonstrations from professional basketball players and agents in the MuJoCo physics environment, and show that existing approaches that do not explicitly enforce style-consistency fail to generate diverse behaviors whereas our learned policies can be calibrated for up to 1024 distinct style combinations.
△ Less
Submitted 16 July, 2020; v1 submitted 2 October, 2019;
originally announced October 2019.
-
Deep Learning for Malicious Flow Detection
Authors:
Yun-Chun Chen,
Yu-Jhe Li,
Aragorn Tseng,
Tsungnan Lin
Abstract:
Cyber security has grown up to be a hot issue in recent years. How to identify potential malware becomes a challenging task. To tackle this challenge, we adopt deep learning approaches and perform flow detection on real data. However, real data often encounters an issue of imbalanced data distribution which will lead to a gradient dilution issue. When training a neural network, this problem will n…
▽ More
Cyber security has grown up to be a hot issue in recent years. How to identify potential malware becomes a challenging task. To tackle this challenge, we adopt deep learning approaches and perform flow detection on real data. However, real data often encounters an issue of imbalanced data distribution which will lead to a gradient dilution issue. When training a neural network, this problem will not only result in a bias toward the majority class but show the inability to learn from the minority classes. In this paper, we propose an end-to-end trainable Tree-Shaped Deep Neural Network (TSDNN) which classifies the data in a layer-wise manner. To better learn from the minority classes, we propose a Quantity Dependent Backpropagation (QDBP) algorithm which incorporates the knowledge of the disparity between classes. We evaluate our method on an imbalanced data set. Experimental result demonstrates that our approach outperforms the state-of-the-art methods and justifies that the proposed method is able to overcome the difficulty of imbalanced learning. We also conduct a partial flow experiment which shows the feasibility of real-time detection and a zero-shot learning experiment which justifies the generalization capability of deep learning in cyber security.
△ Less
Submitted 9 February, 2018;
originally announced February 2018.
-
Current Streamline Flow on Current-induced Effects in Highly Asymmetric Molecular Junctions
Authors:
Bailey C. Hsu,
Allen Tseng,
Yu-Chang Chen
Abstract:
From first-principles approaches, we illustrate that the current-induced forces and the selection rule for inelastic effects are highly relevant to the current density in an asymmetric molecular junction. The curved flow of current streamline around the asymmetric molecule may induce a net torque, which tends to rotate the benzene molecule, similar to the way a stream of water rotates a waterwheel…
▽ More
From first-principles approaches, we illustrate that the current-induced forces and the selection rule for inelastic effects are highly relevant to the current density in an asymmetric molecular junction. The curved flow of current streamline around the asymmetric molecule may induce a net torque, which tends to rotate the benzene molecule, similar to the way a stream of water rotates a waterwheel. Thus, the Pt/benzene junction offers a practical system in the exploration of the possibility of atomic-scale motors. We also enumerate examples to show that the use of selection rule can lead to misjudgement of the importance of normal modes in the inelastic profiles when the detailed information about the current density is not considered.
△ Less
Submitted 9 June, 2011;
originally announced June 2011.