Search | arXiv e-print repository

An Empirical Study of Mamba-based Language Models

Authors: Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro

Abstract: Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time.
To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.03288 [pdf, other]

Embarrassingly Parallel GFlowNets

Authors: Tiago da Silva, Luiz Max Carvalho, Amauri Souza, Samuel Kaski, Diego Mesquita

Abstract: GFlowNets are a promising alternative to MCMC sampling for discrete compositional random variables. Training GFlowNets requires repeated evaluations of the unnormalized target distribution or reward function. However, for large-scale posterior sampling, this may be prohibitive since it incurs traversing the data several times. Moreover, if the data are distributed across clients, employing standard GFlowNets leads to intensive client-server communication. To alleviate both these issues, we propose embarrassingly parallel GFlowNet (EP-GFlowNet). EP-GFlowNet is a provably correct divide-and-conquer method to sample from product distributions of the form $R(\cdot) \propto R_1(\cdot) ... R_N(\cdot)$ -- e.g., in parallel or federated Bayes, where each $R_n$ is a local posterior defined on a data partition. First, in parallel, we train a local GFlowNet targeting each $R_n$ and send the resulting models to the server. Then, the server learns a global GFlowNet by enforcing our newly proposed \emph{aggregating balance} condition, requiring a single communication step. Importantly, EP-GFlowNets can also be applied to multi-objective optimization and model reuse. Our experiments illustrate the effectiveness of EP-GFlowNets on many tasks, including parallel Bayesian phylogenetics, multi-objective multiset, sequence generation, and federated Bayesian structure learning.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted to ICML 2024
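
Illustration (not from the paper): the product form $R(\cdot) \propto R_1(\cdot) ... R_N(\cdot)$ in the abstract above can be made concrete on a tiny discrete support. The sketch below only computes the normalized product target by brute-force enumeration; it does not implement GFlowNet training or the aggregating balance condition. The function name `product_target`, the dictionary representation, and the toy reward values are all assumptions made for illustration.

```python
from math import prod

def product_target(local_rewards):
    """Normalize the pointwise product of local unnormalized rewards.

    local_rewards: list of dicts, one per client, each mapping an object x
    to its local unnormalized reward R_n(x). All dicts are assumed to share
    the same keys (hypothetical toy setup).
    """
    support = list(local_rewards[0].keys())
    # R(x) is proportional to R_1(x) * ... * R_N(x)
    unnormalized = {x: prod(r[x] for r in local_rewards) for x in support}
    z = sum(unnormalized.values())
    return {x: value / z for x, value in unnormalized.items()}

# Two hypothetical clients with rewards over three objects.
clients = [
    {"a": 1.0, "b": 2.0, "c": 1.0},
    {"a": 3.0, "b": 1.0, "c": 1.0},
]
target = product_target(clients)
```

Enumeration like this is only feasible for very small supports, which is presumably why the paper trains a global GFlowNet to sample from the product instead.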

arXiv:2405.21060 [pdf, other]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Authors: Tri Dao, Albert Gu

Abstract: While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.

Submitted 31 May, 2024; originally announced May 2024.

Comments: ICML 2024

arXiv:2405.20670 [pdf]

Twitter should now be referred to as X: How academics, journals and publishers need to make the nomenclatural transition

Authors: Jaime A. Teixeira da Silva, Serhii Nazarovets

Abstract: Here, we note how academics, journals and publishers should no longer refer to the social media platform Twitter as such, rather as X. Relying on Google Scholar, we found 16 examples of papers published in the last months of 2023 - essentially during the transition period between Twitter and X - that used Twitter and X, but in different ways. Unlike that transition period in which the binary Twitter/X could have been used in academic papers, we suggest that papers should no longer refer to Twitter as Twitter, but only as X, except for historical studies about that social media platform, because such use would be factually incorrect.

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.06870 [pdf, other]

Noise-Tolerant Codebooks for Semi-Quantitative Group Testing: Application to Spatial Genomics

Authors: Kok Hao Chen, Duc Tu Dao, Han Mao Kiah, Van Long Phuoc Pham, Eitan Yaakobi

Abstract: Motivated by applications in spatial genomics, we revisit group testing (Dorfman~1943) and propose the class of $λ$-{\sf ADD}-codes, studying such codes with certain distance $d$ and codelength $n$. When $d$ is constant, we provide explicit code constructions with rates close to $1/2$. When $d$ is proportional to $n$, we provide a GV-type lower bound whose rates are efficiently computable. Upper bounds for such codes are also studied.

Submitted 10 May, 2024; originally announced May 2024.

Comments: To appear in ISIT 2024 Proceedings

arXiv:2403.18101 [pdf, other]

Towards Explainable Clustering: A Constrained Declarative based Approach

Authors: Mathieu Guilbert, Christel Vrain, Thi-Bich-Hanh Dao

Abstract: The domain of explainable AI is of interest in all Machine Learning fields, and it is all the more important in clustering, an unsupervised task whose result must be validated by a domain expert. We aim at finding a clustering that has high quality in terms of classic clustering criteria and that is explainable, and we argue that these two dimensions must be considered when building the clustering. We consider that a good global explanation of a clustering should give the characteristics of each cluster taking into account their abilities to describe its objects (coverage) while distinguishing it from the other clusters (discrimination). Furthermore, we aim at leveraging expert knowledge, at different levels, on the structure of the expected clustering or on its explanations. In our framework an explanation of a cluster is a set of patterns, and we propose a novel interpretable constrained clustering method called ECS for declarative clustering with Explainability-driven Cluster Selection that integrates structural or domain expert knowledge expressed by means of constraints. It is based on the notion of coverage and discrimination that are formalized at different levels (cluster / clustering), each allowing for exceptions through parameterized thresholds. Our method relies on four steps: generation of a set of partitions, computation of frequent patterns for each cluster, pruning clusters that violate some constraints, and selection of clusters and associated patterns to build an interpretable clustering.
This last step is combinatorial and we have developed a Constraint-Programming (CP) model to solve it. The method can integrate prior knowledge in the form of user constraints, either before or within the CP model.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.14709 [pdf, other]

ClimateQ&A: Bridging the gap between climate scientists and the general public

Authors: Natalia De La Calzada, Théo Alves Da Costa, Annabelle Blangero, Nicolas Chesneau

Abstract: This research paper investigates public views on climate change and biodiversity loss by analyzing questions asked to the ClimateQ&A platform. ClimateQ&A is a conversational agent that uses LLMs to respond to queries based on over 14,000 pages of scientific literature from the IPCC and IPBES reports. Launched online in March 2023, the tool has gathered over 30,000 questions, mainly from a French audience. Its chatbot interface allows for the free formulation of questions related to nature*. While its main goal is to make nature science more accessible, it also allows for the collection and analysis of questions and their themes. Unlike traditional surveys involving closed questions, this novel method offers a fresh perspective on individual interrogations about nature. Running NLP clustering algorithms on a sample of 3,425 questions, we find that a significant 25.8% inquire about how climate change and biodiversity loss will affect them personally (e.g., where they live or vacation, their consumption habits) and the specific impacts of their actions on nature (e.g., transportation or food choices). This suggests that traditional methods of surveying may not identify all existing knowledge gaps, and that relying solely on IPCC and IPBES reports may not address all individual inquiries about climate and biodiversity, potentially affecting public understanding and action on these issues. *we use 'nature' as an umbrella term for 'climate change' and 'biodiversity loss'

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2024

arXiv:2403.10304 [pdf, other]

KIF: A Framework for Virtual Integration of Heterogeneous Knowledge Bases using Wikidata

Authors: Guilherme Lima, Marcelo Machado, Elton Soares, Sandro R. Fiorini, Raphael Thiago, Leonardo G. Azevedo, Viviane T. da Silva, Renato Cerqueira

Abstract: We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping track of the context and provenance of their statements. The result is a virtual knowledge base which behaves like an "extended Wikidata" and which can be queried either through an efficient filter interface or using SPARQL. We present the design and implementation of KIF, discuss how we have used it to solve a real integration problem in the domain of chemistry (involving Wikidata, PubChem, and IBM CIRCA), and present experimental results on the performance and overhead of KIF.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.03234 [pdf, other]

Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Authors: Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov

Abstract: Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance.

Submitted 5 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: ICML 2024; Code to reproduce our experiments is available at https://github.com/kuleshov-group/caduceus

arXiv:2402.19173 [pdf, other]

StarCoder 2 and The Stack v2: The Next Generation

Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, et al. (41 additional authors not shown)

Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.14712 [pdf, other]

Gilbert-Varshamov Bound for Codes in $L_1$ Metric using Multivariate Analytic Combinatorics

Authors: Keshav Goyal, Duc Tu Dao, Mladen Kovačević, Han Mao Kiah

Abstract: Analytic combinatorics in several variables refers to a suite of tools that provide sharp asymptotic estimates for certain combinatorial quantities. In this paper, we apply these tools to determine the Gilbert--Varshamov lower bound on the rate of optimal codes in $L_1$ metric. Several different code spaces are analyzed, including the simplex and the hypercube in $\mathbb{Z^n}$, all of which are inspired by concrete data storage and transmission models such as the sticky insertion channel, the permutation channel, the adjacent transposition (bit-shift) channel, the multilevel flash memory channel, etc.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 33 pages, 3 figures, submitted to IEEE Transactions on Information Theory

arXiv:2402.10193 [pdf, other]

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai

Abstract: Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.

Submitted 27 February, 2024; v1 submitted 15 February, 2024; originally announced February 2024.
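
Illustration (not from the paper): the decomposition described in the BitDelta abstract, fine-tuned weights as pre-trained weights plus a delta quantized to 1 bit, might be sketched as follows. The abstract does not say how the 1-bit values are scaled, so the single per-tensor scale set to the mean absolute delta is an assumption here, as are the function names and the toy weights. Any further calibration the authors apply is not shown.

```python
def compress_delta(base, fine_tuned):
    """Reduce the fine-tuning delta to one sign bit per weight plus one scale.

    base, fine_tuned: flat lists of floats of equal length (toy stand-ins
    for a weight tensor).
    """
    delta = [f - b for f, b in zip(fine_tuned, base)]
    # Assumption: a single scale per tensor, the mean absolute delta.
    scale = sum(abs(d) for d in delta) / len(delta)
    # One bit per weight: +1 for non-negative deltas, -1 otherwise.
    signs = [1 if d >= 0 else -1 for d in delta]
    return signs, scale

def reconstruct(base, signs, scale):
    """Approximate the fine-tuned weights from the base and the 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]

# Hypothetical toy weights.
base = [0.50, -1.00, 0.25, 2.00]
fine_tuned = [0.52, -1.03, 0.26, 1.96]
signs, scale = compress_delta(base, fine_tuned)
approx = reconstruct(base, signs, scale)
```

In a multi-tenant setting, many such `(signs, scale)` pairs could share one copy of `base`, which is the storage saving the abstract points to.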

arXiv:2401.17824 [pdf, other]

A Survey of Pre-trained Language Models for Processing Scientific Text

Authors: Xanh Ho, Anh Khoa Duong Nguyen, An Tuan Dao, Junfeng Jiang, Yuki Chida, Kaito Sugimoto, Huy Quoc To, Florian Boudin, Akiko Aizawa

Abstract: The number of Language Models (LMs) dedicated to processing scientific text is on the rise. Keeping pace with the rapid growth of scientific LMs (SciLMs) has become a daunting task for researchers. To date, no comprehensive surveys on SciLMs have been undertaken, leaving this issue unaddressed. Given the constant stream of new SciLMs, the state of the art and how these models compare to each other remain largely unknown. This work fills that gap and provides a comprehensive review of SciLMs, including an extensive analysis of their effectiveness across different domains, tasks and datasets, and a discussion on the challenges that lie ahead.

Submitted 31 January, 2024; originally announced January 2024.

Comments: Resources are available at https://github.com/Alab-NII/Awesome-SciLM

arXiv:2401.10774 [pdf, other]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Authors: Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

Abstract: Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities.
Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.

Submitted 14 June, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: The code for this implementation is available at https://github.com/FasterDecoding/Medusa

arXiv:2401.09252 [pdf, other]

doi 10.1145/3519021

3D Scene Geometry Estimation from 360$^\circ$ Imagery: A Survey

Authors: Thiago Lopes Trugillo da Silveira, Paulo Gamarra Lessa Pinto, Jeffri Erwin Murrugarra Llerena, Claudio Rosito Jung

Abstract: This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured under the omnidirectional optics. We first revisit the basic concepts of the spherical camera model, and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360$^\circ$, spherical or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. The classical stereo matching is then revised on the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated for multiple view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends.

Submitted 17 January, 2024; originally announced January 2024.

Comments: Published in ACM Computing Surveys

Journal ref: ACM Comput. Surv. 55, 4, Article 68, 2023

arXiv:2312.17205 [pdf, other]

EFHQ: Multi-purpose ExtremePose-Face-HQ dataset

Authors: Trung Tuan Dao, Duc Hong Vu, Cuong Pham, Anh Tran

Abstract: The existing facial datasets, while having plentiful images at near frontal views, lack images with extreme head poses, leading to the downgraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ), which includes a maximum of 450k high-quality images of faces at extreme poses. To produce such a massive dataset, we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various facial-related tasks, such as facial synthesis with 2D/3D-aware GAN, diffusion-based text-to-image face generation, and face reenactment. Specifically, training with EFHQ helps models generalize well across diverse poses, significantly improving performance in scenarios involving extreme views, confirmed by extensive experiments. Additionally, we utilize EFHQ to define a challenging cross-view face verification benchmark, in which the performance of SOTA face recognition models drops 5-37% compared to frontal-to-frontal scenarios, aiming to stimulate studies on face recognition under severe pose conditions in the wild.

Submitted 11 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: Project Page: https://bomcon123456.github.io/efhq/

arXiv:2312.16626 [pdf, other]

Sorting of Smartphone Components for Recycling Through Convolutional Neural Networks

Authors: Álvaro G. Becker, Marcelo P. Cenci, Thiago L. T. da Silveira, Hugo M. Veit

Abstract: The recycling of waste electrical and electronic equipment is an essential tool in allowing for a circular economy, presenting the potential for significant environmental and economic gain. However, traditional material separation techniques, based on physical and chemical processes, require substantial investment and do not apply to all cases. In this work, we investigate using an image classification neural network as a potential means to control an automated material separation process in treating smartphone waste, acting as a more efficient, less costly, and more widely applicable alternative to existing tools. We produced a dataset with 1,127 images of pyrolyzed smartphone components, which was then used to train and assess a VGG-16 image classification model. The model achieved 83.33% accuracy, lending credence to the viability of using such a neural network in material separation.

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2312.03046 [pdf, other]

Diversified in-domain synthesis with efficient fine-tuning for few-shot classification

Authors: Victor G. Turrisi da Costa, Nicola Dall'Asen, Yiming Wang, Nicu Sebe, Elisa Ricci

Abstract: Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. A recent research direction for improving few-shot classifiers involves augmenting the labelled samples with synthetic images created by state-of-the-art text-to-image generation models. Following this trend, we propose Diversified In-domain Synthesis with Efficient Fine-tuning (DISEF), a novel approach which addresses the generalization challenge in few-shot learning using synthetic data. DISEF consists of two main components. First, we propose a novel text-to-image augmentation pipeline that, by leveraging the real samples and their rich semantics coming from an advanced captioning model, promotes in-domain sample diversity for better generalization. Second, we emphasize the importance of effective model fine-tuning in few-shot recognition, proposing to use Low-Rank Adaptation (LoRA) for joint adaptation of the text and image encoders in a Vision Language Model. We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification. Code is available at https://github.com/vturrisi/disef.

Submitted 6 December, 2023; v1 submitted 5 December, 2023; originally announced December 2023.

Comments: 14 pages, 6 figures, 8 tables

arXiv:2312.00752 [pdf, other]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Authors: Albert Gu, Tri Dao

Abstract: Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

Submitted 31 May, 2024; v1 submitted 1 December, 2023; originally announced December 2023.
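
The "SSM parameters as functions of the input" idea in the Mamba abstract above can be illustrated with a toy recurrence. The following is a minimal single-channel sketch, not the authors' implementation: the scalar parameters `a`, `b`, `c`, `w_delta` and the choice to make only the step size input-dependent are assumptions made for illustration (the abstract does not say which parameters are selective or how they are discretized).

```python
import math


def softplus(z):
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=1.0, b=1.0, c=1.0, w_delta=1.0):
    """Toy scalar selective state-space recurrence (illustrative only).

    h_t = exp(-delta_t * a) * h_{t-1} + delta_t * b * x_t
    y_t = c * h_t
    where delta_t = softplus(w_delta * x_t) depends on the current input,
    so each token decides how much past state is kept or forgotten.
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence: linear in its length
        delta = softplus(w_delta * x)   # input-dependent step size
        decay = math.exp(-delta * a)    # how much of the old state survives
        h = decay * h + delta * b * x   # update the hidden state
        ys.append(c * h)
    return ys


ys = selective_scan([1.0, 0.0, 0.0])
```

With a zero input the step size is softplus(0) = ln 2, so under these hypothetical parameters the state is simply halved at that step, which shows the input-controlled forgetting in its simplest form.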

arXiv:2311.05281 [pdf, other]

Finding Software Vulnerabilities in Open-Source C Projects via Bounded Model Checking

Authors: Janislley Oliveira de Sousa, Bruno Carvalho de Farias, Thales Araujo da Silva, Eddie Batista de Lima Filho, Lucas C. Cordeiro

Abstract: Computer-based systems have solved several domain problems, including industrial, military, education, and wearable. Nevertheless, such arrangements need high-quality software to guarantee security and safety as both are mandatory for modern software products. We advocate that bounded model-checking techniques can efficiently detect vulnerabilities in general software systems. However, such an approach struggles to scale up and verify extensive code bases. Consequently, we have developed and evaluated a methodology to verify large software systems using a state-of-the-art bounded model checker. In particular, we pre-process input source-code files and guide the respective model checker to explore them systematically. Moreover, the proposed scheme includes a function-wise prioritization strategy, which readily provides results for code entities according to a scale of importance. Experimental results using a real implementation of the proposed methodology show that it can efficiently verify large software systems. Besides, it presented low peak memory allocation when executed. We have evaluated our approach by verifying twelve popular open-source C projects, where we have found real software vulnerabilities that their developers confirmed.

Submitted 9 November, 2023; originally announced November 2023.

Comments: 27 pages, submitted to STTT journal

arXiv:2310.18324 [pdf, ps, other]

"A Nova Eletricidade: Aplicações, Riscos e Tendências da IA Moderna -- "The New Electricity": Applications, Risks, and Trends in Current AI

Authors: Ana L. C. Bazzan, Anderson R. Tavares, André G. Pereira, Cláudio R. Jung, Jacob Scharcanski, Joel Luis Carbonera, Luís C. Lamb, Mariana Recamonde-Mendoza, Thiago L. T. da Silveira, Viviane Moreira

Abstract: The thought-provoking analogy between AI and electricity, made by computer scientist and entrepreneur Andrew Ng, summarizes the deep transformation that recent advances in Artificial Intelligence (AI) have triggered in the world. This chapter presents an overview of the ever-evolving landscape of AI, written in Portuguese. With no intent to exhaust the subject, we explore the AI applications that are redefining sectors of the economy, impacting society and humanity. We analyze the risks that may come along with rapid technological progress and future trends in AI, an area that is on the path to becoming a general-purpose technology, just like electricity, which revolutionized society in the 19th and 20th centuries.

Submitted 8 October, 2023; originally announced October 2023.

Comments: In Portuguese

MSC Class: 68 ACM Class: I.2

arXiv:2310.17157 [pdf, other]

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen

Abstract: Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, which are small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.

Submitted 26 October, 2023; originally announced October 2023.

Journal ref: Proceedings of the 40th International Conference on Machine Learning, 2023, 919

arXiv:2309.12032 [pdf, other]

Human-in-the-Loop Causal Discovery under Latent Confounding using Ancestral GFlowNets

Authors: Tiago da Silva, Eliezer Silva, Adèle Ribeiro, António Góis, Dominik Heider, Samuel Kaski, Diego Mesquita

Abstract: Structure learning is the crux of causal inference. Notably, causal discovery (CD) algorithms are brittle when data is scarce, possibly inferring imprecise causal relations that contradict expert knowledge -- especially when considering latent confounders. To aggravate the issue, most CD methods do not provide uncertainty estimates, making it hard for users to interpret results and improve the inference process. Surprisingly, while CD is a human-centered affair, no works have focused on building methods that both 1) output uncertainty estimates that can be verified by experts and 2) interact with those experts to iteratively refine CD. To solve these issues, we start by proposing to sample (causal) ancestral graphs proportionally to a belief distribution based on a score function, such as the Bayesian information criterion (BIC), using generative flow networks. Then, we leverage the diversity in candidate graphs and introduce an optimal experimental design to iteratively probe the expert about the relations among variables, effectively reducing the uncertainty of our belief over ancestral graphs. Finally, we update our samples to incorporate human feedback via importance sampling. Importantly, our method does not require causal sufficiency (i.e., unobserved confounders may exist). Experiments with synthetic observational data show that our method can accurately sample from distributions over ancestral graphs and that we can greatly improve inference quality with human aid.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2308.11763 [pdf, other]

Efficient set-theoretic algorithms for computing high-order Forman-Ricci curvature on abstract simplicial complexes

Authors: Danillo Barros de Souza, Jonatas T. S. da Cunha, Fernando A. N. Santos, Jürgen Jost, Serafim Rodrigues

Abstract: Forman-Ricci curvature (FRC) is a potent and powerful tool for analysing empirical networks, as the distribution of the curvature values can identify structural information that is not readily detected by other geometrical methods. Crucially, FRC captures higher-order structural information of clique complexes of a graph or Vietoris-Rips complexes, which is not readily accessible to alternative methods. However, existing FRC platforms are prohibitively computationally expensive. Therefore, herein we develop an efficient set-theoretic formulation for computing such high-order FRC in simplicial complexes. Significantly, our set theory representation reveals previous computational bottlenecks and also accelerates the computation of FRC. Finally, we provide pseudo-code, a software implementation coined FastForman, as well as a benchmark comparison with alternative implementations. We envisage that FastForman will be used in Topological and Geometrical Data analysis for high-dimensional complex data sets. Moreover, our development paves the way for future generalisations towards efficient computations of FRC on cell complexes.

Submitted 9 May, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

arXiv:2307.08691 [pdf, other]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Authors: Tri Dao

Abstract: Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4$\times$ compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2$\times$ speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).

Submitted 17 July, 2023; originally announced July 2023.
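
The utilization figure quoted in the FlashAttention-2 abstract above can be checked with one division. This small sketch assumes a peak of 312 TFLOPs/s, which is the published dense FP16/BF16 tensor-core peak of the A100; the abstract itself does not state which peak it uses, so treat that constant as an assumption.

```python
# Assumption: A100 dense FP16/BF16 tensor-core peak, not stated in the abstract.
A100_PEAK_TFLOPS = 312.0


def model_flops_utilization(achieved_tflops, peak_tflops=A100_PEAK_TFLOPS):
    """Fraction of the hardware's peak throughput that training achieves."""
    return achieved_tflops / peak_tflops


mfu = model_flops_utilization(225.0)  # 225 TFLOPs/s is the figure in the abstract
print(f"{mfu:.1%}")  # prints 72.1%
```

Under that assumed peak, 225 TFLOPs/s corresponds to roughly 72% utilization, consistent with the number in the abstract.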

arXiv:2305.06161 [pdf, other]

StarCoder: may the source be with you!

Authors: Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu , et al. (42 additional authors not shown)

Abstract: The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

Submitted 13 December, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.05040 [pdf, other]

doi 10.1007/s11548-023-02909-y

Unsupervised out-of-distribution detection for safer robotically guided retinal microsurgery

Authors: Alain Jungo, Lars Doorenbos, Tommaso Da Col, Maarten Beelen, Martin Zinkernagel, Pablo Márquez-Neila, Raphael Sznitman

Abstract: Purpose: A fundamental problem in designing safe machine learning systems is identifying when samples presented to a deployed model differ from those observed at training time. Detecting so-called out-of-distribution (OoD) samples is crucial in safety-critical applications such as robotically guided retinal microsurgery, where distances between the instrument and the retina are derived from sequences of 1D images that are acquired by an instrument-integrated optical coherence tomography (iiOCT) probe. Methods: This work investigates the feasibility of using an OoD detector to identify when images from the iiOCT probe are inappropriate for subsequent machine learning-based distance estimation. We show how a simple OoD detector based on the Mahalanobis distance can successfully reject corrupted samples coming from real-world ex vivo porcine eyes. Results: Our results demonstrate that the proposed approach can successfully detect OoD samples and help maintain the performance of the downstream task within reasonable levels. MahaAD outperformed a supervised approach trained on the same kind of corruptions and achieved the best performance in detecting OoD cases from a collection of iiOCT samples with real-world corruptions. Conclusion: The results indicate that detecting corrupted iiOCT data through OoD detection is feasible and does not need prior knowledge of possible corruptions. Consequently, MahaAD could aid in ensuring patient safety during robotically guided microsurgery by preventing deployed prediction models from estimating distances that put the patient at risk.

Submitted 3 May, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: Accepted at IPCAI 2023

arXiv:2303.11059 [pdf, other]

Six-degree-of-freedom Localization Under Multiple Permanent Magnets Actuation

Authors: Tomas da Veiga, Giovanni Pittiglio, Michael Brockdorff, James H. Chandler, Pietro Valdastri

Abstract: Localization of magnetically actuated medical robots is essential for accurate actuation, closed loop control and delivery of functionality. Despite extensive progress in the use of magnetic field and inertial measurements for pose estimation, these have been either under single external permanent magnet actuation or coil systems. With the advent of new magnetic actuation systems comprised of multiple external permanent magnets for increased control and manipulability, new localization techniques are necessary to account for and leverage the additional magnetic field sources. In this letter, we introduce a novel magnetic localization technique in the Special Euclidean Group SE(3) for multiple external permanent magnetic field actuation and control systems. The method relies on a milli-meter scale three-dimensional accelerometer and a three-dimensional magnetic field sensor and is able to estimate the full 6 degree-of-freedom pose without any prior pose information. We demonstrated the localization system with two external permanent magnets and achieved localization errors of 8.5 ± 2.4 mm in position norm and 3.7 ± 3.6° in orientation, across a cubic workspace with 20 cm length.

Submitted 20 March, 2023; originally announced March 2023.

Comments: Under second round of review at Robotics and Automation Letters

arXiv:2303.09489 [pdf, other]

Effectively Modeling Time Series with Simple Discrete State Spaces

Authors: Michael Zhang, Khaled K. Saab, Michael Poli, Tri Dao, Karan Goel, Christopher Ré

Abstract: Time series modeling is a well-established problem, which often requires that methods (1) expressively represent complicated dependencies, (2) forecast long horizons, and (3) efficiently train over long sequences. State-space models (SSMs) are classical models for time series, and prior works combine SSMs with deep learning layers for efficient sequence modeling. However, we find fundamental limitations with these prior approaches, proving their SSM representations cannot express autoregressive time series processes. We thus introduce SpaceTime, a new state-space time series architecture that improves all three criteria. For expressivity, we propose a new SSM parameterization based on the companion matrix -- a canonical representation for discrete-time processes -- which enables SpaceTime's SSM layers to learn desirable autoregressive processes. For long horizon forecasting, we introduce a "closed-loop" variation of the companion SSM, which enables SpaceTime to predict many future time-steps by generating its own layer-wise inputs. For efficient training and inference, we introduce an algorithm that reduces the memory and compute of a forward pass with the companion matrix. With sequence length $\ell$ and state-space size $d$, we go from $\tilde{O}(d \ell)$ naïvely to $\tilde{O}(d + \ell)$. In experiments, our contributions lead to state-of-the-art results on extensive and diverse benchmarks, with best or second-best AUROC on 6 / 7 ECG and speech time series classification, and best MSE on 14 / 16 Informer forecasting tasks. Furthermore, we find SpaceTime (1) fits AR($p$) processes that prior deep SSMs fail on, (2) forecasts notably more accurately on longer horizons than prior state-of-the-art, and (3) speeds up training on real-world ETTh1 data by 73% and 80% relative wall-clock time over Transformers and LSTMs.

Submitted 16 March, 2023; originally announced March 2023.

Comments: 45 pages, 8 figures, 20 tables, ICLR 2023
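
The link between a companion matrix and autoregressive processes mentioned in the SpaceTime abstract above can be made concrete with a short sketch. It uses the textbook companion form for a noise-free AR(p) rollout; the paper's exact parameterization and its closed-loop layer may differ, and all function names here are hypothetical.

```python
def companion_matrix(phi):
    """Textbook companion matrix for y_t = phi[0]*y_{t-1} + ... + phi[p-1]*y_{t-p}."""
    p = len(phi)
    A = [[0.0] * p for _ in range(p)]
    A[0] = [float(c) for c in phi]   # first row holds the AR coefficients
    for i in range(1, p):
        A[i][i - 1] = 1.0            # sub-diagonal shifts older values down
    return A


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def rollout_state_space(phi, history, steps):
    """Forecast by repeatedly applying the companion matrix to the state.

    `history` lists the last p observations, most recent first.
    """
    A = companion_matrix(phi)
    state = [float(v) for v in history]
    out = []
    for _ in range(steps):
        state = matvec(A, state)     # state = [y_t, y_{t-1}, ..., y_{t-p+1}]
        out.append(state[0])
    return out


def rollout_direct(phi, history, steps):
    """Same forecast computed directly from the AR(p) recursion."""
    past = [float(v) for v in history]
    out = []
    for _ in range(steps):
        y = sum(c * v for c, v in zip(phi, past))
        out.append(y)
        past = [y] + past[:-1]
    return out


print(rollout_state_space([0.5, 0.25], [1.0, 2.0], 3))  # [1.0, 0.75, 0.625]
```

Both rollouts produce the same sequence, which is the sense in which a state-space layer with a companion-form transition can represent an AR(p) process.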

arXiv:2303.01842 [pdf, ps, other]

Independent Control of Two Magnetic Robots using External Permanent Magnets: A Feasibility Study

Authors: Joshua Davy, Tomas da Veiga, Giovanni Pittiglio, James H. Chandler, Pietro Valdastri

Abstract: The ability to have multiple magnetic robots operate independently in the same workspace would increase the clinical potential of these systems allowing collaborative operation. In this work, we investigate the feasibility of actuating two magnetic robots operating within the same workspace using external permanent magnets. Unlike actuation systems based on pairs of electromagnetic coils, the use of multiple permanent magnets comes with the advantage of a large workspace which better suits the clinical setting. In this work, we present an optimization routine capable of generating the required poses for the external magnets in order to control the position and orientation of two magnetic robots. We show that at a distance of 15cm, minimal coupling between the magnetic robots can be achieved (3.9% crosstalk) each embedded with 5mm diameter, 5mm length NdFeB magnets. At smaller distances, we observe that the ability to independently control the robot torques decreases, but forces can still achieve independent control even with alignment of the robots. We test our developed control system in a simulation of two magnetic robots following pre-planned trajectories in close proximity (60 mm) showing a mean positional error of 8.7 mm and mean angular error of 16.7 degrees.

Submitted 3 March, 2023; originally announced March 2023.

Comments: 7 pages, 6 figures, conference

arXiv:2302.13714 [pdf, other]

On the Design of Codes for DNA Computing: Secondary Structure Avoidance Codes

Authors: Tuan Thanh Nguyen, Kui Cai, Han Mao Kiah, Duc Tu Dao, Kees A. Schouhamer Immink

Abstract: In this work, we investigate a challenging problem, which has been considered to be an important criterion in designing codewords for DNA computing purposes, namely secondary structure avoidance in single-stranded DNA molecules. In short, secondary structure refers to the tendency of a single-stranded DNA sequence to fold back upon itself, thus becoming inactive in the computation process. While some design criteria that reduce the possibility of secondary structure formation have been proposed by Milenkovic and Kashyap (2006), the main contribution of this work is to provide an explicit construction of DNA codes that completely avoid secondary structure of arbitrary stem length. Formally, given codeword length n and arbitrary integer m>=2, we provide efficient methods to construct DNA codes of length n that avoid secondary structure of any stem length more than or equal to m. Particularly, when m = 3, our constructions yield a family of DNA codes of rate 1.3031 bits/nt, while the highest rate found in the prior art was 1.1609 bits/nt. In addition, for m>=3log n + 4, we provide an efficient encoder that incurs only one redundant symbol.

Submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.12133 [pdf, other]

Practical Analyses of How Common Social Media Platforms and Photo Storage Services Handle Uploaded Images

Authors: Duc-Tien Dang-Nguyen, Vegard Velle Sjøen, Dinh-Hai Le, Thien-Phu Dao, Anh-Duy Tran, Minh-Triet Tran

Abstract: The research done in this study has delved deeply into the changes made to digital images that are uploaded to three of the major social media platforms and image storage services in today's society: Facebook, Flickr, and Google Photos. In addition to providing up-to-date data on an ever-changing landscape of different social media networks' digital fingerprints, a deep analysis of the social networks' filename conventions has resulted in two new approaches in (i) estimating the true upload date of Flickr photos, regardless of whether the dates have been changed by the user or not, and regardless of whether the image is available to the public or has been deleted from the platform; (ii) revealing the photo ID of a photo uploaded to Facebook based solely on the file name of the photo.

Submitted 23 February, 2023; originally announced February 2023.

arXiv:2302.10866 [pdf, other]

Hyena Hierarchy: Towards Larger Convolutional Language Models

Authors: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

Abstract: Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

Submitted 19 April, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

Comments: Additional details

arXiv:2302.08905 [pdf, other]

GraphLED: A graph-based approach to process and visualise linked engineering documents

Authors: Vanessa Telles da Silva, Lucas de Angelo Martins Ribeiro, Willian Borges de Lemos, Sílvia Silva da Costa Botelho, Nelson Lopes Duarte Filho, Marcelo Rita Pias

Abstract: The architecture, engineering and construction (AEC) sector extensively uses documents supporting product and process development. As part of this, organisations should handle big data of hundreds, or even thousands, of technical documents strongly linked together, including CAD design of industrial plants, equipment purchase orders, quality certificates, and part material analysis. However, analysing such records is daunting for users because it gets complicated to sift through hundreds of documents to establish valuable relationships. This paper addresses how knowledge extracted from linked engineering documents contributes to industrial digitalisation under IT/OT convergence. The proposed GraphLED is a system tasked with data processing, graph-based modelling, and colourful visualisation of related documents. The graph-based approach ensures an improved understanding of linked information because the graph structure offers a promising tool to model the underlying data properties of engineering documents. Preliminary system validation indicates quality improvements are possible in the OCR-based data (85.9% of ambiguous text data removed). This work has the potential to benefit the industry by improving the reliability and resilience of industrial production systems through automated summaries of large quantities of documents and their linkage.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2302.06646 [pdf, other]

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

Authors: Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré

Abstract: State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2$\times$, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2$\times$ faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.

Submitted 13 February, 2023; originally announced February 2023.

arXiv:2301.06031 [pdf]

A Review on the effectiveness of Dimensional Reduction with Computational Forensics: An Application on Malware Analysis

Authors: Aye Thaw Da Naing, Justin Soh Beng Guan, Yarzar Shwe Win, Jonathan Pan

Abstract: The Android operating system is pervasively adopted as the operating system platform of choice for smart devices. However, the strong adoption has also resulted in exponential growth in the number of Android based malicious software or malware. To deal with such cyber threats as part of cyber investigation and digital forensics, computational techniques in the form of machine learning algorithms are applied for such malware identification, detection and forensics analysis. However, such Computational Forensics modelling techniques are constrained by the volume, velocity, variety and veracity of the malware landscape. This in turn would affect their identification and detection effectiveness. Such a consequence would inherently raise the question of the sustainability of such a solution approach. One approach to optimise effectiveness is to apply dimensional reduction techniques like Principal Component Analysis with the intent to enhance algorithmic performance. In this paper, we evaluate the effectiveness of the application of Principal Component Analysis on the Computational Forensics task of detecting Android based malware. We applied our research hypothesis to three different datasets with different machine learning algorithms. Our research result showed that the dimensionally reduced dataset would result in a measure of degradation in accuracy performance.

Submitted 15 January, 2023; originally announced January 2023.

Comments: 18 pages

arXiv:2301.03322 [pdf, other]

Simplifying Open-Set Video Domain Adaptation with Contrastive Learning

Authors: Giacomo Zara, Victor Guilherme Turrisi da Costa, Subhankar Roy, Paolo Rota, Elisa Ricci

Abstract: In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. In this work we propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity based score to identify them. We conduct thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.

Submitted 9 January, 2023; originally announced January 2023.

Comments: Currently under review at Computer Vision and Image Understanding (CVIU) journal

arXiv:2212.14052 [pdf, other]

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Authors: Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

Abstract: State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.

Submitted 28 April, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

Comments: ICLR 2023 Camera-Ready (Notable-top-25% / Spotlight)

arXiv:2211.14453 [pdf, other]

Transform Once: Efficient Operator Learning in Frequency Domain

Authors: Michael Poli, Stefano Massaroli, Federico Berto, Jinykoo Park, Tri Dao, Christopher Ré, Stefano Ermon

Abstract: Spectral analysis provides one of the most effective paradigms for information-preserving dimensionality reduction, as simple descriptions of naturally occurring signals are often obtained via few terms of periodic basis functions. In this work, we study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time: frequency-domain models (FDMs). Existing FDMs are based on complex-valued transforms i.e. Fourier Transforms (FT), and layers that perform computation on the spectrum and input data separately. This design introduces considerable computational overhead: for each layer, a forward and inverse FT. Instead, this work introduces a blueprint for frequency domain learning through a single transform: transform once (T1). To enable efficient, direct learning in the frequency domain we derive a variance-preserving weight initialization scheme and investigate methods for frequency selection in reduced-order FDMs. Our results noticeably streamline the design process of FDMs, pruning redundant transforms, and leading to speedups of 3x to 10x that increase with data resolution and model size. We perform extensive experiments on learning the solution operator of spatio-temporal dynamics, including incompressible Navier-Stokes, turbulent flows around airfoils and high-resolution video of smoke. T1 models improve on the test performance of FDMs while requiring significantly less computation (5 hours instead of 32 for our large-scale experiment), with over 20% reduction in average predictive error across tasks.

Submitted 25 November, 2022; originally announced November 2022.

Comments: Published at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2211.01438 [pdf, other]

Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

Authors: Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang

Abstract: This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first pass streaming recognition and second pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves the accuracy by up to 8% relative in the acoustic re-scoring scenario.

Submitted 18 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: To appear in ICASSP 2023

Journal ref: International Conference on Acoustics, Speech, and Signal Processing, 2023

arXiv:2210.12214 [pdf, ps, other]

Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text Generation

Authors: Thien Nguyen, Nathalie Tran, Liuhui Deng, Thiago Fraga da Silva, Matthew Radzihovsky, Roger Hsiao, Henry Mason, Stefan Braun, Erik McDermott, Dogan Can, Pawel Swietojanski, Lyan Verwimp, Sibel Oyman, Tresi Arvizo, Honza Silovsky, Arnab Ghoshal, Mathieu Martel, Bharat Ram Ambati, Mohamed Ali

Abstract: Code-switching describes the practice of using more than one language in the same sentence. In this study, we investigate how to optimize a neural transducer based bilingual automatic speech recognition (ASR) model for code-switching speech. Focusing on the scenario where the ASR model is trained without supervised code-switching data, we found that semi-supervised training and synthetic code-switched data can improve the bilingual ASR system on code-switching speech. We analyze how each of the neural transducer's encoders contributes towards code-switching performance by measuring encoder-specific recall values, and evaluate our English/Mandarin system on the ASCEND data set. Our final system achieves 25% mixed error rate (MER) on the ASCEND English/Mandarin code-switching test set -- reducing the MER by 2.1% absolute compared to the previous literature -- while maintaining good accuracy on the monolingual test sets.

Submitted 21 October, 2022; originally announced October 2022.

Comments: 5 pages, 1 figure, submitted to ICASSP 2023, *: equal contributions

arXiv:2210.06583 [pdf, other]

S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces

Authors: Eric Nguyen, Karan Goel, Albert Gu, Gordon W. Downs, Preey Shah, Tri Dao, Stephen A. Baccus, Christopher Ré

Abstract: Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in $1$D, $2$D, and $3$D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1.5\%$ when training with a $1$D sequence of patches, and matches ConvNeXt when modeling images in $2$D. For videos, S4ND improves on an inflated $3$D ConvNeXt in activity classification on HMDB-51 by $4\%$. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple bandlimiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at training time) resolution performance, outperforming a baseline Conv2D by $40\%$ on CIFAR-10 when trained on $8 \times 8$ and tested on $32 \times 32$ images. When trained with progressive resizing, S4ND comes within $\sim 1\%$ of a high-resolution model while training $22\%$ faster.

Submitted 13 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: NeurIPS 2022

arXiv:2210.02390 [pdf, other]

Bayesian Prompt Learning for Image-Language Model Generalization

Authors: Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G. M. Snoek, Georgios Tzimiropoulos, Brais Martinez

Abstract: Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning

Submitted 20 August, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: Accepted at ICCV 2023

arXiv:2209.13774 [pdf, other]

ButterflyFlow: Building Invertible Layers with Butterfly Matrices

Authors: Chenlin Meng, Linqi Zhou, Kristy Choi, Tri Dao, Stefano Ermon

Abstract: Normalizing flows model complex probability distributions using maps obtained by composing invertible layers. Special linear layers such as masked and 1x1 convolutions play a key role in existing architectures because they increase expressive power while having tractable Jacobians and inverses. We propose a new family of invertible linear layers based on butterfly layers, which are known to theoretically capture complex linear structures including permutations and periodicity, yet can be inverted efficiently. This representational power is a key advantage of our approach, as such structures are common in many real-world datasets. Based on our invertible butterfly layers, we construct a new class of normalizing flow models called ButterflyFlow. Empirically, we demonstrate that ButterflyFlows not only achieve strong density estimation results on natural images such as MNIST, CIFAR-10, and ImageNet 32x32, but also obtain significantly better log-likelihoods on structured datasets such as galaxy images and MIMIC-III patient cohorts -- all while being more efficient in terms of memory and computation than relevant baselines.

Submitted 27 September, 2022; originally announced September 2022.

Comments: ICML 2022

arXiv:2207.12842 [pdf, other]

Unsupervised Domain Adaptation for Video Transformers in Action Recognition

Authors: Victor G. Turrisi da Costa, Giacomo Zara, Paolo Rota, Thiago Oliveira-Santos, Nicu Sebe, Vittorio Murino, Elisa Ricci

Abstract: Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, when compared to the extensive literature available for images, the field of videos is still relatively unexplored. On the other hand, the performance of a model in action recognition is heavily affected by domain shift. In this paper, we propose a simple and novel UDA approach for video action recognition. Our approach leverages recent advances on spatio-temporal transformers to build a robust source model that better generalises to the target domain. Furthermore, our architecture learns domain invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle. We report results on two video action recognition benchmarks for UDA, showing state-of-the-art performance on HMDB$\leftrightarrow$UCF, as well as on Kinetics$\rightarrow$NEC-Drone, which is more challenging. This demonstrates the effectiveness of our method in handling different levels of domain shift. The source code is available at https://github.com/vturrisi/UDAVT.

Submitted 26 July, 2022; originally announced July 2022.

Comments: Accepted at ICPR 2022

arXiv:2207.11638 [pdf, other]

doi 10.1016/j.image.2017.06.014

DCT Approximations Based on Chen's Factorization

Authors: C. J. Tablada, T. L. T. da Silveira, R. J. Cintra, F. M. Bayer

Abstract: In this paper, two 8-point multiplication-free DCT approximations based on Chen's factorization are proposed and their fast algorithms are also derived. Both transformations are assessed in terms of computational cost, error energy, and coding gain. Experiments with a JPEG-like image compression scheme are performed and results are compared with competing methods. The proposed low-complexity transforms are scaled according to the Jridi-Alfalou-Meher algorithm to effect 16- and 32-point approximations. The new sets of transformations are embedded into an HEVC reference software to provide a fully HEVC-compliant video coding scheme. We show that approximate transforms can outperform traditional transforms and state-of-the-art methods at a very low complexity cost.

Submitted 23 July, 2022; originally announced July 2022.

Comments: 19 pages, 8 figures, 5 tables

Journal ref: Signal Processing: Image Communication, Volume 58, October 2017, Pages 14-23

arXiv:2206.14097 [pdf, other]

Item Matching using Text Description and Similarity Search

Authors: Ana Paula Appel, Anderson Luis de Paula Silva, Adriana Reigota Silva, Caique Dutra Santos, Thiago Logo da Silva, Rafael Poggi de Araujo, Luiz Carlos Faray de Aquino

Abstract: In this paper, we focus on the problem of item matching using only the description. Those specific items not only lack a unique code but also contain short text descriptions, making the item matching process difficult. Our goal is to compare products using only the description provided by the purchase process. Therefore, evaluating other characteristics and differences can uncover possible flaws during the acquiring phase. However, the text of the items that we were working on was very small, with numbers due to the nature of the products, and we had a limited amount of time, 8 weeks, to develop the solution. As a result, we showed that, using a well-oriented methodology, we were able to deliver a successful MVP and achieve the results expected with up to 55% match.

Submitted 1 July, 2022; v1 submitted 28 June, 2022; originally announced June 2022.

arXiv:2206.01299 [pdf, other]

Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees

Authors: Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang

Abstract: Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AC-SGD, a novel activation compression algorithm for communication-efficient pipeline parallelism training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AC-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve $O(1/\sqrt{T})$ convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AC-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AC-SGD to fine-tune language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AC-SGD provides up to 4.3X end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AC-SGD can be combined with state-of-the-art gradient compression algorithms to enable "end-to-end communication compression": all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9X end-to-end speed-up, without sacrificing model quality.

Submitted 6 March, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2206.01288 [pdf, other]

Decentralized Training of Foundation Models in Heterogeneous Environments

Authors: Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang

Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements.
In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).

Submitted 21 June, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2206.00122 [pdf, other]

doi 10.1109/TCSVT.2021.3134054

A Class of Low-complexity DCT-like Transforms for Image and Video Coding

Authors: T. L. T. da Silveira, D. R. Canterle, D. F. G. Coelho, V. A. Coutinho, F. M. Bayer, R. J. Cintra

Abstract: The discrete cosine transform (DCT) is a relevant tool in signal processing applications, mainly known for its good decorrelation properties. Current image and video coding standards -- such as JPEG and HEVC -- adopt the DCT as a fundamental building block for compression. Recent works have introduced low-complexity approximations for the DCT, which become paramount in applications demanding real-time computation and low-power consumption. The design of DCT approximations involves a trade-off between computational complexity and performance. This paper introduces a new multiparametric transform class encompassing the round-off DCT (RDCT) and the modified RDCT (MRDCT), two relevant multiplierless 8-point approximate DCTs. The associated fast algorithm is provided. Four novel orthogonal low-complexity 8-point DCT approximations are obtained by solving a multicriteria optimization problem. The optimal 8-point transforms are scaled to lengths 16 and 32 while keeping the arithmetic complexity low. The proposed methods are assessed by proximity and coding measures with respect to the exact DCT. Image and video coding experiments and a hardware realization are performed. The novel transforms perform close to or outperform the current state-of-the-art DCT approximations.

Submitted 8 December, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

Comments: Corrected a typo in the general expression for the diagonal matrix S(a) (Equation 11, Section 3.1). Manuscript has 20 pages, 8 figures, 9 tables

MSC Class: 94A08; 65D15

Journal ref: IEEE Transactions on Circuits and Systems for Video Technology, v. 32, n. 7, July 2022

Showing 1–50 of 94 results for author: Dao, T