-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Authors:
Yu Sun,
Xinhao Li,
Karan Dalal,
Jiarui Xu,
Arjun Vikram,
Genghan Zhang,
Yann Dubois,
Xinlei Chen,
Xiaolong Wang,
Sanmi Koyejo,
Tatsunori Hashimoto,
Carlos Guestrin
Abstract:
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
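A minimal sketch of the key idea stated in the abstract: the hidden state is itself a model (here a linear map `W`), and the update rule is one step of self-supervised learning per token. The denoising reconstruction loss, the learning rate, the noise level, and the function name `ttt_linear_scan` are all illustrative assumptions, not details taken from the paper.

```python
import random

def ttt_linear_scan(tokens, lr=0.1, noise=0.1, seed=0):
    """Toy TTT-style scan over a sequence of d-dimensional tokens.

    Hidden state: weight matrix W of a linear model.
    Update rule (assumed): one gradient step on 0.5 * ||W x_noisy - x||^2.
    Output: the updated model applied to the clean token, W x.
    """
    rng = random.Random(seed)
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # hidden state starts as the zero map
    outputs, losses = [], []
    for x in tokens:
        # Self-supervised view: corrupt the input, reconstruct the original.
        x_noisy = [xi + rng.gauss(0.0, noise) for xi in x]
        pred = [sum(W[i][j] * x_noisy[j] for j in range(d)) for i in range(d)]
        err = [pred[i] - x[i] for i in range(d)]
        losses.append(0.5 * sum(e * e for e in err))
        # Gradient of the loss w.r.t. W is the outer product err * x_noisy^T.
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * err[i] * x_noisy[j]
        # Emit the output token using the freshly updated hidden state.
        outputs.append([sum(W[i][j] * x[j] for j in range(d)) for i in range(d)])
    return outputs, losses
```

The cost per token is fixed by the size of `W`, so the scan is linear in sequence length, and the per-token loss falls as the hidden state fits the sequence it is reading.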
Submitted 5 July, 2024;
originally announced July 2024.
-
Deconvolving Complex Neuronal Networks into Interpretable Task-Specific Connectomes
Authors:
Yifan Wang,
Vikram Ravindra,
Ananth Grama
Abstract:
Task-specific functional MRI (fMRI) images provide excellent modalities for studying the neuronal basis of cognitive processes. We use fMRI data to formulate and solve the problem of deconvolving task-specific aggregate neuronal networks into a set of basic building blocks called canonical networks, to use these networks for functional characterization, and to characterize the physiological basis of these responses by mapping them to regions of the brain. Our results show excellent task-specificity of canonical networks, i.e., the expression of a small number of canonical networks can be used to accurately predict tasks; generalizability across cohorts, i.e., canonical networks are conserved across diverse populations, studies, and acquisition protocols; and that canonical networks have strong anatomical and physiological basis. From a methods perspective, the problem of identifying these canonical networks poses challenges rooted in the high dimensionality, small sample size, acquisition variability, and noise. Our deconvolution technique is based on non-negative matrix factorization (NMF) that identifies canonical networks as factors of a suitably constructed matrix. We demonstrate that our method scales to large datasets, yields stable and accurate factors, and is robust to noise.
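As a hedged illustration of the factorization step mentioned above, the following sketch implements generic NMF with the classic multiplicative updates for the squared-error objective. This is a swapped-in textbook algorithm, not necessarily the authors' solver; how the paper constructs its input matrix is not specified here, and the helper names are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise; keeps H non-negative.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise; keeps W non-negative.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H
```

In this reading, each row of `H` would play the role of one candidate building block and each row of `W` the non-negative expression of those blocks in one sample.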
Submitted 3 July, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.
-
HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model
Authors:
Hieu T. Nguyen,
Yiwen Chen,
Vikram Voleti,
Varun Jampani,
Huaizu Jiang
Abstract:
We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/
Submitted 28 June, 2024;
originally announced June 2024.
-
WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Authors:
Chengqian Ma,
Zhanxiang Hua,
Alexandra Anderson-Frey,
Vikram Iyer,
Xin Liu,
Lianhui Qin
Abstract:
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models, including GPT4, Claude3.5, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.
Submitted 23 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Faster Spectral Density Estimation and Sparsification in the Nuclear Norm
Authors:
Yujia Jin,
Ishani Karmarkar,
Christopher Musco,
Aaron Sidford,
Apoorv Vikram Singh
Abstract:
We consider the problem of estimating the spectral density of the normalized adjacency matrix of an $n$-node undirected graph. We provide a randomized algorithm that, with $O(nε^{-2})$ queries to a degree and neighbor oracle and in $O(nε^{-3})$ time, estimates the spectrum up to $ε$ accuracy in the Wasserstein-1 metric. This improves on previous state-of-the-art methods, including an $O(nε^{-7})$ time algorithm from [Braverman et al., STOC 2022] and, for sufficiently small $ε$, a $2^{O(ε^{-1})}$ time method from [Cohen-Steiner et al., KDD 2018]. To achieve this result, we introduce a new notion of graph sparsification, which we call nuclear sparsification. We provide an $O(nε^{-2})$-query and $O(nε^{-2})$-time algorithm for computing $O(nε^{-2})$-sparse nuclear sparsifiers. We show that this bound is optimal in both its sparsity and query complexity, and we separate our results from the related notion of additive spectral sparsification. Of independent interest, we show that our sparsification method also yields the first deterministic algorithm for spectral density estimation that scales linearly with $n$ (sublinear in the representation size of the graph).
Submitted 11 June, 2024;
originally announced June 2024.
-
Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
Authors:
Neemesh Yadav,
Sarah Masud,
Vikram Goyal,
Md Shad Akhtar,
Tanmoy Chakraborty
Abstract:
Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task.
Submitted 6 June, 2024;
originally announced June 2024.
-
Item-Language Model for Conversational Recommendation
Authors:
Li Yang,
Anushya Subbiah,
Hardik Patel,
Judith Yue Li,
Yanwei Song,
Reza Mirghaderi,
Vikram Aggarwal
Abstract:
Large-language Models (LLMs) have been extremely successful at tasks like complex dialogue understanding, reasoning and coding due to their emergent abilities. These emergent abilities have been extended with multi-modality to include image, audio, and video capabilities. Recommender systems, on the other hand, have been critical for information seeking and item discovery needs. Recently, there have been attempts to apply LLMs for recommendations. One difficulty of current attempts is that the underlying LLM is usually not trained on the recommender system data, which largely contains user interaction signals and is often not publicly available. Another difficulty is that user interaction signals often have a different pattern from natural language text, and it is currently unclear whether the LLM training setup can learn more non-trivial knowledge from interaction signals than traditional recommender system methods. Finally, it is difficult to train multiple LLMs for different use-cases, and to retain the original language and reasoning abilities when learning from recommender system data. To address these three limitations, we propose an Item-Language Model (ILM), which is composed of an item encoder to produce text-aligned item representations that encode user interaction signals, and a frozen LLM that can understand those item representations with preserved pretrained knowledge. We conduct extensive experiments which demonstrate both the importance of the language-alignment and of user interaction knowledge in the item encoder.
Submitted 4 June, 2024;
originally announced June 2024.
-
SymTax: Symbiotic Relationship and Taxonomy Fusion for Effective Citation Recommendation
Authors:
Karan Goyal,
Mayank Goel,
Vikram Goyal,
Mukesh Mohania
Abstract:
Citing pertinent literature is pivotal to writing and reviewing a scientific document. Existing techniques mainly focus on the local context or the global context for recommending citations but fail to consider the actual human citation behaviour. We propose SymTax, a three-stage recommendation architecture that considers both the local and the global context, and additionally the taxonomical representations of query-candidate tuples and the Symbiosis prevailing amongst them. SymTax learns to embed the infused taxonomies in the hyperbolic space and uses hyperbolic separation as a latent feature to compute query-candidate similarity. We build a novel and large dataset ArSyTa containing 8.27 million citation contexts and describe the creation process in detail. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and design choice of each module in our framework. Also, combinatorial analysis from our experiments sheds light on the choice of language models (LMs) and fusion embedding, and the inclusion of section heading as a signal. Our proposed module that captures the symbiotic relationship solely leads to performance gains of 26.66% and 39.25% in Recall@5 w.r.t. SOTA on ACL-200 and RefSeer datasets, respectively. The complete framework yields a gain of 22.56% in Recall@5 w.r.t. SOTA on our proposed dataset. The code and dataset are available at https://github.com/goyalkaraniit/SymTax
Submitted 26 May, 2024;
originally announced June 2024.
-
SpecTra: Enhancing the Code Translation Ability of Language Models by Generating Multi-Modal Specifications
Authors:
Vikram Nitin,
Baishakhi Ray
Abstract:
Large language models (LLMs) are increasingly being used for the task of automated code translation, which has important real-world applications. However, most existing approaches use only the source code of a program as an input to an LLM, and do not consider the different kinds of specifications that can be extracted from a program. In this paper, we propose SpecTra, a multi-stage approach that uses a novel self-consistency filter to first generate high-quality invariants, test cases, and natural language descriptions from a given program, and then uses these along with the source code to improve the quality of LLM-generated translations. We evaluate SpecTra on two code translation tasks - C to Rust, and C to Go - and show that it can enhance the performance of four popular LLMs on these tasks by up to 10 percentage points and a relative improvement of up to 23%. Our research suggests that generating high-quality specifications could be a promising and efficient way to improve the performance of LLMs for code translation.
Submitted 28 May, 2024;
originally announced May 2024.
-
Leveraging Machine Learning for Accurate IoT Device Identification in Dynamic Wireless Contexts
Authors:
Bhagyashri Tushir,
Vikram K Ramanna,
Yuhong Liu,
Behnam Dezfouli
Abstract:
Identifying IoT devices is crucial for network monitoring, security enforcement, and inventory tracking. However, most existing identification methods rely on deep packet inspection, which raises privacy concerns and adds computational complexity. More importantly, existing works overlook the impact of wireless channel dynamics on the accuracy of layer-2 features, thereby limiting their effectiveness in real-world scenarios. In this work, we define and use the latency of specific probe-response packet exchanges, referred to as "device latency," as the main feature for device identification. Additionally, we reveal the critical impact of wireless channel dynamics on the accuracy of device identification based on device latency. Specifically, this work introduces "accumulation score" as a novel approach to capturing fine-grained channel dynamics and their impact on device latency when training machine learning models. We implement the proposed methods and measure the accuracy and overhead of device identification in real-world scenarios. The results confirm that by incorporating the accumulation score for balanced data collection and training machine learning algorithms, we achieve an F1 score of over 97% for device identification, even amidst wireless channel dynamics, a significant improvement over the 75% F1 score achieved by disregarding the impact of channel dynamics on data collection and device latency.
Submitted 15 May, 2024;
originally announced May 2024.
-
Robotic Path Planning Implementation using Search Algorithms
Authors:
Vikram Shahapur,
Blessing Dixon,
Urvishkumar Bharti
Abstract:
Many path planning algorithms have been proposed in the literature. The objective of these algorithms is to find the quickest path from an initial position to an end position in a given environment. The complexity of these algorithms depends on internal parameters, such as motor speed or sensor range, and on external parameters, including the accuracy of the map, the size of the environment, and the number of obstacles. In this paper, we describe how path planning algorithms find the optimal path over uneven terrain with multiple obstacles, using a TurtleBot3 robot in the Gazebo environment with Dijkstra's and A* (A-star) algorithms.
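For orientation, here is a small sketch of the two search algorithms named in the abstract, run on a 4-connected occupancy grid with unit step cost. The grid representation, the Manhattan heuristic, and the function name `shortest_path_length` are illustrative assumptions; the paper's TurtleBot3 and Gazebo setup is not reproduced here.

```python
import heapq

def shortest_path_length(grid, start, goal, use_astar=False):
    """Return (path cost, nodes expanded) on a grid where 0 = free, 1 = obstacle.

    With use_astar=False the heuristic is zero, which is Dijkstra's algorithm.
    With use_astar=True the Manhattan distance to the goal guides the search.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        if not use_astar:
            return 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}                      # cheapest known cost to each cell
    frontier = [(h(start), 0, start)]      # entries are (priority, cost, cell)
    expanded = 0
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cost > best.get(cell, float("inf")):
            continue                       # stale queue entry, skip it
        expanded += 1
        if cell == goal:
            return cost, expanded
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(frontier, (new_cost + h((nr, nc)), new_cost, (nr, nc)))
    return None, expanded                  # goal is unreachable
```

Because the Manhattan distance never overestimates the remaining cost on this kind of grid, both variants return the same optimal path cost; they differ only in the order in which cells are explored.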
Submitted 26 May, 2024;
originally announced May 2024.
-
Multi-Subject Personalization
Authors:
Arushi Jain,
Shubham Paliwal,
Monika Sharma,
Vikram Jamwal,
Lovekesh Vig
Abstract:
Creative story illustration requires a consistent interplay of multiple characters or objects. However, conventional text-to-image models face significant challenges while producing images featuring multiple personalized subjects. For example, they distort the subject rendering, or the text descriptions fail to render coherent subject interactions. We present Multi-Subject Personalization (MSP) to alleviate some of these challenges. We implement MSP using Stable Diffusion and assess our approach against other text-to-image models, showcasing its consistent generation of good-quality images representing intended subjects and interactions.
Submitted 21 May, 2024;
originally announced May 2024.
-
CustomText: Customized Textual Image Generation using Diffusion Models
Authors:
Shubham Paliwal,
Arushi Jain,
Monika Sharma,
Vikram Jamwal,
Lovekesh Vig
Abstract:
Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding. Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes. In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models. We call our proposed method CustomText. Our implementation leverages a pre-trained TextDiffuser model to enable control over font color, background, and types. Additionally, to address the challenge of accurately rendering small-sized fonts, we train the ControlNet model for a consistency decoder, significantly enhancing text-generation performance. We assess the performance of CustomText in comparison to previous methods of textual image generation on the publicly available CTW-1500 dataset and a self-curated dataset for small-text generation, showcasing superior results.
Submitted 21 May, 2024;
originally announced May 2024.
-
Identifying Hate Speech Peddlers in Online Platforms. A Bayesian Social Learning Approach for Large Language Model Driven Decision-Makers
Authors:
Adit Jain,
Vikram Krishnamurthy
Abstract:
This paper studies the problem of autonomous agents performing Bayesian social learning for sequential detection when the observations of the state belong to a high-dimensional space and are expensive to analyze. Specifically, when the observations are textual, the Bayesian agent can use a large language model (LLM) as a map to get a low-dimensional private observation. The agent performs Bayesian learning and takes an action that minimizes the expected cost and is visible to subsequent agents. We prove that a sequence of such Bayesian agents herd in finite time to the public belief and take the same action disregarding the private observations. We propose a stopping time formulation for quickest time herding in social learning and optimally balance privacy and herding. Structural results are shown on the threshold nature of the optimal policy to the stopping time problem. We illustrate the application of our framework when autonomous Bayesian detectors aim to sequentially identify if a user is a hate speech peddler on an online platform by parsing text observations using an LLM. We numerically validate our results on real-world hate speech datasets. We show that autonomous Bayesian agents designed to flag hate speech peddlers in online platforms herd and misclassify the users when the public prior is strong. We also numerically show the effect of a threshold policy in delaying herding.
Submitted 12 May, 2024;
originally announced May 2024.
-
Structured Reinforcement Learning for Incentivized Stochastic Covert Optimization
Authors:
Adit Jain,
Vikram Krishnamurthy
Abstract:
This paper studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and incentive. The learner's query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This paper formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using conditions for interval-dominance on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.
Submitted 12 May, 2024;
originally announced May 2024.
-
Investigating Interaction Modes and User Agency in Human-LLM Collaboration for Domain-Specific Data Analysis
Authors:
Jiajing Guo,
Vikram Mohanty,
Jorge Piazentin Ono,
Hongtao Hao,
Liang Gou,
Liu Ren
Abstract:
Despite demonstrating robust capabilities in general-domain data-operation tasks, Large Language Models (LLMs) may exhibit shortcomings when applied to domain-specific tasks. We consider the design of domain-specific AI-powered data analysis tools from two dimensions: interaction and user agency. We implemented two design probes that fall on the two ends of the two dimensions: an open-ended high agency (OHA) prototype and a structured low agency (SLA) prototype. We conducted an interview study with nine data scientists to investigate (1) how users perceived the LLM outputs for data analysis assistance, and (2) how the two test design probes, OHA and SLA, affected user behavior, performance, and perceptions. Our study revealed insights regarding participants' interactions with LLMs, how they perceived the results, and their desire for explainability concerning LLM outputs, along with a noted need for collaboration with other users, and how they envisioned the utility of LLMs in their workflow.
Submitted 9 May, 2024;
originally announced May 2024.
-
Exponential time propagators for elastodynamics
Authors:
Paavai Pari,
Bikash Kanungo,
Vikram Gavini
Abstract:
We propose a computationally efficient and systematically convergent approach for elastodynamics simulations. We recast the second-order dynamical equation of elastodynamics into an equivalent first-order system of coupled equations, so as to express the solution in the form of a Magnus expansion. With any spatial discretization, it entails computing the exponential of a matrix acting upon a vector. We employ an adaptive Krylov subspace approach to inexpensively and accurately evaluate the action of the exponential matrix on a vector. In particular, we use an a priori error estimate to predict the optimal Krylov subspace size required for each time-step size. We show that the Magnus expansion truncated after its first term provides quadratic and superquadratic convergence in the time-step for nonlinear and linear elastodynamics, respectively. We demonstrate the accuracy and efficiency of the proposed method for one linear (linear cantilever beam) and three nonlinear (nonlinear cantilever beam, soft tissue elastomer, and hyperelastic rubber) benchmark systems. For a desired accuracy in energy, displacement, and velocity, our method allows for $10-100\times$ larger time-steps than conventional time-marching schemes such as Newmark-$β$ method. Computationally, it translates to a $\sim$$1000\times$ and $\sim$$10-100\times$ speed-up over conventional time-marching schemes for linear and nonlinear elastodynamics, respectively.
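A hedged sketch of the recasting described above, for the simplest case only. The semi-discrete, undamped, unforced linear system and the symbols $M$ (mass matrix), $K$ (stiffness matrix), $u$ (displacement), and $v$ (velocity) are assumptions made for illustration, not notation taken from the paper. Setting $v = \dot{u}$ turns the second-order equation into a first-order system:

$$
M\ddot{u} + Ku = 0
\quad\Longrightarrow\quad
\frac{d}{dt}\begin{pmatrix} u \\ v \end{pmatrix}
= A \begin{pmatrix} u \\ v \end{pmatrix},
\qquad
A = \begin{pmatrix} 0 & I \\ -M^{-1}K & 0 \end{pmatrix}.
$$

For constant $A$, one time step is the action of a matrix exponential on a vector; when $A$ varies in time, keeping only the first term of the Magnus expansion gives the approximation on the right:

$$
y(t+\Delta t) = \exp(\Delta t\, A)\, y(t),
\qquad
y(t+\Delta t) \approx \exp\!\left(\int_{t}^{t+\Delta t} A(s)\, ds\right) y(t),
\qquad
y = \begin{pmatrix} u \\ v \end{pmatrix}.
$$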
Submitted 8 May, 2024;
originally announced May 2024.
-
Adaptive Mechanism Design using Multi-Agent Revealed Preferences
Authors:
Luke Snow,
Vikram Krishnamurthy
Abstract:
This paper constructs an algorithmic framework for adaptively achieving the mechanism design objective, finding a mechanism inducing socially optimal Nash equilibria, without knowledge of the utility functions of the agents. We consider a probing scheme where the designer can iteratively enact mechanisms and observe Nash equilibria responses. We first derive necessary and sufficient conditions, taking the form of linear program feasibility, for the existence of utility functions under which the empirical Nash equilibria responses are socially optimal. Then, we utilize this to construct a loss function with respect to the mechanism, and show that its global minimization occurs at mechanisms under which Nash equilibria system responses are also socially optimal. We develop a simulated annealing-based gradient algorithm, and prove that it converges in probability to this set of global minima, thus achieving adaptive mechanism design.
Submitted 23 April, 2024;
originally announced April 2024.
-
Using Capability Maps Tailored to Arm Range of Motion in VR Exergames for Rehabilitation
Authors:
Christian Lourido,
Zaid Waghoo,
Hassam Khan Wazir,
Nishtha Bhagat,
Vikram Kapila
Abstract:
Many neurological conditions, e.g., a stroke, can cause patients to experience upper limb (UL) motor impairments that hinder their daily activities. For such patients, while rehabilitation therapy is key for regaining autonomy and restoring mobility, its long-term nature entails ongoing time commitment and it is often not sufficiently engaging. Virtual reality (VR) can transform rehabilitation therapy into engaging game-like tasks that can be tailored to patient-specific activities, set goals, and provide rehabilitation assessment. Yet, most VR systems lack built-in methods to track progress over time and alter rehabilitation programs accordingly. We propose using arm kinematic modeling and capability maps to allow a VR system to understand a user's physical capability and limitation. Next, we suggest two use cases for the VR system to utilize the user's capability map for tailoring rehabilitation programs. Finally, for one use case, it is shown that the VR system can emphasize and assess the use of specific UL joints.
Submitted 18 April, 2024;
originally announced April 2024.
-
Wireless Earphone-based Real-Time Monitoring of Breathing Exercises: A Deep Learning Approach
Authors:
Hassam Khan Wazir,
Zaid Waghoo,
Vikram Kapila
Abstract:
Several therapy routines require deep breathing exercises as a key component and patients undergoing such therapies must perform these exercises regularly. Assessing the outcome of a therapy and tailoring its course necessitates monitoring a patient's compliance with the therapy. While therapy compliance monitoring is routine in a clinical environment, it is challenging to do in an at-home setting. This is so because a home setting lacks access to specialized equipment and skilled professionals needed to effectively monitor the performance of a therapy routine by a patient. For some types of therapies, these challenges can be addressed with the use of consumer-grade hardware, such as earphones and smartphones, as practical solutions. To accurately monitor breathing exercises using wireless earphones, this paper proposes a framework that has the potential for assessing a patient's compliance with an at-home therapy. The proposed system performs real-time detection of breathing phases and channels with high accuracy by processing a $\mathbf{500}$ ms audio signal through two convolutional neural networks. The first network, called a channel classifier, distinguishes between nasal and oral breathing, and a pause. The second network, called a phase classifier, determines whether the audio segment is from inhalation or exhalation. According to $k$-fold cross-validation, the channel and phase classifiers achieved a maximum F1 score of $\mathbf{97.99\%}$ and $\mathbf{89.46\%}$, respectively. The results demonstrate the potential of using commodity earphones for real-time breathing channel and phase detection for breathing therapy compliance monitoring.
Submitted 16 April, 2024;
originally announced April 2024.
-
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Authors:
Aleksandar Botev,
Soham De,
Samuel L Smith,
Anushan Fernando,
George-Cristian Muraru,
Ruba Haroun,
Leonard Berrada,
Razvan Pascanu,
Pier Giuseppe Sessa,
Robert Dadashi,
Léonard Hussenot,
Johan Ferret,
Sertan Girgin,
Olivier Bachem,
Alek Andreev,
Kathleen Kenealy,
Thomas Mesnard,
Cassidy Hardin,
Surya Bhupatiraju,
Shreya Pathak,
Laurent Sifre,
Morgane Rivière,
Mihir Sanjay Kale,
Juliette Love,
Pouya Tafti, et al. (37 additional authors not shown)
Abstract:
We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
Submitted 11 April, 2024;
originally announced April 2024.
-
Biodegradable Interactive Materials
Authors:
Zhihan Zhang,
Mallory Parker,
Kuotian Liao,
Jerry Cao,
Anandghan Waghmare,
Joseph Breda,
Chris Matsumura,
Serena Eley,
Eleftheria Roumeli,
Shwetak Patel,
Vikram Iyer
Abstract:
The sense of touch is fundamental to how we interact with the physical and digital world. Conventional interactive surfaces and tactile interfaces use electronic sensors embedded into objects; however, this approach poses serious challenges both for environmental sustainability and for a future of truly ubiquitous interaction systems where information is encoded into everyday objects. In this work, we present Biodegradable Interactive Materials: backyard-compostable interactive interfaces that leverage information encoded in material properties. Inspired by natural systems, we propose an architecture that programmatically encodes multidimensional information into materials themselves and combines them with wearable devices that extend human senses to perceive the embedded data. We combine unrefined biological matter from plants and algae like chlorella with natural minerals like graphite and magnetite to produce materials with varying electrical, magnetic, and surface properties. We perform in-depth analysis using physics models, computational simulations, and real-world experiments to characterize their information density and develop decoding methods. Our passive, chip-less materials can robustly encode 12 bits of information, equivalent to 4096 unique classes. We further develop wearable device prototypes that can decode this information during touch interactions using off-the-shelf sensors. We demonstrate sample applications such as customized buttons, tactile maps, and interactive surfaces. We further demonstrate that these interactive materials degrade naturally outdoors within 21 days and perform a comparative environmental analysis of the benefits of this approach.
Submitted 3 April, 2024;
originally announced April 2024.
-
Eclipse Attack Detection on a Blockchain Network as a Non-Parametric Change Detection Problem
Authors:
Anurag Gupta,
Vikram Krishnamurthy,
Brian M. Sadler
Abstract:
This paper introduces a novel non-parametric change detection algorithm to identify eclipse attacks on a blockchain network; the non-parametric algorithm relies only on the empirical mean and variance of the dataset, making it highly adaptable. An eclipse attack occurs when malicious actors isolate blockchain users, disrupting their ability to reach consensus with the broader network, thereby distorting their local copy of the ledger. To detect an eclipse attack, we monitor changes in the Fréchet mean and variance of the evolving blockchain communication network connecting blockchain users. First, we leverage the Johnson-Lindenstrauss lemma to project large-dimensional networks into a lower-dimensional space, preserving essential statistical properties. Subsequently, we employ a non-parametric change detection procedure, leading to a test statistic that converges weakly to a Brownian bridge process in the absence of an eclipse attack. This enables us to quantify the false alarm rate of the detector. Our detector can be implemented as a smart contract on the blockchain, offering a tamper-proof and reliable solution. Finally, we use numerical examples to compare the proposed eclipse attack detector with a detector based on the random forest model.
Submitted 30 May, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
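The Johnson-Lindenstrauss projection step mentioned in the abstract above can be illustrated with a minimal sketch. It projects high-dimensional vectors through a scaled Gaussian random matrix and leaves pairwise distances approximately unchanged. The function names are hypothetical, and plain vectors stand in for the paper's network representations, whose exact encoding the abstract does not specify.

```python
import math
import random

def jl_project(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian random matrix.

    Entries are drawn from N(0, 1/k), so squared distances are preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in R] for p in points]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, projecting a handful of 500-dimensional vectors down to k = 200 typically changes each pairwise distance by only a few percent, which is why statistics such as means and variances can be monitored in the lower-dimensional space.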
-
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
Authors:
Vikram Voleti,
Chun-Han Yao,
Mark Boss,
Adam Letts,
David Pankratz,
Dmitry Tochilkin,
Christian Laforte,
Robin Rombach,
Varun Jampani
Abstract:
We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation proposes techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affecting the performance of 3D object generation. In this work, we propose SV3D, which adapts an image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of the video models, while further adding explicit camera control for NVS. We also propose improved 3D optimization techniques to use SV3D and its NVS outputs for image-to-3D generation. Extensive experimental results on multiple datasets with 2D and 3D metrics, as well as a user study, demonstrate SV3D's state-of-the-art performance on NVS as well as 3D reconstruction compared to prior works.
Submitted 18 March, 2024;
originally announced March 2024.
-
LabelAId: Just-in-time AI Interventions for Improving Human Labeling Quality and Domain Knowledge in Crowdsourcing Systems
Authors:
Chu Li,
Zhihan Zhang,
Michael Saugstad,
Esteban Safranchik,
Minchu Kulkarni,
Xiaoyu Huang,
Shwetak Patel,
Vikram Iyer,
Tim Althoff,
Jon E. Froehlich
Abstract:
Crowdsourcing platforms have transformed distributed problem-solving, yet quality control remains a persistent challenge. Traditional quality control measures, such as prescreening workers and refining instructions, often focus solely on optimizing economic output. This paper explores just-in-time AI interventions to enhance both labeling quality and domain-specific knowledge among crowdworkers. We introduce LabelAId, an advanced inference model combining Programmatic Weak Supervision (PWS) with FT-Transformers to infer label correctness based on user behavior and domain knowledge. Our technical evaluation shows that our LabelAId pipeline consistently outperforms state-of-the-art ML baselines, improving mistake inference accuracy by 36.7% with 50 downstream samples. We then implemented LabelAId into Project Sidewalk, an open-source crowdsourcing platform for urban accessibility. A between-subjects study with 34 participants demonstrates that LabelAId significantly enhances label precision without compromising efficiency while also increasing labeler confidence. We discuss LabelAId's success factors, limitations, and its generalizability to other crowdsourced science domains.
Submitted 14 March, 2024;
originally announced March 2024.
-
Optimal Policy Sparsification and Low Rank Decomposition for Deep Reinforcement Learning
Authors:
Vikram Goddla
Abstract:
Deep reinforcement learning (DRL) has shown significant promise in a wide range of applications including computer games and robotics. Yet, training DRL policies consumes extraordinary computing resources, resulting in dense policies which are prone to overfitting. Moreover, inference with dense DRL policies limits their practical applications, especially in edge computing. Techniques such as pruning and singular value decomposition have been used with deep learning models to achieve sparsification and model compression to limit overfitting and reduce memory consumption. However, these techniques resulted in sub-optimal performance with notable decay in rewards. $L_1$ and $L_2$ regularization techniques have been proposed for neural network sparsification and sparse auto-encoder development, but their implementation in DRL environments has not been apparent. We propose a novel $L_0$-norm-regularization technique using an optimal sparsity map to sparsify DRL policies and promote their decomposition to a lower rank without decay in rewards. We evaluated our $L_0$-norm-regularization technique across five different environments (Cartpole-v1, Acrobot-v1, LunarLander-v2, SuperMarioBros-7.1.v0 and Surgical Robot Learning) using several on-policy and off-policy algorithms. We demonstrated that the $L_0$-norm-regularized DRL policy in the SuperMarioBros environment achieved 93% sparsity and gained 70% compression when subjected to low-rank decomposition, while significantly outperforming the dense policy. Additionally, the $L_0$-norm-regularized DRL policy in the Surgical Robot Learning environment achieved a 36% sparsification and gained 46% compression when decomposed to a lower rank, while being performant. The results suggest that our custom $L_0$-norm-regularization technique for sparsification of DRL policies is a promising avenue to reduce computational resources and limit overfitting.
Submitted 10 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Stage: Query Execution Time Prediction in Amazon Redshift
Authors:
Ziniu Wu,
Ryan Marcus,
Zhengchun Liu,
Parimarjan Negi,
Vikram Nathan,
Pascal Pfeil,
Gaurav Saxena,
Mohammad Rahman,
Balakrishnan Narayanaswamy,
Tim Kraska
Abstract:
Query performance (e.g., execution time) prediction is a critical component of modern DBMSes. As a pioneering cloud data warehouse, Amazon Redshift relies on an accurate execution time prediction for many downstream tasks, ranging from high-level optimizations, such as automatically creating materialized views, to low-level tasks on the critical path of query execution, such as admission, scheduling, and execution resource control. Unfortunately, many existing execution time prediction techniques, including those used in Redshift, suffer from cold-start issues and inaccurate estimation, and are not robust against workload/data changes.
In this paper, we propose a novel hierarchical execution time predictor: the Stage predictor. The Stage predictor is designed to leverage the unique characteristics and challenges faced by Redshift. The Stage predictor consists of three model states: an execution time cache, a lightweight local model optimized for a specific DB instance with uncertainty measurement, and a complex global model that is transferable across all instances in Redshift. We design a systematic approach to use these models that best leverages optimality (cache), instance-optimization (local model), and transferable knowledge about Redshift (global model). Experimentally, we show that the Stage predictor makes more accurate and robust predictions while maintaining a practical inference latency and memory overhead. Overall, the Stage predictor can improve the average query execution latency by $20\%$ on these instances compared to the prior query performance predictor in Redshift.
Submitted 4 March, 2024;
originally announced March 2024.
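The three-tier design described in the Stage abstract above (execution time cache, local model with uncertainty, global model) can be sketched roughly as follows. The class name, the callable signatures, and the routing rule (use the local model only when its uncertainty is below a threshold) are assumptions for illustration; the abstract does not spell out how Redshift actually chooses between the models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class StagePredictorSketch:
    # Hypothetical local model: returns (predicted seconds, uncertainty).
    local_model: Callable[[str], Tuple[float, float]]
    # Hypothetical global model: returns predicted seconds.
    global_model: Callable[[str], float]
    uncertainty_threshold: float = 0.5  # assumed routing threshold
    cache: Dict[str, float] = field(default_factory=dict)

    def predict(self, query: str) -> Tuple[float, str]:
        """Return (predicted seconds, name of the tier that answered)."""
        if query in self.cache:                        # tier 1: exact repeat of a seen query
            return self.cache[query], "cache"
        estimate, uncertainty = self.local_model(query)
        if uncertainty <= self.uncertainty_threshold:  # tier 2: confident local model
            return estimate, "local"
        return self.global_model(query), "global"      # tier 3: transferable fallback

    def observe(self, query: str, seconds: float) -> None:
        """Record an observed execution time so repeats hit the cache."""
        self.cache[query] = seconds
```

In this sketch a repeated query is answered from the cache, a new query the local model is confident about is answered locally, and everything else falls through to the global model, which mirrors the cold-start motivation in the abstract.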
-
EcoVal: An Efficient Data Valuation Framework for Machine Learning
Authors:
Ayush K Tarun,
Vikram S Chundawat,
Murari Mandal,
Hong Ming Tan,
Bowei Chen,
Mohan Kankanhalli
Abstract:
Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley value based frameworks for data valuation in machine learning are computationally expensive as they require a considerable amount of repeated training of the model to obtain the Shapley value. In this paper, we introduce an efficient data valuation framework, EcoVal, to estimate the value of data for machine learning models in a fast and practical manner. Instead of directly working with individual data samples, we determine the value of a cluster of similar data points. This value is further propagated amongst all the member cluster points. We show that the overall data value can be determined by estimating the intrinsic and extrinsic value of each data point. This is enabled by formulating the performance of a model as a \textit{production function}, a concept which is popularly used to estimate the amount of output based on factors like labor and capital in a traditional free economic market. We provide a formal proof of our valuation technique and elucidate the principles and mechanisms that enable its accelerated performance. We demonstrate the real-world applicability of our method by showcasing its effectiveness for both in-distribution and out-of-sample data. This work addresses one of the core challenges of efficient data valuation at scale in machine learning models.
Submitted 7 April, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Context-Aware Spectrum Coexistence of Terrestrial Beyond 5G Networks in Satellite Bands
Authors:
Ta Seen Reaz Niloy,
Zoheb Hasan,
Rob Smith,
Vikram R. Anapana,
Vijay K. Shah
Abstract:
Spectrum sharing between terrestrial 5G and incumbent networks in the satellite bands presents a promising avenue to satisfy the ever-increasing bandwidth demand of the next-generation wireless networks. However, protecting incumbent operations from harmful interference poses a fundamental challenge in accommodating terrestrial broadband cellular networks in the satellite bands. State-of-the-art spectrum-sharing policies usually consider several worst-case assumptions and ignore site-specific contextual factors in making spectrum-sharing decisions, and thus often result in under-utilization of the shared band for the secondary licensees. To address such limitations, this paper introduces the CAT3S (Context-Aware Terrestrial-Satellite Spectrum Sharing) framework, which empowers the coexisting terrestrial 5G network to maximize utilization of the shared satellite band without creating harmful interference to the incumbent links by exploiting the contextual factors. CAT3S consists of the following two components: (i) a context-acquisition unit to collect and process essential contextual information for spectrum sharing and (ii) a context-aware base station (BS) control unit to optimize the set of operational BSs and their operation parameters (i.e., transmit power and active beams per sector). To evaluate the performance of CAT3S, a realistic spectrum coexistence case study over the 12 GHz band is considered. Experiment results demonstrate that the proposed CAT3S achieves notably higher spectrum utilization than state-of-the-art spectrum-sharing policies in different weather contexts.
Submitted 14 February, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Probing Critical Learning Dynamics of PLMs for Hate Speech Detection
Authors:
Sarah Masud,
Mohammad Aflah Khan,
Vikram Goyal,
Md Shad Akhtar,
Tanmoy Chakraborty
Abstract:
Despite their widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs' use in hate speech detection. We deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.
Submitted 3 February, 2024;
originally announced February 2024.
-
Energy-conserving equivariant GNN for elasticity of lattice architected metamaterials
Authors:
Ivan Grega,
Ilyes Batatia,
Gábor Csányi,
Sri Karlapati,
Vikram S. Deshpande
Abstract:
Lattices are architected metamaterials whose properties strongly depend on their geometrical design. The analogy between lattices and graphs enables the use of graph neural networks (GNNs) as a faster surrogate model compared to traditional methods such as finite element modelling. In this work, we generate a big dataset of structure-property relationships for strut-based lattices. The dataset is made available to the community and can fuel the development of methods anchored in physical principles for the fitting of fourth-order tensors. In addition, we present a higher-order GNN model trained on this dataset. The key features of the model are (i) SE(3) equivariance, and (ii) consistency with the thermodynamic law of conservation of energy. We compare the model to non-equivariant models based on a number of error metrics and demonstrate its benefits in terms of predictive performance and reduced training requirements. Finally, we demonstrate an example application of the model to an architected material design task. The methods which we developed are applicable to fourth-order tensors beyond elasticity, such as the piezo-optical tensor.
Submitted 20 March, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Mean Estimation with User-Level Privacy for Spatio-Temporal IoT Datasets
Authors:
V. Arvind Rameshwar,
Anshoo Tandon,
Prajjwal Gupta,
Aditya Vikram Singh,
Novoneel Chakraborty,
Abhay Sharma
Abstract:
This paper considers the problem of the private release of sample means of speed values from traffic datasets. Our key contribution is the development of user-level differentially private algorithms that incorporate carefully chosen parameter values to ensure low estimation errors on real-world datasets, while ensuring privacy. We test our algorithms on ITMS (Intelligent Traffic Management System) data from an Indian city, where the speeds of different buses are drawn in a potentially non-i.i.d. manner from an unknown distribution, and where the number of speed samples contributed by different buses is potentially different. We then apply our algorithms to large synthetic datasets, generated based on the ITMS data. Here, we provide theoretical justification for the observed performance trends, and also provide recommendations for the choices of algorithm subroutines that result in low estimation errors. Finally, we characterize the best performance of pseudo-user creation-based algorithms on worst-case datasets via a minimax approach; this then gives rise to a novel procedure for the creation of pseudo-users, which optimizes the worst-case total estimation error. The algorithms discussed in the paper are readily applicable to general spatio-temporal IoT datasets for releasing a differentially private mean of a desired value.
Submitted 25 April, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
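The idea of user-level privacy in the abstract above can be illustrated with a simple baseline, which is not the paper's algorithm. Each user's samples are clipped to a public range, each user is collapsed to a single per-user mean, and Laplace noise calibrated to one user's maximum influence is added to the average. The function name, the clipping bounds, and the assumption that the number of users is public are all hypothetical choices for this sketch.

```python
import random

def private_user_level_mean(user_samples, lo, hi, epsilon, seed=None):
    """Release a noisy average of per-user means with user-level epsilon-DP.

    user_samples: dict mapping a user id to a list of numeric samples.
    Replacing one user's entire data moves the average of per-user means by at
    most (hi - lo) / n_users, so Laplace noise with that sensitivity suffices.
    Assumes the number of users is public and fixed.
    """
    rng = random.Random(seed)
    user_means = []
    for samples in user_samples.values():
        clipped = [min(max(s, lo), hi) for s in samples]
        user_means.append(sum(clipped) / len(clipped))
    n_users = len(user_means)
    scale = (hi - lo) / (n_users * epsilon)
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(user_means) / n_users + noise
```

Averaging per-user means keeps the sensitivity independent of how many samples any single bus contributes, at the cost of estimating a different quantity than the plain sample mean when users contribute unequal numbers of samples.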
-
AVELA -- A Vision for Engineering Literacy & Access: Understanding Why Technology Alone Is Not Enough
Authors:
Kyle Johnson,
Vicente Arroyos,
Celeste Garcia,
Liban Hussein,
Aisha Cora,
Tsewone Melaku,
Jay L. Cunningham,
R. Benjamin Shapiro,
Vikram Iyer
Abstract:
Unequal technology access for Black and Latine communities has been a persistent economic, social justice, and human rights issue despite increased technology accessibility due to advancements in consumer electronics like phones, tablets, and computers. We contextualize socio-technical access inequalities for Black and Latine urban communities and find that many students are hesitant to engage with available technologies due to a lack of engaging support systems. We present a holistic student-led STEM engagement model through AVELA - A Vision for Engineering Literacy and Access leveraging culturally responsive lessons, mentor embodied community representation, and service learning. To evaluate the model's impact after 4 years of mentoring 200+ university student instructors in teaching to 2,500+ secondary school students in 100+ classrooms, we conducted 24 semi-structured interviews with college AnonymizedOrganization members. We identify access barriers and provide principled recommendations for designing future STEM education programs.
Submitted 29 January, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Authors:
Jing Yu Koh,
Robert Lo,
Lawrence Jang,
Vikram Duvvur,
Ming Chong Lim,
Po-Yu Huang,
Graham Neubig,
Shuyan Zhou,
Ruslan Salakhutdinov,
Daniel Fried
Abstract:
Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data are publicly available at https://jykoh.com/vwa.
Submitted 5 June, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Toward a Reinforcement-Learning-Based System for Adjusting Medication to Minimize Speech Disfluency
Authors:
Pavlos Constas,
Vikram Rawal,
Matthew Honorio Oliveira,
Andreas Constas,
Aditya Khan,
Kaison Cheung,
Najma Sultani,
Carrie Chen,
Micol Altomare,
Michael Akzam,
Jiacheng Chen,
Vhea He,
Lauren Altomare,
Heraa Murqi,
Asad Khan,
Nimit Amikumar Bhanshali,
Youssef Rachad,
Michael Guerzhoy
Abstract:
We propose a reinforcement learning (RL)-based system that would automatically prescribe a hypothetical patient medication that may help the patient with their mental health-related speech disfluency, and adjust the medication and the dosages in response to zero-cost frequent measurement of the fluency of the patient. We demonstrate the components of the system: a module that detects and evaluates speech disfluency on a large dataset we built, and an RL algorithm that automatically finds good combinations of medications. To support the two modules, we collect data on the effect of psychiatric medications for speech disfluency from the literature, and build a plausible patient simulation system. We demonstrate that the RL system is, under some circumstances, able to converge to a good medication regime. We collect and label a dataset of people with possible speech disfluency and demonstrate our methods using that dataset. Our work is a proof of concept: we show that there is promise in the idea of using automatic data collection to address speech disfluency.
Submitted 5 February, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
Proportional Representation in Metric Spaces and Low-Distortion Committee Selection
Authors:
Yusuf Hakan Kalayci,
David Kempe,
Vikram Kher
Abstract:
We introduce a novel definition for a small set R of k points being "representative" of a larger set in a metric space. Given a set V (e.g., documents or voters) to represent, and a set C of possible representatives, our criterion requires that for any subset S comprising a theta fraction of V, the average distance of S to their best theta*k points in R should not be more than a factor gamma compared to their average distance to the best theta*k points among all of C. This definition is a strengthening of proportional fairness and core fairness, but - different from those notions - requires that large cohesive clusters be represented proportionally to their size.
Since there are instances for which - unless gamma is polynomially large - no solutions exist, we study this notion in a resource augmentation framework, implicitly stating the constraints for a set R of size k as though its size were only k/alpha, for alpha > 1. Furthermore, motivated by the application to elections, we mostly focus on the "ordinal" model, where the algorithm does not learn the actual distances; instead, it learns only, for each point v in V and each pair of candidates c, c', which of c, c' is closer to v. Our main result is that the Expanding Approvals Rule (EAR) of Aziz and Lee is (alpha, gamma) representative with gamma <= 1 + 6.71 * (alpha)/(alpha-1).
Our results lead to three notable byproducts. First, we show that the EAR achieves constant proportional fairness in the ordinal model, giving the first positive result on metric proportional fairness with ordinal information. Second, we show that for the core fairness objective, the EAR achieves the same asymptotic tradeoff between resource augmentation and approximation as the recent results of Li et al., which used full knowledge of the metric. Finally, our results imply a very simple single-winner voting rule with metric distortion at most 44.
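To give a sense of scale for the main result, the stated bound can be evaluated at a few values of alpha. The values of alpha below are illustrative choices and do not come from the abstract.

```latex
\[
\gamma \;\le\; 1 + 6.71\cdot\frac{\alpha}{\alpha-1}, \qquad \alpha > 1 .
\]
\[
\alpha = 1.5:\ \gamma \le 1 + 6.71\cdot 3 = 21.13, \qquad
\alpha = 2:\ \gamma \le 1 + 6.71\cdot 2 = 14.42, \qquad
\alpha \to \infty:\ 1 + 6.71\cdot\frac{\alpha}{\alpha-1} \to 7.71 .
\]
```

The bound grows without limit as alpha approaches 1, which is consistent with the abstract's remark that some instances admit no solution without resource augmentation unless gamma is polynomially large.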
Submitted 23 January, 2024; v1 submitted 16 December, 2023;
originally announced December 2023.
-
Exploiting Representation Bias for Data Distillation in Abstractive Text Summarization
Authors:
Yash Kumar Atri,
Vikram Goyal,
Tanmoy Chakraborty
Abstract:
Abstractive text summarization is surging with the number of training samples to cater to the needs of the deep learning models. These models tend to exploit the training data representations to attain superior performance by improving the quantitative element of the resultant summary. However, increasing the size of the training set may not always be the ideal solution to maximize the performance, and therefore, a need to revisit the quality of training samples and the learning protocol of deep learning models is a must. In this paper, we aim to discretize the vector space of the abstractive text summarization models to understand the characteristics learned between the input embedding space and the models' encoder space. We show that deep models fail to capture the diversity of the input space. Further, the distribution of data points on the encoder space indicates that an unchecked increase in the training samples does not add value; rather, a tear-down of data samples is highly needed to make the models focus on variability and faithfulness. We employ clustering techniques to learn the diversity of a model's sample space and how data points are mapped from the embedding space to the encoder space and vice versa. Further, we devise a metric to filter out redundant data points to make the model more robust and less data hungry. We benchmark our proposed method using quantitative metrics, such as Rouge, and qualitative metrics, such as BERTScore, FEQA and Pyramid score. We also quantify the reasons that inhibit the models from learning the diversity from the varied input samples.
Submitted 20 December, 2023; v1 submitted 10 December, 2023;
originally announced December 2023.
-
Training Chain-of-Thought via Latent-Variable Inference
Authors:
Du Phan,
Matthew D. Hoffman,
David Dohan,
Sholto Douglas,
Tuan Anh Le,
Aaron Parisi,
Pavel Sountsov,
Charles Sutton,
Sharad Vikram,
Rif A. Saurous
Abstract:
Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
Submitted 28 November, 2023;
originally announced December 2023.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Authors:
Andreas Blattmann,
Tim Dockhorn,
Sumith Kulal,
Daniel Mendelevitch,
Maciej Kilian,
Dominik Lorenz,
Yam Levi,
Zion English,
Vikram Voleti,
Adam Letts,
Varun Jampani,
Robin Rombach
Abstract:
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models .
Submitted 25 November, 2023;
originally announced November 2023.
-
From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models
Authors:
Zachary Englhardt,
Chengqian Ma,
Margaret E. Morris,
Xuhai "Orson" Xu,
Chun-Cheng Chang,
Lianhui Qin,
Daniel McDuff,
Xin Liu,
Shwetak Patel,
Vikram Iyer
Abstract:
Passively collected behavioral health data from ubiquitous sensors holds significant promise to provide mental health professionals insights from patients' daily lives; however, developing analysis tools to use this data in clinical practice requires addressing challenges of generalization across devices and weak or ambiguous correlations between the measured signals and an individual's mental health. To address these challenges, we take a novel approach that leverages large language models (LLMs) to synthesize clinically useful insights from multi-sensor data. We develop chain of thought prompting methods that use LLMs to generate reasoning about how trends in data such as step count and sleep relate to conditions like depression and anxiety. We first demonstrate binary depression classification with LLMs achieving accuracies of 61.1% which exceed the state of the art. While it is not robust for clinical use, this leads us to our key finding: even more impactful and valued than classification is a new human-AI collaboration approach in which clinician experts interactively query these tools and combine their domain expertise and context about the patient with AI generated reasoning to support clinical decision-making. We find models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.
Submitted 25 November, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
-
What Lies Beneath? Exploring the Impact of Underlying AI Model Updates in AI-Infused Systems
Authors:
Vikram Mohanty,
Jude Lim,
Kurt Luther
Abstract:
As AI models evolve, understanding the influence of underlying models on user experience and performance in AI-infused systems becomes critical, particularly while transitioning between different model versions. We studied the influence of model change by conducting two complementary studies in the context of AI-based facial recognition for historical person identification tasks. First, we ran an online experiment where crowd workers interacted with two different facial recognition models: an older version and a recently updated, developer-certified more accurate model. Second, we studied a real-world deployment of these models on a popular historical photo platform through a diary study with 10 users. Our findings shed light on models affecting human-AI team performance, users' abilities to differentiate between different models, the folk theories they develop, and how these theories influence their preferences. Drawing from these insights, we discuss design implications for updating models in AI-infused systems.
Submitted 17 November, 2023;
originally announced November 2023.
-
DeltaLCA: Comparative Life-Cycle Assessment for Electronics Design
Authors:
Zhihan Zhang,
Felix Hähnlein,
Yuxuan Mei,
Zachary Englhardt,
Shwetak Patel,
Adriana Schulz,
Vikram Iyer
Abstract:
Reducing the environmental footprint of electronics and computing devices requires new tools that empower designers to make informed decisions about sustainability during the design process itself. This is not possible with current tools for life cycle assessment (LCA) which require substantial domain expertise and time to evaluate the numerous chips and other components that make up a device. We observe first that informed decision-making does not require absolute metrics and can instead be done by comparing designs. Second, we can use domain-specific heuristics to perform these comparisons. We combine these insights to develop DeltaLCA, an open-source interactive design tool that addresses the dual challenges of automating life cycle inventory generation and data availability by performing comparative analyses of electronics designs. Users can upload standard design files from Electronic Design Automation (EDA) software and the tool will guide them through determining which one has greater carbon footprint. DeltaLCA leverages electronics-specific LCA datasets and heuristics and tries to automatically rank the two designs, prompting users to provide additional information only when necessary. We show through case studies DeltaLCA achieves the same result as evaluating full LCAs, and that it accelerates LCA comparisons from eight expert-hours to a single click for devices with ~30 components, and 15 minutes for more complex devices with ~100 components.
Submitted 16 November, 2023;
originally announced November 2023.
-
Reviewing Developments of Graph Convolutional Network Techniques for Recommendation Systems
Authors:
Haojun Zhu,
Vikram Kapoor,
Priya Sharma
Abstract:
The recommender system is a vital information service on today's Internet. Recently, graph neural networks have emerged as the leading approach for recommender systems. We review recent literature on graph neural network-based recommender systems, covering the background and development of both recommender systems and graph neural networks. Then, categorizing recommender systems by their settings and graph neural networks by spectral and spatial models, we explore the motivation behind incorporating graph neural networks into recommender systems. We also analyze challenges and open problems in graph construction, embedding propagation and aggregation, and computation efficiency. This guides us to better explore the future directions and developments in this domain.
Submitted 10 November, 2023;
originally announced November 2023.
-
Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection
Authors:
Akshit Jindal,
Vikram Goyal,
Saket Anand,
Chetan Arora
Abstract:
Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves(AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.
Submitted 8 November, 2023;
originally announced November 2023.
-
nvblox: GPU-Accelerated Incremental Signed Distance Field Mapping
Authors:
Alexander Millane,
Helen Oleynikova,
Emilie Wirbel,
Remo Steiner,
Vikram Ramasamy,
David Tingdahl,
Roland Siegwart
Abstract:
Dense, volumetric maps are essential to enable robot navigation and interaction with the environment. To achieve low latency, dense maps are typically computed onboard the robot, often on computationally constrained hardware. Previous works leave a gap between CPU-based systems for robotic mapping which, due to computation constraints, limit map resolution or scale, and GPU-based reconstruction systems which omit features that are critical to robotic path planning, such as computation of the Euclidean Signed Distance Field (ESDF). We introduce a library, nvblox, that aims to fill this gap, by GPU-accelerating robotic volumetric mapping. Nvblox delivers a significant performance improvement over the state of the art, achieving up to a 177x speed-up in surface reconstruction, and up to a 31x improvement in distance field computation, and is available open-source.
Submitted 15 March, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
SonoSAMTrack -- Segment and Track Anything on Ultrasound Images
Authors:
Hariharan Ravishankar,
Rohan Patil,
Vikram Melapudi,
Harsh Suthar,
Stephan Anzengruber,
Parminder Bhatia,
Kass-Hout Taha,
Pavan Annangi
Abstract:
In this paper, we present SonoSAMTrack, which combines a promptable foundational model for segmenting objects of interest on ultrasound images, called SonoSAM, with a state-of-the-art contour tracking model to propagate segmentations on 2D+t and 3D ultrasound datasets. Fine-tuned and tested exclusively on a rich, diverse set of objects from $\approx200$k ultrasound image-mask pairs, SonoSAM demonstrates state-of-the-art performance on 7 unseen ultrasound datasets, outperforming competing methods by a significant margin. We also extend SonoSAM to 2D+t applications and demonstrate superior performance, making it a valuable tool for generating dense annotations and segmentation of anatomical structures in clinical workflows. Further, to increase practical utility of the work, we propose a two-step process of fine-tuning followed by knowledge distillation to a smaller footprint model without compromising the performance. We present detailed qualitative and quantitative comparisons of SonoSAM with state-of-the-art methods showcasing efficacy of the method. This is followed by demonstrating the reduction in number of clicks in a dense video annotation problem of adult cardiac ultrasound chamber segmentation using SonoSAMTrack.
Submitted 16 November, 2023; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Conditional Generative Modeling for Images, 3D Animations, and Video
Authors:
Vikram Voleti
Abstract:
This dissertation attempts to drive innovation in the field of generative modeling for computer vision, by exploring novel formulations of conditional generative models, and innovative applications in images, 3D animations, and video. Our research focuses on architectures that offer reversible transformations of noise and visual data, and the application of encoder-decoder architectures for generative tasks and 3D content manipulation. In all instances, we incorporate conditional information to enhance the synthesis of visual data, improving the efficiency of the generation process as well as the generated content.
We introduce the use of Neural ODEs to model video dynamics using an encoder-decoder architecture, demonstrating their ability to predict future video frames despite being trained solely to reconstruct current frames. Next, we propose a conditional variant of continuous normalizing flows that enables higher-resolution image generation based on lower-resolution input, achieving comparable image quality while reducing parameters and training time. Our next contribution presents a pipeline that takes human images as input, automatically aligns a user-specified 3D character with the pose of the human, and facilitates pose editing based on partial inputs. Next, we derive the relevant mathematical details for denoising diffusion models that use non-isotropic Gaussian processes, and show comparable generation quality. Finally, we devise a novel denoising diffusion framework capable of solving all three video tasks of prediction, generation, and interpolation. We perform ablation studies, and show SOTA results on multiple datasets.
Our contributions are published articles at peer-reviewed venues. Overall, our research aims to make a meaningful contribution to the pursuit of more efficient and flexible generative models, with the potential to shape the future of computer vision.
Submitted 19 October, 2023;
originally announced October 2023.
-
A Traffic Control Framework for Uncrewed Aircraft Systems
Authors:
Ananay Vikram Gupta,
Aaditya Prakash Kattekola,
Ansh Vikram Gupta,
Dacharla Venkata Abhiram,
Kamesh Namuduri,
Ravichandran Subramanian
Abstract:
The exponential growth of Advanced Air Mobility (AAM) services demands assurances of safety in the airspace. This research presents a Traffic Control Framework (TCF) for developing digital flight rules for Uncrewed Aircraft Systems (UAS) flying in designated air corridors. The proposed TCF helps model, deploy, and test UAS control agents, regardless of their hardware configurations. This paper investigates the importance of digital flight rules in preventing collisions in the context of AAM. TCF is introduced as a platform for developing traffic management strategies that move toward enhanced autonomy in the airspace. It allows for assessment and evaluation of autonomous navigation, route planning, obstacle avoidance, and adaptive decision making for UAS. It also allows for the introduction and evaluation of advanced technologies such as Artificial Intelligence (AI) and Machine Learning (ML) in a simulation environment before deploying them in the real world. TCF can be used as a tool for comprehensive UAS traffic analysis, including KPI measurements. It offers flexibility for further testing and deployment, laying the foundation for improved airspace safety, a vital aspect of UAS technological advancement. Finally, this paper demonstrates the capabilities of the proposed TCF in managing UAS traffic at intersections and its impact on overall traffic flow in air corridors, noting the bottlenecks and the inverse relationship between safety and traffic volume.
Submitted 15 October, 2023;
originally announced October 2023.