-
Automatic speech recognition for the Nepali language using CNN, bidirectional LSTM and ResNet
Authors:
Manish Dhakal,
Arman Chhetri,
Aman Kumar Gupta,
Prabin Lamichhane,
Suraj Pandey,
Subarna Shakya
Abstract:
This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text. The model was trained and tested on the OpenSLR (audio, text) dataset. Most of the audio clips have silent gaps at both ends, which are clipped during dataset preprocessing for a more uniform mapping of audio frames to their corresponding texts. Mel Frequency Cepstral Coefficients (MFCCs) are used as audio features to feed into the model. The model combining a Bidirectional LSTM with ResNet and a one-dimensional CNN produces the best results for this dataset out of all the models (neural networks with variations of LSTM, GRU, CNN, and ResNet) that have been trained so far. This novel model uses the Connectionist Temporal Classification (CTC) function for loss calculation during training and CTC beam search decoding to predict the most likely sequence of Nepali characters. On the test dataset, a character error rate (CER) of 17.06 percent was achieved. The source code is available at: https://github.com/manishdhakal/ASR-Nepali-using-CNN-BiLSTM-ResNet.
Submitted 25 June, 2024;
originally announced June 2024.
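For readers unfamiliar with the character error rate (CER) reported in the abstract above, the following is a minimal sketch of how such a metric is commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names are hypothetical, and the sketch counts Unicode code points; whether the paper counts code points or Devanagari grapheme clusters is not stated in the abstract, so that choice is an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length.

    Assumption: characters are Unicode code points, not grapheme clusters.
    """
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because the result is normalized by the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.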
-
Present and Future of AI in Renewable Energy Domain: A Comprehensive Survey
Authors:
Abdur Rashid,
Parag Biswas,
Angona Biswas,
MD Abdullah Al Nasim,
Kishor Datta Gupta,
Roy George
Abstract:
Artificial intelligence (AI) has become a crucial instrument for streamlining processes in various industries, including electrical power systems, as a result of recent digitalization. AI algorithms are data-driven models based on statistical learning theory and are used as a tool to make use of the data that the power system and its users generate. Initially, we perform a thorough literature analysis of artificial intelligence (AI) applications related to renewable energy (RE). Next, we present a thorough analysis of renewable energy factories and assess their suitability, along with a list of the most widely used and appropriate AI algorithms. Nine AI-based strategies are identified here to assist Renewable Energy (RE) in contemporary power systems. This survey paper comprises an extensive review of the several AI techniques used for renewable energy as well as a methodical analysis of the literature for the study of various intelligent system application domains across different disciplines of renewable energy. This literature review assesses the performance and outcomes of nine different research methods, and it aims to distill valuable insights into their strengths and limitations. This study also addresses three main topics: using AI technology for renewable power generation, utilizing AI for renewable energy forecasting, and optimizing energy systems. Additionally, it explores AI's superiority over conventional models in controllability, data handling, cyberattack prevention, smart grid implementation, and robotics, as well as AI's significance in shaping the future of the energy industry. Furthermore, this article outlines future directions in the integration of AI for renewable energy.
Submitted 22 June, 2024;
originally announced June 2024.
-
AI-Driven Approaches for Optimizing Power Consumption: A Comprehensive Survey
Authors:
Parag Biswas,
Abdur Rashid,
Angona Biswas,
Md Abdullah Al Nasim,
Kishor Datta Gupta,
Roy George
Abstract:
Reduced environmental impact, lower operating costs, and a stable and sustainable energy supply for current and future generations are the main reasons why power optimization is important. Power optimization ensures that energy is used more effectively, cutting down on waste and optimizing the utilization of resources. In today's world, power optimization and artificial intelligence (AI) integration are essential to changing the way energy is produced, used, and distributed. Real-time monitoring and analysis of power usage trends is made possible by AI-driven algorithms and predictive analytics, which enable dynamic modifications to effectively satisfy demand. Efficiency and sustainability are increased when power consumption is optimized in different sectors thanks to the use of intelligent systems. This survey paper comprises an extensive review of the several AI techniques used for power optimization as well as a methodical analysis of the literature for the study of various intelligent system application domains across different disciplines of power consumption. This literature review assesses the performance and outcomes of 17 different research methods, and it aims to distill valuable insights into their strengths and limitations. Furthermore, this article outlines future directions in the integration of AI for power consumption optimization.
Submitted 22 June, 2024;
originally announced June 2024.
-
Memory Faults in Activation-sparse Quantized Deep Neural Networks: Analysis and Mitigation using Sharpness-aware Training
Authors:
Akul Malhotra,
Sumeet Kumar Gupta
Abstract:
Improving the hardware efficiency of deep neural network (DNN) accelerators with techniques such as quantization and sparsity enhancement has shown immense promise. However, their inference accuracy in non-ideal real-world settings (such as in the presence of hardware faults) is yet to be systematically analyzed. In this work, we investigate the impact of memory faults on activation-sparse quantized DNNs (AS QDNNs). We show that a high level of activation sparsity comes at the cost of larger vulnerability to faults, with AS QDNNs exhibiting up to 11.13% lower accuracy than the standard QDNNs. We establish that the degraded accuracy correlates with sharper minima in the loss landscape for AS QDNNs, which makes them more sensitive to perturbations in the weight values due to faults. Based on this observation, we employ sharpness-aware quantization (SAQ) training to mitigate the impact of memory faults. The AS and standard QDNNs trained with SAQ have up to 19.50% and 15.82% higher inference accuracy, respectively, compared to their conventionally trained equivalents. Moreover, we show that SAQ-trained AS QDNNs achieve higher accuracy in faulty settings than standard QDNNs trained conventionally. Thus, sharpness-aware training can be instrumental in achieving sparsity-related latency benefits without compromising on fault tolerance.
Submitted 15 June, 2024;
originally announced June 2024.
-
On Improving Error Resilience of Neural End-to-End Speech Coders
Authors:
Kishan Gupta,
Nicola Pia,
Srikanth Korse,
Andreas Brendel,
Guillaume Fuchs,
Markus Multrus
Abstract:
Error-resilient tools like Packet Loss Concealment (PLC) and Forward Error Correction (FEC) are essential to maintain reliable speech communication for applications like Voice over Internet Protocol (VoIP), where packets are frequently delayed and lost. In recent times, end-to-end neural speech codecs have seen a significant rise due to their ability to transmit speech signals at low bitrates, but little consideration has been given to their error resilience in a real system. The recently introduced Neural End-to-End Speech Codec (NESC) can reproduce high-quality natural speech at low bitrates. We extend its robustness to packet losses by adding a low-complexity network to predict the codebook indices in latent space. Furthermore, we propose a method to add an in-band FEC at an additional bitrate of 0.8 kbps. Both subjective and objective assessments indicate the effectiveness of the proposed methods and demonstrate that coupling PLC and FEC provides significant robustness against packet losses.
Submitted 13 June, 2024;
originally announced June 2024.
-
UVIS: Unsupervised Video Instance Segmentation
Authors:
Shuaiyi Huang,
Saksham Suri,
Kamal Gupta,
Sai Saketh Rambhatla,
Ser-nam Lim,
Abhinav Shrivastava
Abstract:
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation (UVIS) framework that can perform video instance segmentation without any video annotations or dense label-based pretraining. Our key insight comes from leveraging the dense shape prior from the self-supervised vision foundation model DINO and the open-set recognition ability of the image-caption supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YoutubeVIS-2019, YoutubeVIS-2021, and Occluded VIS. Our UVIS achieves 21.1 AP on YoutubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.
Submitted 10 June, 2024;
originally announced June 2024.
-
Second-Order Algorithms for Finding Local Nash Equilibria in Zero-Sum Games
Authors:
Kushagra Gupta,
Xinjie Liu,
Ufuk Topcu,
David Fridovich-Keil
Abstract:
Zero-sum games arise in a wide variety of problems, including robust optimization and adversarial learning. However, algorithms deployed for finding a local Nash equilibrium in these games often converge to non-Nash stationary points. This highlights a key challenge: for any algorithm, the stability properties of its underlying dynamical system can cause non-Nash points to be potential attractors. To overcome this challenge, algorithms must account for subtleties involving the curvatures of players' costs. To this end, we leverage dynamical system theory and develop a second-order algorithm for finding a local Nash equilibrium in the smooth, possibly nonconvex-nonconcave, zero-sum game setting. First, we prove that this novel method guarantees convergence to only local Nash equilibria with a local linear convergence rate. We then interpret a version of this method as a modified Gauss-Newton algorithm with local superlinear convergence to the neighborhood of a point that satisfies first-order local Nash equilibrium conditions. In comparison, current related state-of-the-art methods do not offer convergence rate guarantees. Furthermore, we show that this approach naturally generalizes to settings with convex and potentially coupled constraints while retaining earlier guarantees of convergence to only local (generalized) Nash equilibria.
Submitted 5 June, 2024;
originally announced June 2024.
-
Aurora: A Foundation Model of the Atmosphere
Authors:
Cristian Bodnar,
Wessel P. Bruinsma,
Ana Lucic,
Megan Stanley,
Johannes Brandstetter,
Patrick Garvan,
Maik Riechert,
Jonathan Weyn,
Haiyu Dong,
Anna Vaughan,
Jayesh K. Gupta,
Kit Tambiratnam,
Alex Archibald,
Elizabeth Heider,
Max Welling,
Richard E. Turner,
Paris Perdikaris
Abstract:
Deep learning foundation models are revolutionizing many facets of science by leveraging vast amounts of data to learn general-purpose representations that can be adapted to tackle diverse downstream tasks. Foundation models hold the promise to also transform our ability to model our planet and its subsystems by exploiting the vast expanse of Earth system data. Here we introduce Aurora, a large-scale foundation model of the atmosphere trained on over a million hours of diverse weather and climate data. Aurora leverages the strengths of the foundation modelling approach to produce operational forecasts for a wide variety of atmospheric prediction problems, including those with limited training data, heterogeneous variables, and extreme events. In under a minute, Aurora produces 5-day global air pollution predictions and 10-day high-resolution weather forecasts that outperform state-of-the-art classical simulation tools and the best specialized deep learning models. Taken together, these results indicate that foundation models can transform environmental forecasting.
Submitted 28 May, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
Exploring Ordinality in Text Classification: A Comparative Study of Explicit and Implicit Techniques
Authors:
Siva Rajesh Kasa,
Aniket Goel,
Karan Gupta,
Sumegh Roychowdhury,
Anish Bhanushali,
Nikhil Pattisapu,
Prasanna Srinivasa Murthy
Abstract:
Ordinal Classification (OC) is a widely encountered challenge in Natural Language Processing (NLP), with applications in various domains such as sentiment analysis, rating prediction, and more. Previous approaches to tackling OC have primarily focused on modifying existing or creating novel loss functions that explicitly account for the ordinal nature of labels. However, with the advent of Pretrained Language Models (PLMs), it became possible to tackle ordinality through the implicit semantics of the labels as well. This paper provides a comprehensive theoretical and empirical examination of both these approaches. Furthermore, we also offer strategic recommendations regarding the most effective approach to adopt based on specific settings.
Submitted 20 May, 2024;
originally announced May 2024.
-
CPS-LLM: Large Language Model based Safe Usage Plan Generator for Human-in-the-Loop Human-in-the-Plant Cyber-Physical System
Authors:
Ayan Banerjee,
Aranyak Maity,
Payal Kamboj,
Sandeep K. S. Gupta
Abstract:
We explore the usage of large language models (LLMs) in human-in-the-loop human-in-the-plant cyber-physical systems (CPS) to translate a high-level prompt into a personalized plan of actions, and subsequently convert that plan into a grounded inference of sequential decision-making automated by a real-world CPS controller to achieve a control goal. We show that it is relatively straightforward to contextualize an LLM so it can generate domain-specific plans. However, these plans may be infeasible for the physical system to execute or may be unsafe for human users. To address this, we propose CPS-LLM, an LLM retrained using an instruction tuning framework, which ensures that generated plans not only align with the physical system dynamics of the CPS but are also safe for human users. The CPS-LLM consists of two innovative components: a) a liquid time constant neural network-based physical dynamics coefficient estimator that can derive coefficients of dynamical models with some unmeasured state variables; b) an LLM trained with prompts that embed traces from the dynamical system and the corresponding model coefficients. We show that when the CPS-LLM is integrated with a contextualized chatbot such as BARD, it can generate feasible and safe plans to manage external events such as meals for automated insulin delivery systems used by Type 1 Diabetes subjects.
Submitted 19 May, 2024;
originally announced May 2024.
-
Simple and Efficient Quantization Techniques for Neural Speech Coding
Authors:
Andreas Brendel,
Nicola Pia,
Kishan Gupta,
Guillaume Fuchs,
Markus Multrus
Abstract:
Neural audio coding has emerged as a vibrant research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder has to be learned that allows for efficient transmission of the input audio signal. This discrete representation is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ), and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters, or codebook storage, thereby simplifying the training of neural audio codecs. Furthermore, we propose a new causal network architecture for neural speech coding that shows good performance at very low computational complexity.
Submitted 14 May, 2024;
originally announced May 2024.
-
Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses
Authors:
Gaurav Kumar Gupta,
Aditi Singh,
Sijo Valayakkad Manikandan,
Abul Ehtesham
Abstract:
The recent swift development of LLMs like GPT-4, Gemini, and GPT-3.5 offers a transformative opportunity in medicine and healthcare, especially in digital diagnostics. This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses, and it demonstrates how each of these models could significantly increase diagnostic accuracy and efficiency. Through a series of diagnostic prompts based on symptoms from medical databases, GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data. Meanwhile, Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model when physicians are trying to make high-risk diagnoses. GPT-3.5, though slightly less advanced, is a good tool for medical diagnostics. This study highlights the need to study LLMs for healthcare and clinical practices with more care and attention, ensuring that any system utilizing LLMs promotes patient privacy and complies with health information privacy laws such as HIPAA, as well as the social consequences that affect the varied individuals in complex healthcare contexts. This study marks the start of a larger future effort to study the various ways in which assigning ethical concerns to LLMs' task of learning from human biases could unearth new ways to apply AI in complex medical settings.
Submitted 9 May, 2024;
originally announced May 2024.
-
Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM
Authors:
Laksh Nanwani,
Kumaraditya Gupta,
Aditya Mathur,
Swayam Agrawal,
A. H. Abdul Hafez,
K. Madhava Krishna
Abstract:
Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn't be able to identify.
Submitted 27 April, 2024;
originally announced April 2024.
-
PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models
Authors:
Shashi Kant Gupta,
Aditya Basu,
Mauro Nievas,
Jerrin Thomas,
Nathan Wolfrath,
Adhitya Ramamurthi,
Bradley Taylor,
Anai N. Kothari,
Regina Schwind,
Therica M. Miller,
Sorena Nadaf-Rahrov,
Yanshan Wang,
Hrituraj Singh
Abstract:
Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first end-to-end, large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM, and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.
Submitted 26 April, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
Integrating Physiological Data with Large Language Models for Empathic Human-AI Interaction
Authors:
Poorvesh Dongre,
Majid Behravan,
Kunal Gupta,
Mark Billinghurst,
Denis Gračanin
Abstract:
This paper explores enhancing empathy in Large Language Models (LLMs) by integrating them with physiological data. We propose a physiological computing approach that includes developing deep learning models that use physiological data for recognizing psychological states and integrating the predicted states with LLMs for empathic interaction. We showcase the application of this approach in an Empathic LLM (EmLLM) chatbot for stress monitoring and control. We also discuss the results of a pilot study that evaluates this EmLLM chatbot based on its ability to accurately predict user stress, provide human-like responses, and assess the therapeutic alliance with the user.
Submitted 14 April, 2024;
originally announced April 2024.
-
Scaling Instructable Agents Across Many Simulated Worlds
Authors:
SIMA Team,
Maria Abi Raad,
Arun Ahuja,
Catarina Barros,
Frederic Besse,
Andrew Bolt,
Adrian Bolton,
Bethanie Brownfield,
Gavin Buttimore,
Max Cant,
Sarah Chakera,
Stephanie C. Y. Chan,
Jeff Clune,
Adrian Collister,
Vikki Copeman,
Alex Cullum,
Ishita Dasgupta,
Dario de Cesare,
Julia Di Trapani,
Yani Donchev,
Emma Dunleavy,
Martin Engelcke,
Ryan Faulkner,
Frankie Garcia,
Charles Gbadamosi
, et al. (68 additional authors not shown)
Abstract:
Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
Submitted 17 April, 2024; v1 submitted 13 March, 2024;
originally announced April 2024.
-
On Naisargik Images of Varshamov-Tenengolts and Helberg Codes
Authors:
Kalp Pandya,
Devdeep Shetranjiwala,
Naisargi Savaliya,
Manish K. Gupta
Abstract:
The VT and Helberg codes, both in binary and non-binary forms, stand as elegant solutions for rectifying insertion and deletion errors. In this paper we consider the quaternary versions of these codes. It is well known that many optimal binary non-linear codes like Kerdock and Preparata can be depicted as Gray images (isometry) of codes defined over $\mathbb{Z}_4$. Thus a natural question arises: Can we find similar maps between quaternary and binary spaces that give interesting properties when applied to the VT and Helberg codes? We found several such maps, called Naisargik (natural) maps, and we study the images of quaternary VT and Helberg codes under these maps. Naisargik and inverse Naisargik images give interesting error-correcting properties for VT and Helberg codes. If two Naisargik images of a VT code generate intersecting single-deletion spheres, then the images have the same weights. A quaternary Helberg code designed to correct $s$ deletions can effectively rectify $s+1$ deletion errors when considering its Naisargik image, and an $s$-deletion-correcting binary Helberg code can correct $\lfloor\frac{s}{2}\rfloor$ errors with its inverse Naisargik image.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology
Authors:
Shashi Kant Gupta,
Aditya Basu,
Bradley Taylor,
Anai Kothari,
Hrituraj Singh
Abstract:
Retrieving information from EHR systems is essential for answering specific questions about patient journeys and improving the delivery of clinical care. Despite this, most EHR systems still rely on keyword-based searches. With the advent of generative large language models (LLMs), retrieving information can lead to better search and summarization capabilities. Such retrievers can also feed retrieval-augmented generation (RAG) pipelines to answer any query. However, retrieving information from real-world clinical data contained within EHR systems in order to solve several downstream use cases is challenging due to the difficulty of creating query-document support pairs. We provide a blueprint for creating such datasets in an affordable manner using large language models. Our method results in a retriever that is 30-50 F-1 points better than proprietary counterparts such as Ada and Mistral for oncology data elements. We further compare our model, called Onco-Retriever, against a fine-tuned PubMedBERT model. We conduct an extensive manual evaluation on real-world EHR data along with a latency analysis of the different models and provide a path forward for healthcare organizations to build domain-specific retrievers.
Submitted 9 April, 2024;
originally announced April 2024.
-
QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding
Authors:
Yash Mehan,
Kumaraditya Gupta,
Rohit Jayanti,
Anirudh Govil,
Sourav Garg,
Madhava Krishna
Abstract:
Understanding the structural organisation of 3D indoor scenes in terms of rooms is often accomplished via floorplan extraction. Robotic tasks such as planning and navigation require a semantic understanding of the scene as well. This is typically achieved via object-level semantic segmentation. However, such methods struggle to segment out topological regions like "kitchen" in the scene. In this work, we introduce a two-step pipeline. First, we extract a topological map, i.e., floorplan of the indoor scene using a novel multi-channel occupancy representation. Then, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains using a self-attention transformer. Our language-topology alignment supports natural language querying, e.g., a "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding.
Submitted 9 April, 2024;
originally announced April 2024.
-
A Bird-Eye view on DNA Storage Simulators
Authors:
Sanket Doshi,
Mihir Gohel,
Manish K. Gupta
Abstract:
Given the huge demand for storage in today's world, DNA-based storage solutions sound quite promising because of their longevity, low power consumption, and high capacity. In practice, however, storing data in the form of DNA is expensive and challenging. Researchers and developers have therefore built software that simulates real-life DNA storage without incurring that cost. This paper reviews some of the software that performs DNA storage simulations in different domains. It also explains core concepts such as synthesis, sequencing, clustering, reconstruction, GC window, and K-mer window, and gives an overview of existing algorithms. Further, we present three different software tools on the basis of domain, implementation techniques, and customer/commercial usability.
Submitted 7 April, 2024;
originally announced April 2024.
-
Measuring Style Similarity in Diffusion Models
Authors:
Gowthami Somepalli,
Anubhav Gupta,
Kamal Gupta,
Shramay Palta,
Micah Goldblum,
Jonas Geiping,
Abhinav Shrivastava,
Tom Goldstein
Abstract:
Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a generated image is used for professional purposes. Existing tools for this purpose focus on retrieving images of similar semantic content. Meanwhile, many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc. We also propose a method to extract style descriptors that can be used to attribute style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model. Code and artifacts are available at https://github.com/learn2phoenix/CSD.
Submitted 1 April, 2024;
originally announced April 2024.
-
Transfer Learning with Point Transformers
Authors:
Kartik Gupta,
Rahul Vippala,
Sahima Srivastava
Abstract:
Point Transformers are near state-of-the-art models for classification, segmentation, and detection tasks on point cloud data. They utilize a self-attention-based mechanism to model long-range spatial dependencies between multiple point sets. In this project we explore two things: the classification performance of these attention-based networks on the ModelNet10 dataset, and the use of the trained model to classify the 3D MNIST dataset after finetuning. We also train the model from scratch on the 3D MNIST dataset to compare the performance of the finetuned and from-scratch models on the MNIST dataset. We observe that, since the two datasets have a large difference in the degree of the distributions, transfer-learned models do not outperform the from-scratch models in this case. We do, however, expect transfer-learned models to converge faster, since they already know lower-level features such as edges and corners from the ModelNet10 dataset.
Submitted 31 March, 2024;
originally announced April 2024.
-
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Authors:
Taishi Nakamura,
Mayank Mishra,
Simone Tedeschi,
Yekun Chai,
Jason T Stillerman,
Felix Friedrich,
Prateek Yadav,
Tanmay Laud,
Vu Minh Chien,
Terry Yue Zhuo,
Diganta Misra,
Ben Bogin,
Xuan-Son Vu,
Marzena Karpinska,
Arnav Varma Dantuluri,
Wojciech Kusa,
Tommaso Furlanello,
Rio Yokota,
Niklas Muennighoff,
Suhas Pai,
Tosin Adewumi,
Veronika Laippala,
Xiaozhe Yao,
Adalberto Junior,
Alpay Ariyak
, et al. (20 additional authors not shown)
Abstract:
Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, whereas pretraining from scratch is computationally expensive, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations. To promote responsible open-source LLM development, Aurora-M and its variants are released at https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407 .
Submitted 23 April, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Optimal Blackjack Strategy Recommender: A Comprehensive Study on Computer Vision Integration for Enhanced Gameplay
Authors:
Krishnanshu Gupta,
Devon Bolt,
Ben Hinchliff
Abstract:
This research project investigates the application of several computer vision techniques for playing card detection and recognition in the context of the popular casino game, blackjack. The primary objective is to develop a robust system that is capable of detecting and accurately classifying playing cards in real-time, and displaying the optimal move recommendation based on the given image of the current game. The proposed methodology involves using K-Means for image segmentation, card reprojection and feature extraction, training of the KNN classifier using a labeled dataset, and integration of the detection system into a Blackjack Basic Strategy recommendation algorithm. Further, the study aims to observe the effectiveness of this approach in detecting various card designs under different lighting conditions and occlusions. Overall, the project examines the potential benefits of incorporating computer vision techniques, with a specific focus on card detection, into commonly played games, with the aim of enhancing player decision-making and optimizing strategic outcomes. The results obtained from our experimental evaluations, with models developed under considerable time constraints, highlight the potential for practical implementation in real-world casino environments and across other similarly structured games.
Submitted 29 March, 2024;
originally announced April 2024.
-
Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders
Authors:
Rohan Kumar Gupta,
Rohit Sinha
Abstract:
Self-supervised learning (SSL) has been investigated to generate task-agnostic representations across various domains. However, such investigation has not been conducted for detecting multiple mental disorders. The rationale behind the existence of a task-agnostic representation lies in the overlapping symptoms among multiple mental disorders. Consequently, the behavioural data collected for mental health assessment may carry a mixed bag of attributes related to multiple disorders. Motivated by that, in this study, we explore a task-agnostic representation derived through SSL in the context of detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) using audio and video data collected during interactive sessions. This study employs SSL models trained by predicting multiple fixed targets or masked frames. We propose a list of fixed targets to make the generated representation more efficient for detecting MDD and PTSD. Furthermore, we modify the hyper-parameters of the SSL encoder predicting fixed targets to generate global representations that capture varying temporal contexts. Both these innovations are noted to yield improved detection performances for considered mental disorders and exhibit task-agnostic traits. In the context of the SSL model predicting masked frames, the generated global representations are also noted to exhibit task-agnostic traits.
Submitted 22 March, 2024;
originally announced March 2024.
-
LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors
Authors:
Saksham Suri,
Matthew Walmer,
Kamal Gupta,
Abhinav Shrivastava
Abstract:
We present a simple self-supervised method to enhance the performance of ViT features for dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and compact postprocessing network that can be applied to enhance the features of any pre-trained ViT backbone. LiFT is fast and easy to train with a self-supervised objective, and it boosts the density of ViT features for minimal extra inference cost. Furthermore, we demonstrate that LiFT can be applied with approaches that use additional task-specific downstream modules, as we integrate LiFT with ViTDet for COCO detection and segmentation. Despite the simplicity of LiFT, we find that it is not simply learning a more complex version of bilinear interpolation. Instead, our LiFT training protocol leads to several desirable emergent properties that benefit ViT features in dense downstream tasks. This includes greater scale invariance for features, and better object boundary maps. By simply training LiFT for a few epochs, we show improved performance on keypoint correspondence, detection, segmentation, and object discovery tasks. Overall, LiFT provides an easy way to unlock the benefits of denser feature arrays for a fraction of the computational cost. For more details, refer to our project page at https://www.cs.umd.edu/~sakshams/LiFT/.
Submitted 21 March, 2024;
originally announced March 2024.
-
Construction of all MDS and involutory MDS matrices
Authors:
Yogesh Kumar,
P. R. Mishra,
Susanta Samanta,
Kishan Chand Gupta,
Atul Gaur
Abstract:
In this paper, we propose two algorithms for a hybrid construction of all $n\times n$ MDS and involutory MDS matrices over a finite field $\mathbb{F}_{p^m}$, respectively. The proposed algorithms effectively narrow down the search space to identify $(n-1) \times (n-1)$ MDS matrices, facilitating the generation of all $n \times n$ MDS and involutory MDS matrices over $\mathbb{F}_{p^m}$. To the best of our knowledge, existing literature lacks methods for generating all $n\times n$ MDS and involutory MDS matrices over $\mathbb{F}_{p^m}$. In our approach, we introduce a representative matrix form for generating all $n\times n$ MDS and involutory MDS matrices over $\mathbb{F}_{p^m}$. The determination of these representative MDS matrices involves searching through all $(n-1)\times (n-1)$ MDS matrices over $\mathbb{F}_{p^m}$. Our contributions extend to proving that the count of all $3\times 3$ MDS matrices over $\mathbb{F}_{2^m}$ is precisely $(2^m-1)^5(2^m-2)(2^m-3)(2^{2m}-9\cdot 2^m+21)$. Furthermore, we explicitly provide the count of all $4\times 4$ MDS and involutory MDS matrices over $\mathbb{F}_{2^m}$ for $m=2, 3, 4$.
Submitted 15 March, 2024;
originally announced March 2024.
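As a quick sanity check on the closed-form count of $3\times 3$ MDS matrices over $\mathbb{F}_{2^m}$ quoted in the abstract above, the formula can be evaluated directly. The sketch below only does that arithmetic; it does not implement the paper's construction algorithms, and the function name `count_3x3_mds` is a hypothetical helper introduced here for illustration.

```python
def count_3x3_mds(m: int) -> int:
    """Evaluate the count formula quoted in the abstract for the field F_{2^m}.

    Formula: (2^m - 1)^5 (2^m - 2)(2^m - 3)(2^{2m} - 9*2^m + 21).
    """
    q = 2 ** m  # field size 2^m, so 2^{2m} is q * q
    return (q - 1) ** 5 * (q - 2) * (q - 3) * (q * q - 9 * q + 21)


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        print(f"m={m}: {count_3x3_mds(m)}")
```

For $m=1$ the factor $(2^m-2)$ vanishes, so the formula gives 0; this agrees with the fact that an MDS matrix needs all entries nonzero, which over $\mathbb{F}_2$ forces the all-ones matrix with singular $2\times 2$ submatrices. For $m=2$ the formula evaluates to $3^5 \cdot 2 \cdot 1 \cdot 1 = 486$.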
-
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Authors:
Qinyu Zhao,
Ming Xu,
Kartik Gupta,
Akshay Asthana,
Liang Zheng,
Stephen Gould
Abstract:
Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layer of LVLMs. We demonstrate that the logit distributions of the first tokens contain sufficient information to determine whether to respond to the instructions, including recognizing unanswerable visual questions, defending against multi-modal jailbreaking attacks, and identifying deceptive questions. Such hidden knowledge is gradually lost in the logits of subsequent tokens during response generation. We then illustrate a simple decoding strategy at the generation of the first token that effectively improves the generated content. In experiments, we find a few interesting insights. First, the CLIP model already contains a strong signal for solving these tasks, indicating potential bias in the existing datasets. Second, we observe performance improvement by utilizing the first logit distributions on three additional tasks: indicating uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply finetuning LVLMs improves the models' performance but is still inferior to linear probing on these tasks.
Submitted 13 March, 2024;
originally announced March 2024.
-
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Authors:
Adam Ibrahim,
Benjamin Thérien,
Kshitij Gupta,
Mats L. Richter,
Quentin Anthony,
Timothée Lesort,
Eugene Belilovsky,
Irina Rish
Abstract:
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
Submitted 26 March, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Advancing Gene Selection in Oncology: A Fusion of Deep Learning and Sparsity for Precision Gene Selection
Authors:
Akhila Krishna,
Ravi Kant Gupta,
Pranav Jeevan,
Amit Sethi
Abstract:
Gene selection plays a pivotal role in oncology research for improving outcome prediction accuracy and facilitating cost-effective genomic profiling for cancer patients. This paper introduces two gene selection strategies for deep learning-based survival prediction models. The first strategy uses a sparsity-inducing method while the second one uses importance-based gene selection for identifying relevant genes. Our overall approach leverages the power of deep learning to model complex biological data structures, while sparsity-inducing methods ensure the selection process focuses on the most informative genes, minimizing noise and redundancy. Through comprehensive experimentation on diverse genomic and survival datasets, we demonstrate that our strategy not only identifies gene signatures with high predictive power for survival outcomes but also streamlines the process for low-cost genomic profiling. The implications of this research are profound as it offers a scalable and effective tool for advancing personalized medicine and targeted cancer therapies. By pushing the boundaries of gene selection methodologies, our work contributes significantly to the ongoing efforts in cancer genomics, promising improved diagnostic and prognostic capabilities in clinical settings.
Submitted 4 March, 2024;
originally announced March 2024.
-
Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization
Authors:
Han Guo,
Ramtin Hosseini,
Ruiyi Zhang,
Sai Ashish Somayajula,
Ranak Roy Chowdhury,
Rajesh K. Gupta,
Pengtao Xie
Abstract:
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning. It operates by randomly masking image patches and reconstructing these masked patches using the unmasked ones. A key limitation of MAE lies in its disregard for the varying informativeness of different patches, as it uniformly selects patches to mask. To overcome this, some approaches propose masking based on patch informativeness. However, these methods often do not consider the specific requirements of downstream tasks, potentially leading to suboptimal representations for these tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at: https://github.com/Alexiland/MLOMAE
Submitted 28 February, 2024;
originally announced February 2024.
-
Large Language Models for Time Series: A Survey
Authors:
Xiyuan Zhang,
Ranak Roy Chowdhury,
Rajesh K. Gupta,
Jingbo Shang
Abstract:
Large Language Models (LLMs) have seen significant use in domains such as natural language processing and computer vision. Going beyond text, image and graphics, LLMs present a significant potential for analysis of time series data, benefiting domains such as climate, IoT, healthcare, traffic, audio and finance. This survey paper provides an in-depth exploration and a detailed taxonomy of the various methodologies employed to harness the power of LLMs for time series analysis. We address the inherent challenge of bridging the gap between LLMs' original text data training and the numerical nature of time series data, and explore strategies for transferring and distilling knowledge from LLMs to numerical time series analysis. We detail various methodologies, including (1) direct prompting of LLMs, (2) time series quantization, (3) aligning techniques, (4) utilization of the vision modality as a bridging mechanism, and (5) the combination of LLMs with tools. Additionally, this survey offers a comprehensive overview of the existing multimodal time series and text datasets and delves into the challenges and future opportunities of this emerging field. We maintain an up-to-date GitHub repository which includes all the papers and datasets discussed in the survey.
Submitted 6 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Towards Optimal Feature-Shaping Methods for Out-of-Distribution Detection
Authors:
Qinyu Zhao,
Ming Xu,
Kartik Gupta,
Akshay Asthana,
Liang Zheng,
Stephen Gould
Abstract:
Feature shaping refers to a family of methods that exhibit state-of-the-art performance for out-of-distribution (OOD) detection. These approaches manipulate the feature representation, typically from the penultimate layer of a pre-trained deep learning model, so as to better differentiate between in-distribution (ID) and OOD samples. However, existing feature-shaping methods usually employ rules manually designed for specific model architectures and OOD datasets, which consequently limit their generalization ability. To address this gap, we first formulate an abstract optimization framework for studying feature-shaping methods. We then propose a concrete reduction of the framework with a simple piecewise constant shaping function and show that existing feature-shaping methods approximate the optimal solution to the concrete optimization problem. Further, assuming that OOD data is inaccessible, we propose a formulation that yields a closed-form solution for the piecewise constant shaping function, utilizing solely the ID data. Through extensive experiments, we show that the feature-shaping function optimized by our method improves the generalization ability of OOD detection across a large variety of datasets and model architectures.
Submitted 1 February, 2024;
originally announced February 2024.
-
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study
Authors:
W. Ronny Huang,
Cyril Allauzen,
Tongzhou Chen,
Kilol Gupta,
Ke Hu,
James Qin,
Yu Zhang,
Yongqiang Wang,
Shuo-Yiin Chang,
Tara N. Sainath
Abstract:
In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average…
▽ More
In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, fusion methodology. For instance, we explore the impact of LLM size ranging from 128M to 340B parameters on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.
Submitted 23 January, 2024;
originally announced January 2024.
-
On the Target Detection Performance of a Molecular Communication Network with Multiple Mobile Nanomachines
Authors:
Nithin V. Sabu,
Abhishek K. Gupta
Abstract:
A network of nanomachines (NMs) can be used to build a target detection system for a variety of promising applications. They have the potential to detect toxic chemicals, infectious bacteria, and biomarkers of dangerous diseases such as cancer within the human body. Many diseases and health disorders can be detected early and efficiently treated in the future by utilizing these systems. To fully grasp the potential of these systems, mathematical analysis is required. This paper describes an analytical framework for modeling and analyzing the performance of target detection systems composed of multiple mobile nanomachines of varying sizes with passive/absorbing boundaries. We consider both direct contact detection, in which NMs must physically contact the target to detect it, and indirect sensing, in which NMs must detect the marker molecules emitted by the target. The detection performance of such systems is calculated for degradable and non-degradable targets, as well as mobile and stationary targets. The derived expressions provide various insights, such as the effect of NM density and target degradation on detection probability.
Submitted 9 January, 2024;
originally announced January 2024.
-
How Robust are LLMs to In-Context Majority Label Bias?
Authors:
Karan Gupta,
Sumegh Roychowdhury,
Siva Rajesh Kasa,
Santhosh Kumar Kasa,
Anish Bhanushali,
Nikhil Pattisapu,
Prasanna Srinivasa Murthy
Abstract:
In the In-Context Learning (ICL) setup, various forms of label biases can manifest. One such manifestation is majority label bias, which arises when the distribution of labeled examples in the in-context samples is skewed towards one or more specific classes making Large Language Models (LLMs) more prone to predict those labels. Such discrepancies can arise from various factors, including logistical constraints, inherent biases in data collection methods, limited access to diverse data sources, etc. which are unavoidable in a real-world industry setup. In this work, we study the robustness of in-context learning in LLMs to shifts that occur due to majority label bias within the purview of text classification tasks. Prior works have shown that in-context learning with LLMs is susceptible to such biases. In our study, we go one level deeper and show that the robustness boundary varies widely for different models and tasks, with certain LLMs being highly robust (~90%) to majority label bias. Additionally, our findings also highlight the impact of model size and the richness of instructional prompts contributing towards model robustness. We restrict our study to only publicly available open-source models to ensure transparency and reproducibility.
Submitted 27 December, 2023;
originally announced December 2023.
-
A Learning oriented DLP System based on Classification Model
Authors:
Kishu Gupta,
Ashwani Kush
Abstract:
Data is the key asset for organizations, and data sharing is a lifeline for organizational growth, but it may lead to data loss. Data leakage is the most critical issue faced by organizations. To mitigate data leakage, data leakage prevention systems (DLPSs) are deployed at various levels by organizations. DLPSs are capable of protecting all kinds of data, i.e., DAR, DIM/DIT, and DIU. Statistical analysis, regular expressions, and data fingerprinting are common approaches used in DLP systems. Of these techniques, the statistical analysis approach is the most appropriate for the proposed DLP model of data security. This paper defines a statistical DLP model for document classification. The model uses various statistical approaches, such as TF-IDF (Term Frequency-Inverse Document Frequency), a renowned term count/weighting function, vectorization, and gradient boosting document classification, to classify documents before allowing any access to them. Machine learning is used to train and test the model. The proposed model also introduces an extremely efficient and more accurate approach, IGBCA (Improvised Gradient Boosting Classification Algorithm), for document classification, to protect documents from possible data leakage. Results show that the proposed model can classify documents with high accuracy, on the basis of which data can be protected from loss.
Submitted 21 December, 2023;
originally announced December 2023.
-
A Forecasting-Based DLP Approach for Data Security
Authors:
Kishu Gupta,
Ashwani Kush
Abstract:
Sensitive data leakage is a major and growing problem faced by enterprises in this technical era. Data leakage poses severe threats to an organization's data safety, which badly affects the organization's reputation. Data leakage is the flow of sensitive data/information from any data holder to an unauthorized destination. Data leak prevention (DLP) is a set of techniques that try to alleviate the threats that may hinder data security. DLP identifies the guilty user responsible for data leakage, ensures that users without appropriate permission cannot access sensitive data, and also protects sensitive data if it is shared accidentally. In this paper, a data leakage prevention (DLP) model is used to restrict or grant data access permission to users based on a forecast of their access to data. This study provides a DLP solution that uses statistical data analysis to forecast any user's future data access possibilities based on their past access to data. The proposed approach uses a well-known simple piecewise linear function to train the model. The results show that the proposed DLP approach can, with a high level of precision, correctly classify users even in cases of extreme data access.
Submitted 21 December, 2023;
originally announced December 2023.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
The Expert Knowledge combined with AI outperforms AI Alone in Seizure Onset Zone Localization using resting state fMRI
Authors:
Payal Kamboj,
Ayan Banerjee,
Varina L. Boerwinkle,
Sandeep K. S. Gupta
Abstract:
We evaluated whether integration of expert guidance on seizure onset zone (SOZ) identification from resting state functional MRI (rs-fMRI) connectomics combined with deep learning (DL) techniques enhances the SOZ delineation in patients with refractory epilepsy (RE), compared to utilizing DL alone. Rs-fMRI were collected from 52 children with RE who had subsequently undergone ic-EEG and then, if indicated, surgery for seizure control (n = 25). The resting state functional connectomics data were previously independently classified by two expert epileptologists, as indicative of measurement noise, typical resting state network connectivity, or SOZ. An expert knowledge integrated deep network was trained on functional connectomics data to identify SOZ. Expert knowledge integrated with DL showed a SOZ localization accuracy of 84.8% and an F1 score (the harmonic mean of positive predictive value and sensitivity) of 91.7%. Conversely, a DL only model yielded an accuracy of less than 50% (F1 score 63%). Activations that initiate in gray matter, extend through white matter and end in vascular regions are seen as the most discriminative expert identified SOZ characteristics. Integration of expert knowledge of functional connectomics can not only enhance the performance of DL in localizing SOZ in RE, but also lead toward potentially useful explanations of prevalent co-activation patterns in SOZ. RE with surgical outcomes and pre-operative rs-fMRI studies can yield expert knowledge most salient for SOZ identification.
Submitted 14 December, 2023;
originally announced December 2023.
-
EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS
Authors:
Sharath Girish,
Kamal Gupta,
Abhinav Shrivastava
Abstract:
Recently, 3D Gaussian splatting (3D-GS) has gained popularity in novel-view scene synthesis. It addresses the challenges of lengthy training times and slow rendering speeds associated with Neural Radiance Fields (NeRFs). Through rapid, differentiable rasterization of 3D Gaussians, 3D-GS achieves real-time rendering and accelerated training. These methods, however, demand substantial memory resources for both training and storage, as they require millions of Gaussians in their point cloud representation for each scene. We present a technique utilizing quantized embeddings to significantly reduce per-point memory storage requirements and a coarse-to-fine training strategy for a faster and more stable optimization of the Gaussian point clouds. Our approach develops a pruning stage which results in scene representations with fewer Gaussians, leading to faster training times and rendering speeds for real-time rendering of high resolution scenes. We reduce storage memory by more than an order of magnitude, all while preserving the reconstruction quality. We validate the effectiveness of our approach on a variety of datasets and scenes, preserving the visual quality while consuming 10-20x less memory and achieving faster training/inference speed. Project page and code are available at https://efficientgaussian.github.io
Submitted 24 April, 2024; v1 submitted 7 December, 2023;
originally announced December 2023.
-
Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks
Authors:
Kartik Gupta,
Akshay Asthana
Abstract:
Quantized networks use less computational and memory resources and are suitable for deployment on edge devices. While quantization-aware training (QAT) is the well-studied approach to quantize the networks at low precision, most research focuses on over-parameterized networks for classification, with limited studies on popular and edge-device-friendly single-shot object detection and semantic segmentation methods like YOLO. Moreover, the majority of QAT methods rely on the Straight-through Estimator (STE) approximation, which suffers from an oscillation phenomenon resulting in sub-optimal network quantization. In this paper, we show that it is difficult to achieve extremely low precision (4-bit and lower) for efficient YOLO models even with SOTA QAT methods due to the oscillation issue, and that existing methods to overcome this problem are not effective on these models. To mitigate the effect of oscillation, we first propose an Exponentially Moving Average (EMA) based update to the QAT model. Further, we propose a simple QAT correction method, namely QC, that takes only a single epoch of training after the standard QAT procedure to correct the error induced by oscillating weights and activations, resulting in a more accurate quantized model. With extensive evaluation on the COCO dataset using various YOLO5 and YOLO7 variants, we show that our correction method improves quantized YOLO networks consistently on both object detection and segmentation tasks at low precision (4-bit and 3-bit).
Submitted 8 November, 2023;
originally announced November 2023.
-
Tackling Concept Shift in Text Classification using Entailment-style Modeling
Authors:
Sumegh Roychowdhury,
Karan Gupta,
Siva Rajesh Kasa,
Prasanna Srinivasa Murthy,
Alok Chandra
Abstract:
Pre-trained language models (PLMs) have seen tremendous success in text classification (TC) problems in the context of Natural Language Processing (NLP). In many real-world text classification tasks, the class definitions being learned do not remain constant but rather change with time - this is known as Concept Shift. Most techniques for handling concept shift rely on retraining the old classifiers with the newly labelled data. However, given the amount of training data required to fine-tune large DL models for the new concepts, the associated labelling costs can be prohibitively expensive and time consuming. In this work, we propose a reformulation, converting vanilla classification into an entailment-style problem that requires significantly less data to re-train the text classifier to adapt to new concepts. We demonstrate the effectiveness of our proposed method on both real-world and synthetic datasets, achieving absolute F1 gains of up to 7% and 40% respectively in few-shot settings. Further, upon deployment, our solution also helped save 75% of labeling costs overall.
Submitted 6 November, 2023;
originally announced November 2023.
-
Hierarchical Optimization-based Control for Whole-body Loco-manipulation of Heavy Objects
Authors:
Alberto Rigo,
Muqun Hu,
Satyandra K. Gupta,
Quan Nguyen
Abstract:
In recent years, the field of legged robotics has seen growing interest in enhancing the capabilities of these robots through the integration of articulated robotic arms. However, achieving successful loco-manipulation, especially involving interaction with heavy objects, is far from straightforward, as object manipulation can introduce substantial disturbances that impact the robot's locomotion. This paper presents a novel framework for legged loco-manipulation that considers whole-body coordination through a hierarchical optimization-based control framework. First, an online manipulation planner computes the manipulation forces and manipulated object task-based reference trajectory. Then, pose optimization aligns the robot's trajectory with kinematic constraints. The resultant robot reference trajectory is executed via a linear MPC controller incorporating the desired manipulation forces into its prediction model. Our approach has been validated in simulation and hardware experiments, highlighting the necessity of whole-body optimization compared to the baseline locomotion MPC when interacting with heavy objects. Experimental results with Unitree Aliengo, equipped with a custom-made robotic arm, showcase its ability to lift and carry an 8kg payload and manipulate doors.
Submitted 19 March, 2024; v1 submitted 31 October, 2023;
originally announced November 2023.
-
On the Classification of Weierstrass Elliptic Curves over $\mathbb{Z}_n$
Authors:
Param Parekh,
Paavan Parekh,
Sourav Deb,
Manish K Gupta
Abstract:
The development of secure cryptographic protocols and the subsequent attack mechanisms have been placed in the literature with the utmost curiosity.
While sophisticated quantum attacks bring a concern to the classical cryptographic protocols present in the applications used in everyday life, the necessity of developing post-quantum protocols is felt primarily.
In post-quantum cryptography, elliptic curve-based protocols are of great interest to researchers.
While the comprehensive study of elliptic curves over finite fields is well known, the extended study over finite rings is still missing.
In this work, we generalize the study of Weierstrass elliptic curves over the finite ring $\mathbb{Z}_n$ through classification.
Several expressions to compute critical factors in studying elliptic curves are conferred.
An all-around computational classification on the Weierstrass elliptic curves over $\mathbb{Z}_n$ for rigorous understanding is also attached to this work.
Submitted 18 October, 2023;
originally announced October 2023.
-
Detection of Malicious DNS-over-HTTPS Traffic: An Anomaly Detection Approach using Autoencoders
Authors:
Sergio Salinas Monroy,
Aman Kumar Gupta,
Garrett Wahlstedt
Abstract:
To maintain the privacy of users' web browsing history, popular browsers encrypt their DNS traffic using the DNS-over-HTTPS (DoH) protocol. Unfortunately, encrypting DNS packets prevents many existing intrusion detection systems from using plaintext domain names to detect malicious traffic. In this paper, we design an autoencoder that is capable of detecting malicious DNS traffic by only observing the encrypted DoH traffic. Compared to previous works, the proposed autoencoder looks for anomalies in DoH traffic, and thus can detect malicious traffic that has not been previously observed, i.e., zero-day attacks. We run extensive experiments to evaluate the performance of our proposed autoencoder and compare it to that of other anomaly detection algorithms, namely, local outlier factor, one-class support vector machine, isolation forest, and variational autoencoders. We find that our proposed autoencoder achieves the highest detection performance, with a median F1 score of 99% over several types of malicious traffic.
Submitted 17 October, 2023;
originally announced October 2023.
-
Combining Datasets with Different Label Sets for Improved Nucleus Segmentation and Classification
Authors:
Amruta Parulekar,
Utkarsh Kanwat,
Ravi Kant Gupta,
Medha Chippa,
Thomas Jacob,
Tripti Bameta,
Swapnil Rane,
Amit Sethi
Abstract:
Segmentation and classification of cell nuclei in histopathology images using deep neural networks (DNNs) can save pathologists' time for diagnosing various diseases, including cancers, by automating cell counting and morphometric assessments. It is now well-known that the accuracy of DNNs increases with the sizes of annotated datasets available for training. Although multiple datasets of histopathology images with nuclear annotations and class labels have been made publicly available, the set of class labels differs across these datasets. We propose a method to train DNNs for instance segmentation and classification on multiple datasets where the sets of classes across the datasets are related but not the same. Specifically, our method is designed to utilize a coarse-to-fine class hierarchy, where the set of classes labeled and annotated in a dataset can be at any level of the hierarchy, as long as the classes are mutually exclusive. Within a dataset, the classes need not even be at the same level of the class hierarchy tree. Our results demonstrate that segmentation and classification metrics for the class set used by the test split of a dataset can improve by pre-training on another dataset that may even have a different set of classes, due to the expansion of the training set enabled by our method. Furthermore, generalization to previously unseen datasets also improves by combining multiple other datasets with different sets of classes for training. The improvement is both qualitative and quantitative. The proposed method can be adapted for various loss functions, DNN architectures, and application domains.
Submitted 5 October, 2023;
originally announced October 2023.
-
Misusing Tools in Large Language Models With Visual Adversarial Examples
Authors:
Xiaohan Fu,
Zihan Wang,
Shuheng Li,
Rajesh K. Gupta,
Niloofar Mireshghallah,
Taylor Berg-Kirkpatrick,
Earlence Fernandes
Abstract:
Large Language Models (LLMs) are being enhanced with the ability to use tools and to process multiple modalities. These new capabilities bring new benefits and also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels. Different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while being stealthy and generalizable to multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. We find that our adversarial images can manipulate the LLM to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 SSIM). Furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (and its semantics) between the user and the LLM.
Submitted 4 October, 2023;
originally announced October 2023.
-
EvDNeRF: Reconstructing Event Data with Dynamic Neural Radiance Fields
Authors:
Anish Bhattacharya,
Ratnesh Madaan,
Fernando Cladera,
Sai Vemprala,
Rogerio Bonatti,
Kostas Daniilidis,
Ashish Kapoor,
Vijay Kumar,
Nikolai Matni,
Jayesh K. Gupta
Abstract:
We present EvDNeRF, a pipeline for generating event data and training an event-based dynamic NeRF, for the purpose of faithfully reconstructing eventstreams on scenes with rigid and non-rigid deformations that may be too fast to capture with a standard camera. Event cameras register asynchronous per-pixel brightness changes at MHz rates with high dynamic range, making them ideal for observing fast motion with almost no motion blur. Neural radiance fields (NeRFs) offer visual-quality geometric-based learnable rendering, but prior work with events has only considered reconstruction of static scenes. Our EvDNeRF can predict eventstreams of dynamic scenes from a static or moving viewpoint between any desired timestamps, thereby allowing it to be used as an event-based simulator for a given scene. We show that by training on varied batch sizes of events, we can improve test-time predictions of events at fine time resolutions, outperforming baselines that pair standard dynamic NeRFs with event generators. We release our simulated and real datasets, as well as code for multi-view event-based data generation and the training and evaluation of EvDNeRF models (https://github.com/anish-bhattacharya/EvDNeRF).
Submitted 6 December, 2023; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Toward Scalable Visual Servoing Using Deep Reinforcement Learning and Optimal Control
Authors:
Salar Asayesh,
Hossein Sheikhi Darani,
Mo Chen,
Mehran Mehrandezh,
Kamal Gupta
Abstract:
Classical pixel-based Visual Servoing (VS) approaches offer high accuracy but suffer from a limited convergence area due to optimization nonlinearity. Modern deep learning-based VS methods overcome traditional vision issues but lack scalability, requiring training on limited scenes. This paper proposes a hybrid VS strategy utilizing Deep Reinforcement Learning (DRL) and optimal control to enhance both convergence area and scalability. The DRL component of our approach separately handles representation and policy learning to enhance scalability, generalizability, and learning efficiency, and to ease domain adaptation. Moreover, the optimal control part ensures high end-point accuracy. Our method showcases remarkable achievements in terms of high convergence rates and minimal end-positioning errors using a 7-DOF manipulator. Importantly, it exhibits scalability across more than 1000 distinct scenes. Furthermore, we demonstrate its capacity for generalization to previously unseen datasets. Lastly, we illustrate the real-world applicability of our approach, highlighting its adaptability through single-shot domain transfer learning in environments with noise and occlusions. Real-robot experiments can be found at https://sites.google.com/view/vsls.
Submitted 2 October, 2023;
originally announced October 2023.