Search | arXiv e-print repository

PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Authors: Trang Le, Daniel Lazar, Suyoun Kim, Shan Jiang, Duc Le, Adithya Sagar, Aleksandr Livshits, Ahmed Aly, Akshat Shrivastava

Abstract: Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2405.17184 [pdf, other]

A Pioneering Roadmap for ML-Driven Algorithmic Advancements in Electrical Networks

Authors: Jochen L. Cremer, Adrian Kelly, Ricardo J. Bessa, Milos Subasic, Panagiotis N. Papadopoulos, Samuel Young, Amar Sagar, Antoine Marot

Abstract: To advance control, operation and planning tools of electrical networks with ML is not straightforward. 110 experts were surveyed showing where and how ML algorithms could advance. This paper assesses this survey and research environment. Then it develops an innovation roadmap that helps align our research community towards a goal-oriented realisation of the opportunities that AI upholds. This paper finds that the R&D environment of system operators (and the surrounding research ecosystem) needs adaptation to enable faster developments with AI while maintaining high testing quality and safety. This roadmap may interest research centre managers in system operators, academics, and labs dedicated to advancing the next generation of tooling for electrical networks.

Submitted 28 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Comments: 5 pages

arXiv:2207.10643 [pdf, other]

STOP: A dataset for Spoken Task Oriented Semantic Parsing

Authors: Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. However, the limited number of public audio datasets with semantic parse labels hinders the research progress in this area. In this paper, we release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available. Additionally, we define low-resource splits to establish a benchmark for improving SLU when limited labeled data is available. Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems. Initial experimentation shows end-to-end SLU models performing slightly worse than their cascaded counterparts, which we hope encourages future work in this direction.

Submitted 18 October, 2022; v1 submitted 28 June, 2022; originally announced July 2022.

arXiv:2205.12705 [pdf]

COVID-19 Severity Classification on Chest X-ray Images

Authors: Aditi Sagar, Aman Swaraj, Karan Verma

Abstract: Biomedical imaging analysis combined with artificial intelligence (AI) methods has proven to be quite valuable in order to diagnose COVID-19. So far, various classification models have been used for diagnosing COVID-19. However, classification of patients based on their severity level is not yet analyzed. In this work, we classify covid images based on the severity of the infection. First, we pre-process the X-ray images using a median filter and histogram equalization. Enhanced X-ray images are then augmented using SMOTE technique for achieving a balanced dataset. Pre-trained Resnet50, VGG16 model and SVM classifier are then used for feature extraction and classification. The result of the classification model confirms that compared with the alternatives, with chest X-Ray images, the ResNet-50 model produced remarkable classification results in terms of accuracy (95%), recall (0.94), F1-Score (0.92), and precision (0.91).

Submitted 25 May, 2022; originally announced May 2022.

arXiv:2201.05975 [pdf]

IRHA: An Intelligent RSSI based Home automation System

Authors: Samsil Arefin Mozumder, A S M Sharifuzzaman Sagar

Abstract: Human existence is getting more sophisticated and better in many areas due to remarkable advances in the fields of automation. Automated systems are favored over manual ones in the current environment. Home Automation is becoming more popular in this scenario, as people are drawn to the concept of a home environment that can automatically satisfy users' requirements. The key challenges in an intelligent home are intelligent decision making, location-aware service, and compatibility for all users of different ages and physical conditions. Existing solutions address just one or two of these challenges, but smart home automation that is robust, intelligent, location-aware, and predictive is needed to satisfy the user's demand. This paper presents a location-aware intelligent RSSI-based home automation system (IRHA) that uses Wi-Fi signals to detect the user's location and control the appliances automatically. The fingerprinting method is used to map the Wi-Fi signals for different rooms, and the machine learning method, such as Decision Tree, is used to classify the signals for different rooms. The machine learning models are then implemented in the ESP32 microcontroller board to classify the rooms based on the real-time Wi-Fi signal, and then the result is sent to the main control board through the ESP32 MAC communication protocol to control the appliances automatically. The proposed method has achieved 97% accuracy in classifying the users' location.

Submitted 16 January, 2022; originally announced January 2022.

Comments: This article is submitted to the 2nd International Conference on Ubiquitous Computing and Intelligent Information Systems for possible presentation

arXiv:2201.05920 [pdf, other]

ViTBIS: Vision Transformer for Biomedical Image Segmentation

Authors: Abhinav Sagar

Abstract: In this paper, we propose a novel network named Vision Transformer for Biomedical Image Segmentation (ViTBIS). Our network splits the input feature maps into three parts with $1\times 1$, $3\times 3$ and $5\times 5$ convolutions in both encoder and decoder. Concat operator is used to merge the features before being fed to three consecutive transformer blocks with attention mechanism embedded inside it. Skip connections are used to connect encoder and decoder transformer blocks. Similarly, transformer blocks and multi scale architecture is used in decoder before being linearly projected to produce the output segmentation map. We test the performance of our network using Synapse multi-organ segmentation dataset, Automated cardiac diagnosis challenge dataset, Brain tumour MRI segmentation dataset and Spleen CT segmentation dataset. Without bells and whistles, our network outperforms most of the previous state of the art CNN and transformer based models using Dice score and the Hausdorff distance as the evaluation metrics.

Submitted 15 January, 2022; originally announced January 2022.

Comments: Published at Clinical Image-Based Procedures, Distributed and Collaborative Learning, Artificial Intelligence for Combating COVID-19 and Secure and Privacy-Preserving Machine Learning workshop at MICCAI 2021

Journal ref: Springer, Cham 2021

arXiv:2108.04349

AASeg: Attention Aware Network for Real Time Semantic Segmentation

Authors: Abhinav Sagar

Abstract: In this paper, we present a new network named Attention Aware Network (AASeg) for real time semantic image segmentation. Our network incorporates spatial and channel information using Spatial Attention (SA) and Channel Attention (CA) modules respectively. It also uses dense local multi-scale context information using Multi Scale Context (MSC) module. The feature maps are concatenated individually to produce the final segmentation map. We demonstrate the effectiveness of our method using a comprehensive analysis, quantitative experimental results and ablation study using Cityscapes, ADE20K and Camvid datasets. Our network performs better than most previous architectures with a 74.4% Mean IOU on Cityscapes test dataset while running at 202.7 FPS.

Submitted 14 May, 2022; v1 submitted 27 July, 2021; originally announced August 2021.

Comments: This work makes assumptions which were found wrong later by the author

arXiv:2008.10399

Generate High Resolution Images With Generative Variational Autoencoder

Authors: Abhinav Sagar

Abstract: In this work, we present a novel neural network to generate high resolution images. We replace the decoder of VAE with a discriminator while using the encoder as it is. The encoder is fed data from a normal distribution while the generator is fed from a gaussian distribution. The combination from both is given to a discriminator which tells whether the generated image is correct or not. We evaluate our network on 3 different datasets: MNIST, LSUN and CelebA dataset. Our network beats the previous state of the art using MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence as the evaluation metrics while generating much sharper images. This work is potentially very exciting as we are able to combine the advantages of generative models and inference models in a principled bayesian manner.

Submitted 21 June, 2021; v1 submitted 12 August, 2020; originally announced August 2020.

Comments: The network architecture used in this paper while training the model is not correct

arXiv:2008.09646

HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN

Authors: Abhinav Sagar

Abstract: In this paper, we present a novel network for high resolution video generation. Our network uses ideas from Wasserstein GANs by enforcing k-Lipschitz constraint on the loss term and Conditional GANs using class labels for training and testing. We present Generator and Discriminator network layerwise details along with the combined network architecture, optimization details and algorithm used in this work. Our network uses a combination of two loss terms: mean square pixel loss and an adversarial loss. The datasets used for training and testing our network are UCF101, Golf and Aeroplane Datasets. Using Inception Score and Fréchet Inception Distance as the evaluation metrics, our network outperforms previous state of the art networks on unsupervised video generation.

Submitted 12 July, 2021; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: The design of the neural network was based on assumptions which were found to be wrong

arXiv:2008.07588 [pdf, other]

Uncertainty Quantification using Variational Inference for Biomedical Image Segmentation

Authors: Abhinav Sagar

Abstract: Deep learning motivated by convolutional neural networks has been highly successful in a range of medical imaging problems like image classification, image segmentation, image synthesis etc. However for validation and interpretability, not only do we need the predictions made by the model but also how confident it is while making those predictions. This is important in safety critical applications for the people to accept it. In this work, we used an encoder decoder architecture based on variational inference techniques for segmenting brain tumour images. We evaluate our work on the publicly available BRATS dataset using Dice Similarity Coefficient (DSC) and Intersection Over Union (IOU) as the evaluation metrics. Our model is able to segment brain tumours while taking into account both aleatoric uncertainty and epistemic uncertainty in a principled bayesian manner.

Submitted 10 August, 2021; v1 submitted 12 August, 2020; originally announced August 2020.

Comments: 11 pages, 4 figures

arXiv:2006.01250

RUHSNet: 3D Object Detection Using Lidar Data in Real Time

Authors: Abhinav Sagar

Abstract: In this work, we address the problem of 3D object detection from point cloud data in real time. For autonomous vehicles to work, it is very important for the perception component to detect the real world objects with both high accuracy and fast inference. We propose a novel neural network architecture along with the training and optimization details for detecting 3D objects in point cloud data. We compare the results with different backbone architectures including the standard ones like VGG, ResNet, Inception with our backbone. Also we present the optimization and ablation studies including designing an efficient anchor. We use the Kitti 3D Birds Eye View dataset for benchmarking and validating our results. Our work surpasses the state of the art in this domain both in terms of average precision and speed running at > 30 FPS. This makes it a feasible option to be deployed in real time applications including self driving cars.

Submitted 21 June, 2021; v1 submitted 9 May, 2020; originally announced June 2020.

Comments: The results in this paper are not correct, as assumptions used while designing the network were found to be wrong

arXiv:2001.10822 [pdf, other]

Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

Authors: Pranay Dighe, Saurabh Adya, Nuoyu Li, Srikanth Vishnubhotla, Devang Naik, Adithya Sagar, Ying Ma, Stephen Pulman, Jason Williams

Abstract: Voice-triggered smart assistants often rely on detection of a trigger-phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant. In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices using graph neural networks (GNN). The proposed approach uses the fact that decoding lattice of a falsely triggered audio exhibits uncertainties in terms of many alternative paths and unexpected words on the lattice arcs as compared to the lattice of a correctly triggered audio. A pure trigger-phrase detector model doesn't fully utilize the intent of the user speech whereas by using the complete decoding lattice of user audio, we can effectively mitigate speech not intended for the smart assistant. We deploy two variants of GNNs in this paper based on 1) graph convolution layers and 2) self-attention mechanism respectively. Our experiments demonstrate that GNNs are highly accurate in FTM task by mitigating ~87% of false triggers at 99% true positive rate (TPR). Furthermore, the proposed models are fast to train and efficient in parameter requirements.

Submitted 24 January, 2020; originally announced January 2020.

Showing 1–12 of 12 results for author: Sagar, A