-
PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding
Authors:
Trang Le,
Daniel Lazar,
Suyoun Kim,
Shan Jiang,
Duc Le,
Adithya Sagar,
Aleksandr Livshits,
Ahmed Aly,
Akshat Shrivastava
Abstract:
Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation; however, these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.
Submitted 11 June, 2024;
originally announced June 2024.
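The PRoDeliberation abstract refers to a Connectionist Temporal Classification (CTC) based decoding strategy. As a rough illustration of why CTC output can be produced in parallel, the sketch below shows the standard greedy CTC collapse rule (merge repeated labels, then drop blanks). It is a generic textbook sketch, not the authors' implementation; the function name `ctc_greedy_collapse` and the choice of `0` as the blank id are assumptions made for this example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank=0):
    """Collapse per-frame argmax label ids into an output label sequence.

    `frame_ids` holds one label id per frame; all frames can be predicted
    in parallel, and only this cheap post-processing step is sequential.
    The blank id of 0 is an assumption for this sketch.
    """
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol, which separates genuine repeats.
    return [label for label in merged if label != blank]


frames = [0, 1, 1, 0, 1, 2, 2, 0]
print(ctc_greedy_collapse(frames))  # [1, 1, 2]
```

The blank between the two runs of `1` is what lets the output keep a repeated label rather than merging it away.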
-
A Pioneering Roadmap for ML-Driven Algorithmic Advancements in Electrical Networks
Authors:
Jochen L. Cremer,
Adrian Kelly,
Ricardo J. Bessa,
Milos Subasic,
Panagiotis N. Papadopoulos,
Samuel Young,
Amar Sagar,
Antoine Marot
Abstract:
Advancing control, operation and planning tools for electrical networks with ML is not straightforward. A survey of 110 experts showed where and how ML algorithms could advance. This paper assesses the survey and the research environment. It then develops an innovation roadmap that helps align our research community towards a goal-oriented realisation of the opportunities that AI offers. This paper finds that the R&D environment of system operators (and the surrounding research ecosystem) needs adaptation to enable faster developments with AI while maintaining high testing quality and safety. This roadmap may interest research centre managers in system operators, academics, and labs dedicated to advancing the next generation of tooling for electrical networks.
Submitted 28 May, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
STOP: A dataset for Spoken Task Oriented Semantic Parsing
Authors:
Paden Tomasello,
Akshat Shrivastava,
Daniel Lazar,
Po-Chun Hsu,
Duc Le,
Adithya Sagar,
Ali Elkahky,
Jade Copet,
Wei-Ning Hsu,
Yossi Adi,
Robin Algayres,
Tu Ahn Nguyen,
Emmanuel Dupoux,
Luke Zettlemoyer,
Abdelrahman Mohamed
Abstract:
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. However, the limited number of public audio datasets with semantic parse labels hinders the research progress in this area. In this paper, we release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available. Additionally, we define low-resource splits to establish a benchmark for improving SLU when limited labeled data is available. Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems. Initial experiments show end-to-end SLU models performing slightly worse than their cascaded counterparts, which we hope encourages future work in this direction.
Submitted 18 October, 2022; v1 submitted 28 June, 2022;
originally announced July 2022.
-
COVID-19 Severity Classification on Chest X-ray Images
Authors:
Aditi Sagar,
Aman Swaraj,
Karan Verma
Abstract:
Biomedical imaging analysis combined with artificial intelligence (AI) methods has proven to be quite valuable for diagnosing COVID-19. So far, various classification models have been used for diagnosing COVID-19. However, classification of patients based on their severity level has not yet been analyzed. In this work, we classify COVID-19 images based on the severity of the infection. First, we pre-process the X-ray images using a median filter and histogram equalization. Enhanced X-ray images are then augmented using the SMOTE technique to achieve a balanced dataset. Pre-trained ResNet50 and VGG16 models and an SVM classifier are then used for feature extraction and classification. The results confirm that, compared with the alternatives, the ResNet-50 model produced remarkable classification results on chest X-ray images in terms of accuracy (95%), recall (0.94), F1-score (0.92), and precision (0.91).
Submitted 25 May, 2022;
originally announced May 2022.
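The precision, recall and F1-score quoted in the COVID-19 severity abstract are linked by the usual harmonic-mean definition of F1. The short check below assumes that standard definition applies to the reported figures (the abstract does not say how the scores were averaged across classes); `f1_score` is a helper name chosen for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract: precision 0.91, recall 0.94.
f1 = f1_score(0.91, 0.94)
print(round(f1, 2))  # 0.92
```

Under that assumption the reported F1-score of 0.92 is consistent with the reported precision and recall.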
-
IRHA: An Intelligent RSSI based Home automation System
Authors:
Samsil Arefin Mozumder,
A S M Sharifuzzaman Sagar
Abstract:
Human existence is getting more sophisticated and better in many areas due to remarkable advances in the fields of automation. Automated systems are favored over manual ones in the current environment. Home Automation is becoming more popular in this scenario, as people are drawn to the concept of a home environment that can automatically satisfy users' requirements. The key challenges in an intelligent home are intelligent decision making, location-aware service, and compatibility for all users of different ages and physical conditions. Existing solutions address just one or two of these challenges, but smart home automation that is robust, intelligent, location-aware, and predictive is needed to satisfy the user's demand. This paper presents a location-aware intelligent RSSI-based home automation system (IRHA) that uses Wi-Fi signals to detect the user's location and control the appliances automatically. The fingerprinting method is used to map the Wi-Fi signals for different rooms, and a machine learning method, such as a Decision Tree, is used to classify the signals for different rooms. The machine learning models are then implemented on the ESP32 microcontroller board to classify the rooms based on the real-time Wi-Fi signal, and the result is then sent to the main control board through the ESP32 MAC communication protocol to control the appliances automatically. The proposed method has achieved 97% accuracy in classifying the user's location.
Submitted 16 January, 2022;
originally announced January 2022.
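The IRHA abstract describes mapping Wi-Fi signal strengths to rooms by fingerprinting and then classifying live readings. The sketch below illustrates the fingerprinting idea only. To stay dependency-free it swaps the Decision Tree named in the abstract for a simpler nearest-fingerprint lookup, and every room name and RSSI value in it is hypothetical.

```python
import math

# Hypothetical fingerprints: mean RSSI (dBm) seen from three access points
# while standing in each room. These numbers are made up for illustration.
FINGERPRINTS = {
    "kitchen": (-45.0, -70.0, -80.0),
    "bedroom": (-75.0, -50.0, -65.0),
    "living_room": (-60.0, -65.0, -48.0),
}


def classify_room(rssi, fingerprints=FINGERPRINTS):
    """Return the room whose stored fingerprint is closest to `rssi`.

    Closeness is Euclidean distance in RSSI space; this is a stand-in for
    the trained classifier described in the abstract, not a copy of it.
    """
    return min(fingerprints, key=lambda room: math.dist(rssi, fingerprints[room]))


print(classify_room((-47.0, -72.0, -78.0)))  # kitchen
```

A lookup like this is small enough to run on a microcontroller, which is the same constraint that motivates a compact model in the paper's setting.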
-
ViTBIS: Vision Transformer for Biomedical Image Segmentation
Authors:
Abhinav Sagar
Abstract:
In this paper, we propose a novel network named Vision Transformer for Biomedical Image Segmentation (ViTBIS). Our network splits the input feature maps into three parts with $1\times 1$, $3\times 3$ and $5\times 5$ convolutions in both encoder and decoder. A concat operator is used to merge the features before they are fed to three consecutive transformer blocks with an attention mechanism embedded inside them. Skip connections are used to connect encoder and decoder transformer blocks. Similarly, transformer blocks and a multi-scale architecture are used in the decoder before the features are linearly projected to produce the output segmentation map. We test the performance of our network using the Synapse multi-organ segmentation dataset, the Automated cardiac diagnosis challenge dataset, the Brain tumour MRI segmentation dataset and the Spleen CT segmentation dataset. Without bells and whistles, our network outperforms most of the previous state-of-the-art CNN and transformer based models using Dice score and the Hausdorff distance as the evaluation metrics.
Submitted 15 January, 2022;
originally announced January 2022.
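The ViTBIS abstract evaluates segmentation with the Dice score. For reference, a minimal sketch of the Dice score on binary masks follows. It uses the common definition 2|A∩B| / (|A| + |B|); the function name and the convention of returning 1.0 when both masks are empty are assumptions for this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice score between two binary masks given as nested lists of 0/1."""
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
print(round(dice_score(pred, target), 3))  # 0.667
```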
-
AASeg: Attention Aware Network for Real Time Semantic Segmentation
Authors:
Abhinav Sagar
Abstract:
In this paper, we present a new network named Attention Aware Network (AASeg) for real time semantic image segmentation. Our network incorporates spatial and channel information using Spatial Attention (SA) and Channel Attention (CA) modules respectively. It also uses dense local multi-scale context information using a Multi Scale Context (MSC) module. The feature maps are concatenated individually to produce the final segmentation map. We demonstrate the effectiveness of our method using a comprehensive analysis, quantitative experimental results and an ablation study using the Cityscapes, ADE20K and CamVid datasets. Our network performs better than most previous architectures with a 74.4% Mean IOU on the Cityscapes test dataset while running at 202.7 FPS.
Submitted 14 May, 2022; v1 submitted 27 July, 2021;
originally announced August 2021.
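The AASeg abstract reports accuracy as Mean IOU. As a reminder of what that metric measures, here is a minimal sketch that computes per-class intersection over union on flat label lists and averages over the classes that occur. Benchmark implementations typically accumulate counts over a whole dataset before averaging; that detail, the function name and the toy labels are assumptions of this sketch.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        # Skip classes absent from both prediction and ground truth.
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(round(mean_iou(pred, target, num_classes=2), 3))  # 0.583
```

In the toy example class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.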
-
Generate High Resolution Images With Generative Variational Autoencoder
Authors:
Abhinav Sagar
Abstract:
In this work, we present a novel neural network to generate high resolution images. We replace the decoder of VAE with a discriminator while using the encoder as is. The encoder is fed data from a normal distribution while the generator is fed from a Gaussian distribution. The combination from both is given to a discriminator which tells whether the generated image is correct or not. We evaluate our network on three different datasets: MNIST, LSUN and CelebA. Our network beats the previous state of the art using MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence as the evaluation metrics while generating much sharper images. This work is potentially very exciting as we are able to combine the advantages of generative models and inference models in a principled Bayesian manner.
Submitted 21 June, 2021; v1 submitted 12 August, 2020;
originally announced August 2020.
-
HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN
Authors:
Abhinav Sagar
Abstract:
In this paper, we present a novel network for high resolution video generation. Our network uses ideas from Wasserstein GANs by enforcing a k-Lipschitz constraint on the loss term and from Conditional GANs by using class labels for training and testing. We present Generator and Discriminator network layerwise details along with the combined network architecture, optimization details and algorithm used in this work. Our network uses a combination of two loss terms: mean square pixel loss and an adversarial loss. The datasets used for training and testing our network are the UCF101, Golf and Aeroplane datasets. Using Inception Score and Fréchet Inception Distance as the evaluation metrics, our network outperforms previous state-of-the-art networks on unsupervised video generation.
Submitted 12 July, 2021; v1 submitted 17 August, 2020;
originally announced August 2020.
-
Uncertainty Quantification using Variational Inference for Biomedical Image Segmentation
Authors:
Abhinav Sagar
Abstract:
Deep learning motivated by convolutional neural networks has been highly successful in a range of medical imaging problems such as image classification, image segmentation and image synthesis. However, for validation and interpretability, we need not only the predictions made by the model but also how confident it is while making those predictions. This is important for people to accept it in safety critical applications. In this work, we used an encoder decoder architecture based on variational inference techniques for segmenting brain tumour images. We evaluate our work on the publicly available BRATS dataset using Dice Similarity Coefficient (DSC) and Intersection Over Union (IOU) as the evaluation metrics. Our model is able to segment brain tumours while taking into account both aleatoric uncertainty and epistemic uncertainty in a principled Bayesian manner.
Submitted 10 August, 2021; v1 submitted 12 August, 2020;
originally announced August 2020.
-
RUHSNet: 3D Object Detection Using Lidar Data in Real Time
Authors:
Abhinav Sagar
Abstract:
In this work, we address the problem of 3D object detection from point cloud data in real time. For autonomous vehicles to work, it is very important for the perception component to detect real world objects with both high accuracy and fast inference. We propose a novel neural network architecture along with the training and optimization details for detecting 3D objects in point cloud data. We compare our backbone with different backbone architectures, including standard ones like VGG, ResNet and Inception. We also present the optimization and ablation studies, including designing an efficient anchor. We use the KITTI 3D Bird's Eye View dataset for benchmarking and validating our results. Our work surpasses the state of the art in this domain in terms of both average precision and speed, running at more than 30 FPS. This makes it a feasible option to be deployed in real time applications including self driving cars.
Submitted 21 June, 2021; v1 submitted 9 May, 2020;
originally announced June 2020.
-
Lattice-based Improvements for Voice Triggering Using Graph Neural Networks
Authors:
Pranay Dighe,
Saurabh Adya,
Nuoyu Li,
Srikanth Vishnubhotla,
Devang Naik,
Adithya Sagar,
Ying Ma,
Stephen Pulman,
Jason Williams
Abstract:
Voice-triggered smart assistants often rely on detection of a trigger-phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant. In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices using graph neural networks (GNN). The proposed approach uses the fact that the decoding lattice of falsely triggered audio exhibits uncertainties in terms of many alternative paths and unexpected words on the lattice arcs as compared to the lattice of correctly triggered audio. A pure trigger-phrase detector model does not fully utilize the intent of the user speech, whereas by using the complete decoding lattice of user audio, we can effectively mitigate speech not intended for the smart assistant. We deploy two variants of GNNs in this paper based on 1) graph convolution layers and 2) self-attention mechanism respectively. Our experiments demonstrate that GNNs are highly accurate on the FTM task, mitigating ~87% of false triggers at 99% true positive rate (TPR). Furthermore, the proposed models are fast to train and efficient in parameter requirements.
Submitted 24 January, 2020;
originally announced January 2020.
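The voice-triggering abstract states its headline result as the share of false triggers mitigated at a fixed true positive rate. The sketch below shows one plausible way such an operating point could be computed from classifier scores. The scoring convention (higher means more likely a genuine trigger), the function name and the toy scores are all assumptions for illustration and are not taken from the paper.

```python
import math


def mitigation_rate_at_tpr(true_scores, false_scores, target_tpr):
    """Pick a threshold that keeps at least `target_tpr` of true triggers,
    then report the fraction of false triggers scoring below it.

    Assumes higher scores mean "genuine trigger" and 0 < target_tpr <= 1.
    """
    ranked = sorted(true_scores)
    # Number of true triggers we may reject while still meeting the target.
    allowed_rejects = math.floor((1.0 - target_tpr) * len(ranked))
    allowed_rejects = min(allowed_rejects, len(ranked) - 1)
    threshold = ranked[allowed_rejects]
    # A false trigger is mitigated when its score falls below the threshold.
    mitigated = sum(1 for s in false_scores if s < threshold)
    return threshold, mitigated / len(false_scores)


# Hypothetical scores for four genuine and four false triggers.
true_scores = [0.9, 0.8, 0.7, 0.6]
false_scores = [0.1, 0.65, 0.3, 0.5]
print(mitigation_rate_at_tpr(true_scores, false_scores, target_tpr=1.0))  # (0.6, 0.75)
```

Fixing the true positive rate first and then reading off the mitigation rate mirrors how the abstract phrases its result: the cost to genuine users is held constant while the benefit is measured.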