Showing 1–50 of 59 results for author: Le, D

Searching in archive eess.
  1. arXiv:2406.10223  [pdf, other]

    cs.LG cs.SD eess.AS

    Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

    Authors: Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoe Abrams, Morgan McGuire

    Abstract: We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve M…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Published in Interspeech 2024

  2. arXiv:2406.07823  [pdf, other]

    cs.CL cs.SD eess.AS

    PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

    Authors: Trang Le, Daniel Lazar, Suyoun Kim, Shan Jiang, Duc Le, Adithya Sagar, Aleksandr Livshits, Ahmed Aly, Akshat Shrivastava

    Abstract: Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a no…

    Submitted 11 June, 2024; originally announced June 2024.

  3. arXiv:2405.00681  [pdf, other]

    eess.SP cs.IT cs.NI eess.SY

    Delay and Overhead Efficient Transmission Scheduling for Federated Learning in UAV Swarms

    Authors: Duc N. M. Hoang, Vu Tuan Truong, Hung Duy Le, Long Bao Le

    Abstract: This paper studies the wireless scheduling design to coordinate the transmissions of (local) model parameters of federated learning (FL) for a swarm of unmanned aerial vehicles (UAVs). The overall goal of the proposed design is to realize the FL training and aggregation processes with a central aggregator exploiting the sensory data collected by the UAVs but it considers the multi-hop wireless net…

    Submitted 22 February, 2024; originally announced May 2024.

    Comments: accepted to WCNC'24

  4. arXiv:2404.07385  [pdf, other]

    eess.SY

    Lyapunov-Based Deep Residual Neural Network (ResNet) Adaptive Control

    Authors: Omkar Sudhir Patil, Duc M. Le, Emily J. Griffis, Warren E. Dixon

    Abstract: Deep Neural Network (DNN)-based controllers have emerged as a tool to compensate for unstructured uncertainties in nonlinear dynamical systems. A recent breakthrough in the adaptive control literature provides a Lyapunov-based approach to derive weight adaptation laws for each layer of a fully-connected feedforward DNN-based adaptive controller. However, deriving weight adaptation laws from a Lyap…

    Submitted 10 April, 2024; originally announced April 2024.

  5. arXiv:2403.17392  [pdf, other]

    cs.RO eess.SY nlin.AO

    Natural-artificial hybrid swarm: Cyborg-insect group navigation in unknown obstructed soft terrain

    Authors: Yang Bai, Phuoc Thanh Tran Ngoc, Huu Duoc Nguyen, Duc Long Le, Quang Huy Ha, Kazuki Kai, Yu Xiang See To, Yaosheng Deng, Jie Song, Naoki Wakamiya, Hirotaka Sato, Masaki Ogura

    Abstract: Navigating multi-robot systems in complex terrains has always been a challenging task. This is due to the inherent limitations of traditional robots in collision avoidance, adaptation to unknown environments, and sustained energy efficiency. In order to overcome these limitations, this research proposes a solution by integrating living insects with miniature electronic controllers to enable roboti…

    Submitted 27 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

  6. arXiv:2403.08947  [pdf, other]

    eess.IV cs.CV

    Robust COVID-19 Detection in CT Images with CLIP

    Authors: Li Lin, Yamini Sri Krubha, Zhenhuan Yang, Cheng Ren, Thuc Duy Le, Irene Amerini, Xin Wang, Shu Hu

    Abstract: In the realm of medical imaging, particularly for COVID-19 detection, deep learning models face substantial challenges such as the necessity for extensive computational resources, the paucity of well-annotated datasets, and a significant amount of unlabeled data. In this work, we introduce the first lightweight detector designed to overcome these obstacles, leveraging a frozen CLIP image encoder a…

    Submitted 14 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  7. arXiv:2402.17467  [pdf, other]

    cs.IR cs.AI cs.SD eess.AS

    Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey

    Authors: Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller, Dorien Herremans

    Abstract: Several adaptations of Transformers models have been developed in various domains since its breakthrough in Natural Language Processing (NLP). This trend has spread into the field of Music Information Retrieval (MIR), including studies processing music data. However, the practice of leveraging NLP tools for symbolic music data is not novel in MIR. Music has been frequently compared to language, as…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 36 pages, 5 figures, 4 tables

  8. arXiv:2312.10518  [pdf, other]

    cs.SD cs.AI eess.AS

    Seq2seq for Automatic Paraphasia Detection in Aphasic Speech

    Authors: Matthew Perez, Duc Le, Amrit Romana, Elise Jones, Keli Licata, Emily Mower Provost

    Abstract: Paraphasias are speech errors that are often characteristic of aphasia and they represent an important signal in assessing disease severity and subtype. Traditionally, clinicians manually identify paraphasias by transcribing and analyzing speech-language samples, which can be a time-consuming and burdensome process. Identifying paraphasias automatically can greatly help clinicians with the transcr…

    Submitted 16 December, 2023; originally announced December 2023.

  9. arXiv:2312.08723  [pdf, other]

    cs.SD cs.LG eess.AS

    StemGen: A music generation model that listens

    Authors: Julian D. Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler, Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, Duc Le

    Abstract: End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be…

    Submitted 16 January, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted for publication at ICASSP 2024

  10. arXiv:2311.03318  [pdf, other]

    cs.SD cs.IR eess.AS

    A Foundation Model for Music Informatics

    Authors: Minz Won, Yun-Ning Hung, Duc Le

    Abstract: This paper investigates foundation models tailored for music informatics, a domain currently challenged by the scarcity of labeled data and generalization issues. To this end, we conduct an in-depth comparative study among various foundation model variants, examining key determinants such as model architectures, tokenization methods, temporal resolution, data, and model scalability. This research…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: 5 pages

  11. arXiv:2310.01353  [pdf, other]

    eess.AS cs.SD

    Scaling Up Music Information Retrieval Training with Semi-Supervised Learning

    Authors: Yun-Ning Hung, Ju-Chiang Wang, Minz Won, Duc Le

    Abstract: In the era of data-driven Music Information Retrieval (MIR), the scarcity of labeled data has been one of the major concerns to the success of an MIR task. In this work, we leverage the semi-supervised teacher-student training approach to improve MIR tasks. For training, we scale up the unlabeled music data to 240k hours, which is much larger than any public MIR datasets. We iteratively create and…

    Submitted 2 October, 2023; originally announced October 2023.

  12. arXiv:2307.16834  [pdf]

    cs.CV cs.AI cs.LG eess.IV

    Benchmarking Jetson Edge Devices with an End-to-end Video-based Anomaly Detection System

    Authors: Hoang Viet Pham, Thinh Gia Tran, Chuong Dinh Le, An Dinh Le, Hien Bich Vo

    Abstract: Innovative enhancement in embedded system platforms, specifically hardware accelerations, significantly influence the application of deep learning in real-world scenarios. These innovations translate human labor efforts into automated intelligent systems employed in various areas such as autonomous driving, robotics, Internet-of-Things (IoT), and numerous other impactful applications. NVIDIA's Jet…

    Submitted 12 September, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted in Future of Information and Communication Conference (FICC) 2024

  13. arXiv:2307.12134  [pdf, other]

    cs.CL cs.SD eess.AS

    Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

    Authors: Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer

    Abstract: End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently. This approach uses a single model that utilizes audio and text representations from pre-trained speech recognition models (ASR), and outperforms traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems still show weakness wh…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: INTERSPEECH 2023

  14. arXiv:2305.16333  [pdf, ps, other]

    cs.CL cs.AI cs.LG eess.AS

    Text Generation with Speech Synthesis for ASR Data Augmentation

    Authors: Zhuangqun Huang, Gil Keren, Ziran Jiang, Shashank Jain, David Goss-Grubbs, Nelson Cheng, Farnaz Abtahi, Duc Le, David Zhang, Antony D'Avirro, Ethan Campbell-Taylor, Jessie Salas, Irina-Elena Veliche, Xi Chen

    Abstract: Aiming at reducing the reliance on expensive human annotations, data synthesis for Automatic Speech Recognition (ASR) has remained an active area of research. While prior work mainly focuses on synthetic speech generation for ASR data augmentation, its combination with text generation methods is considerably less explored. In this work, we explore text augmentation for ASR using large-scale pre-tr…

    Submitted 22 May, 2023; originally announced May 2023.

  15. arXiv:2212.14840  [pdf]

    physics.med-ph eess.IV physics.bio-ph

    Normalized Blood Flow Index in Optical Coherence Tomography Angiography Provides a Sensitive Biomarker of Early Diabetic Retinopathy

    Authors: Albert K. Dadzie, David Le, Mansour Abtahi, Behrouz Ebrahimi, Taeyoon Son, Jennifer I. Lim, Xincheng Yao

    Abstract: Purpose: To evaluate the sensitivity of normalized blood flow index (NBFI) for detecting early diabetic retinopathy (DR). Methods: Optical coherence tomography angiography (OCTA) images of 30 eyes from 20 healthy controls, 21 eyes of diabetic patients with no DR (NoDR) and 26 eyes from 22 patients with mild non-proliferative DR (NPDR) were analyzed in this study. The OCTA images were centered on t…

    Submitted 22 December, 2022; originally announced December 2022.

  16. arXiv:2212.14353  [pdf, other]

    cs.DC eess.SP

    Sheaf-theoretic self-filtering network of low-cost sensors for local air quality monitoring: A causal approach

    Authors: Anh-Duy Pham, Chuong Dinh Le, Hoang Viet Pham, Thinh Gia Tran, Dat Thanh Vo, Chau Long Tran, An Dinh Le, Hien Bich Vo

    Abstract: Sheaf theory, which is a complex but powerful tool supported by topological theory, offers more flexibility and precision than traditional graph theory when it comes to modeling relationships between multiple features. In the realm of air quality monitoring, this can be incredibly useful in detecting sudden changes in local dust particle density, which can be difficult to accurately measure using…

    Submitted 29 December, 2022; originally announced December 2022.

  17. arXiv:2212.13257  [pdf]

    physics.med-ph cs.CV eess.IV eess.SY physics.optics

    A portable widefield fundus camera with high dynamic range imaging capability

    Authors: Alfa Rossi, Mojtaba Rahimi, David Le, Taeyoon Son, Michael J. Heiferman, R. V. Paul Chan, Xincheng Yao

    Abstract: Fundus photography is indispensable for clinical detection and management of eye diseases. Limited image contrast and field of view (FOV) are common limitations of conventional fundus cameras, making it difficult to detect subtle abnormalities at the early stages of eye diseases. Further improvements of image contrast and FOV coverage are important to improve early disease detection and reliable t…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: 12 pages, 8 figures

  18. arXiv:2212.07650  [pdf, other]

    eess.AS

    Improving Fast-slow Encoder based Transducer with Streaming Deliberation

    Authors: Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le

    Abstract: This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its latency low by integrating a streaming deliberation model. Specifically, the deliberation model leverages partial hypotheses from the streaming fast encoder and impl…

    Submitted 15 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  19. arXiv:2212.04313  [pdf]

    eess.SY

    Scalable, low-cost, and versatile system design for air pollution and traffic density monitoring and analysis

    Authors: Thinh Gia Tran, Dat Thanh Vo, Long Chau Tran, Hoang Viet Pham, Chuong Dinh Le, An Dinh Le, Duy Anh Pham, Hien Bich Vo

    Abstract: Vietnam requires a sustainable urbanization, for which city sensing is used in planning and decision-making. Large cities need portable, scalable, and inexpensive digital technology for this purpose. End-to-end air quality monitoring companies such as AirVisual and Plume Air have shown their reliability with portable devices outfitted with superior air sensors. They are pricey, yet homeowners use…

    Submitted 8 December, 2022; originally announced December 2022.

  20. arXiv:2211.05756  [pdf, other]

    cs.CL cs.SD eess.AS

    Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

    Authors: Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer

    Abstract: End-to-end multilingual ASR has become more appealing because of several reasons such as simplifying the training and deployment process and positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. W…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  21. arXiv:2211.00896  [pdf, other]

    eess.AS cs.SD

    Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers

    Authors: Duc Le, Frank Seide, Yuhao Wang, Yang Li, Kjell Schubert, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy. With the rise in popularity of neural-transducer type models like the RNN-T for on-device ASR, optimizing RNN-T's runtime efficiency is of great interest. While previous work has primarily focused on the optimization of RNN-…

    Submitted 4 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted for publication at ICASSP 2023

  22. arXiv:2211.00174  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition

    Authors: Suyoun Kim, Ke Li, Lucas Kabela, Rongqing Huang, Jiedan Zhu, Ozlem Kalinli, Duc Le

    Abstract: Recently, there has been an increasing interest in two-pass streaming end-to-end speech recognition (ASR) that incorporates a 2nd-pass rescoring model on top of the conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring model, Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-p…

    Submitted 31 October, 2022; originally announced November 2022.

    Journal ref: Findings of EMNLP 2022 short

  23. arXiv:2210.12097  [pdf, other]

    eess.SP cs.LG

    Robust Singular Values based on L1-norm PCA

    Authors: Duc Le, Panos P. Markopoulos

    Abstract: Singular-Value Decomposition (SVD) is a ubiquitous data analysis method in engineering, science, and statistics. Singular-value estimation, in particular, is of critical importance in an array of engineering applications, such as channel estimation in communication systems, electromyography signal analysis, and image compression, to name just a few. Conventional SVD of a data matrix coincides with…

    Submitted 21 October, 2022; originally announced October 2022.

  24. arXiv:2210.06297  [pdf, other]

    eess.SP cs.AI cs.LG

    Multimodality Multi-Lead ECG Arrhythmia Classification using Self-Supervised Learning

    Authors: Thinh Phan, Duc Le, Patel Brijesh, Donald Adjeroh, Jingxian Wu, Morten Olgaard Jensen, Ngan Le

    Abstract: Electrocardiogram (ECG) signal is one of the most effective sources of information mainly employed for the diagnosis and prediction of cardiovascular diseases (CVDs) connected with the abnormalities in heart rhythm. Clearly, single modality ECG (i.e. time series) cannot convey its complete characteristics, thus, exploiting both time and time-frequency modalities in the form of time-series data and…

    Submitted 30 September, 2022; originally announced October 2022.

  25. arXiv:2209.05735  [pdf, other]

    eess.AS cs.CL

    Learning ASR pathways: A sparse multilingual ASR model

    Authors: Mu Yang, Andros Tjandra, Chunxi Liu, David Zhang, Duc Le, Ozlem Kalinli

    Abstract: Neural network pruning compresses automatic speech recognition (ASR) models effectively. However, in multilingual ASR, language-agnostic pruning may lead to severe performance drops on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters. In this work, we present ASR pathways, a sparse multilingual ASR model that activa…

    Submitted 28 September, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

    Comments: Accepted by ICASSP 2023

  26. arXiv:2207.10643  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    STOP: A dataset for Spoken Task Oriented Semantic Parsing

    Authors: Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

    Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assi…

    Submitted 18 October, 2022; v1 submitted 28 June, 2022; originally announced July 2022.

  27. arXiv:2204.08858  [pdf, other]

    eess.AS cs.SD

    An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

    Authors: Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen

    Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination,…

    Submitted 21 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted to SLT 2022

  28. arXiv:2204.01893  [pdf, other]

    cs.CL eess.AS

    Deliberation Model for On-Device Spoken Language Understanding

    Authors: Duc Le, Akshat Shrivastava, Paden Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings. By formulating E2E SLU as a generalized decoder, ou…

    Submitted 6 September, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: Accepted for publication at INTERSPEECH 2022

  29. arXiv:2203.15773  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming parallel transducer beam search with fast-slow cascaded encoders

    Authors: Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L Seltzer

    Abstract: Streaming ASR with strict latency constraints is required in many speech recognition applications. In order to achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Interspeech 2022 submission

  30. arXiv:2201.12625  [pdf]

    eess.IV cs.CV q-bio.TO

    ADC-Net: An Open-Source Deep Learning Network for Automated Dispersion Compensation in Optical Coherence Tomography

    Authors: Shaiban Ahmed, David Le, Taeyoon Son, Tobiloba Adejumo, Xincheng Yao, Department of Biomedical Engineering, University of Illinois at Chicago, Department of Ophthalmology, Visual Science, University of Illinois at Chicago

    Abstract: Chromatic dispersion is a common problem to degrade the system resolution in optical coherence tomography (OCT). This study is to develop a deep learning network for automated dispersion compensation (ADC-Net) in OCT. The ADC-Net is based on a redesigned UNet architecture which employs an encoder-decoder pipeline. The input section encompasses partially compensated OCT B-scans with individual reti…

    Submitted 29 January, 2022; originally announced January 2022.

    Comments: 18 pages, 5 figures

  31. arXiv:2201.11867  [pdf, other]

    cs.CL cs.SD eess.AS

    Neural-FST Class Language Model for End-to-End Speech Recognition

    Authors: Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu, Bo Wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer

    Abstract: We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition, a novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework. Our method utilizes a background NNLM which models generic background text together with a collection of domain-specific entities modeled as individual FSTs. Each outpu…

    Submitted 31 January, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

    Comments: Accepted for publication at ICASSP 2022

  32. arXiv:2112.10957  [pdf, other]

    eess.SP

    RSSI prediction using Machine Learning models

    Authors: Tung Giang Le, Huy Tung Quach, Thu Thao Dao Le, Manh Hoang Tran

    Abstract: In this study, we present a method to predict the Received signal strength indication (RSSI) in an area of the base station. Traditional attenuated wave propagation models are often time consuming as well as computationally complex, depending on the unique factors of the medium. This study focuses on providing a solution to predict signal quality using coordinate values of many points in the consi…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 6 pages, in Vietnamese

  33. arXiv:2112.07775  [pdf]

    q-bio.TO eess.IV

    Depth-resolved vascular profile features for artery-vein classification in OCT and OCT angiography of human retina

    Authors: Tobiloba Adejumo, Tae-Hoon Kim, David Le, Taeyoon Son, Guangying Ma, Xincheng Yao

    Abstract: This study is to characterize reflectance profiles of retinal blood vessels in optical coherence tomography (OCT), and to validate these vascular features to guide artery-vein classification in OCT angiography (OCTA) of human retina. Depth-resolved OCT reveals unique features of retinal arteries and veins. Retinal arteries show hyper-reflective boundaries at both upper (inner side towards the vitr…

    Submitted 6 February, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: 11 pages, 4 figures

  34. arXiv:2111.05948  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling ASR Improves Zero and Few Shot Learning

    Authors: Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed

    Abstract: With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such a…

    Submitted 29 November, 2021; v1 submitted 10 November, 2021; originally announced November 2021.

  35. arXiv:2104.02232  [pdf, other]

    cs.SD cs.CL eess.AS

    Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

    Authors: Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provid…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021 (under review)

  36. arXiv:2104.02207  [pdf, other]

    cs.SD cs.CL eess.AS

    Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

    Authors: Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

    Abstract: As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate…

    Submitted 11 August, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Proc. of Interspeech 2021

  37. arXiv:2104.02194  [pdf, other]

    cs.CL cs.LG eess.AS

    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

    Authors: Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

    Abstract: How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that…

    Submitted 11 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted for presentation at INTERSPEECH 2021

  38. arXiv:2012.15029  [pdf, other]

    eess.IV

    VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations

    Authors: Ha Q. Nguyen, Khanh Lam, Linh T. Le, Hieu H. Pham, Dat Q. Tran, Dung B. Nguyen, Dung D. Le, Chi M. Pham, Hang T. T. Tong, Diep H. Dinh, Cuong D. Do, Luu T. Doan, Cuong N. Nguyen, Binh T. Nguyen, Que V. Nguyen, Au D. Hoang, Hien N. Phan, Anh T. Nguyen, Phuong H. Ho, Dat T. Ngo, Nghia T. Nguyen, Nhan T. Nguyen, Minh Dao, Van Vu

    Abstract: Most of the existing chest X-ray datasets include labels from a list of findings without specifying their locations on the radiographs. This limits the development of machine learning algorithms for the detection and localization of chest abnormalities. In this work, we describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam…

    Submitted 20 March, 2022; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: 11 pages, under review by Nature Scientific Data

  39. arXiv:2012.06834  [pdf, ps, other]

    eess.SY

    Deep Reinforcement Learning for Tropical Air Free-Cooled Data Center Control

    Authors: Duc Van Le, Rongrong Wang, Yingbo Liu, Rui Tan, Yew-Wah Wong, Yonggang Wen

    Abstract: Air free-cooled data centers (DCs) have not existed in the tropical zone due to the unique challenges of year-round high ambient temperature and relative humidity (RH). The increasing availability of servers that can tolerate higher temperatures and RH due to the regulatory bodies' prompts to raise DC temperature setpoints sheds light upon the feasibility of air free-cooled DCs in tropics. However…

    Submitted 12 December, 2020; originally announced December 2020.

    Journal ref: ACM Transactions on Sensor Networks, Special Issue on Computational Intelligence in Internet of Things, 2021

  40. arXiv:2011.07754  [pdf, other]

    cs.CL eess.AS

    Deep Shallow Fusion for RNN-T Personalization

    Authors: Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer

    Abstract: End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize compared to traditional hybrid systems due to the la…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: To appear at SLT 2021

  41. arXiv:2011.07673  [pdf, other

    eess.SY cs.MA

    Spatiotemporal Characteristics of Ride-sourcing Operation in Urban Area

    Authors: Simon Oh, Daniel Kondor, Ravi Seshadri, Meng Zhou, Diem-Trinh Le, Moshe Ben-Akiva

    Abstract: The emergence of ride-sourcing platforms has brought an innovative alternative in transportation, radically changed travel behaviors, and suggested new directions for transportation planners and operators. This paper provides an exploratory analysis on the operations of a ride-sourcing service using large-scale data on service performance. Observations over multiple days in Singapore suggest repro…

    Submitted 15 November, 2020; originally announced November 2020.

    Comments: 18 pages, 11 figures, 5 tables

  42. arXiv:2011.03109  [pdf, other

    cs.CL cs.SD eess.AS

    Improving RNN Transducer Based ASR with Auxiliary Tasks

    Authors: Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig

    Abstract: End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing aux…

    Submitted 8 November, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2021

  43. arXiv:2011.03072  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Alignment Restricted Streaming Recurrent Neural Network Transducer

    Authors: Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer

    Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for lon…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted for presentation at IEEE Spoken Language Technology Workshop (SLT) 2021

  44. arXiv:2010.10759  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

    Authors: Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, Mike Seltzer

    Abstract: This paper proposes an efficient memory transformer Emformer for low latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce self-attention's computation complexity. A cache mechanism saves the computation for the key and value in self-attention for the left context. Emformer applies a parallelized block processing in t…

    Submitted 30 December, 2020; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  45. Classification of Huntington Disease using Acoustic and Lexical Features

    Authors: Matthew Perez, Wenyu Jin, Duc Le, Noelle Carlozzi, Praveen Dayalu, Angela Roberts, Emily Mower Provost

    Abstract: Speech is a critical biomarker for Huntington Disease (HD), with changes in speech increasing in severity as the disease progresses. Speech analyses are currently conducted using either transcriptions created manually by trained professionals or using global rating scales. Manual transcription is both expensive and time-consuming and global rating scales may lack sufficient sensitivity and fidelit…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: 4 pages

  46. arXiv:2006.03742  [pdf

    eess.IV q-bio.QM

    AV-Net: Deep learning for fully automated artery-vein classification in optical coherence tomography angiography

    Authors: Minhaj Alam, David Le, Taeyoon Son, Jennifer I. Lim, Xincheng Yao

    Abstract: This study is to demonstrate deep learning for automated artery-vein (AV) classification in optical coherence tomography angiography (OCTA). The AV-Net, a fully convolutional network (FCN) based on modified U-shaped CNN architecture, incorporates enface OCT and OCTA to differentiate arteries and veins. For the multi-modal training process, the enface OCT works as a near infrared fundus image to pr…

    Submitted 5 June, 2020; originally announced June 2020.

  47. arXiv:2005.09137  [pdf, other

    eess.AS cs.CL

    Weak-Attention Suppression For Transformer Based Speech Recognition

    Authors: Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

    Abstract: Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. It suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propo…

    Submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to interspeech 2020

  48. arXiv:2002.04977  [pdf, ps, other

    physics.data-an cond-mat.supr-con eess.SP

    Critical Temperature Prediction for a Superconductor: A Variational Bayesian Neural Network Approach

    Authors: Thanh Dung Le, Rita Noumeir, Huu Luong Quach, Ji Hyung Kim, Jung Ho Kim, Ho Min Kim

    Abstract: Much research in recent years has focused on using empirical machine learning approaches to extract useful insights on the structure-property relationships of superconductor material. Notably, these approaches are bringing extreme benefits when superconductivity data often come from costly and arduously experimental work. However, this assessment cannot be based solely on an open black-box machine…

    Submitted 29 January, 2020; originally announced February 2020.

    Comments: IEEE Transactions on Applied Superconductivity, 2020

  49. arXiv:1910.12977  [pdf, other

    eess.AS cs.CL cs.SD

    Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

    Authors: Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

    Abstract: We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and comes with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-at…

    Submitted 28 October, 2019; originally announced October 2019.

  50. arXiv:1910.12612  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR

    Authors: Duc Le, Thilo Koehler, Christian Fuegen, Michael L. Seltzer

    Abstract: Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel…

    Submitted 13 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: To appear at ICASSP 2020