Search | arXiv e-print repository

CoCPF: Coordinate-based Continuous Projection Field for Ill-Posed Inverse Problem in Imaging

Authors: Zixuan Chen, Lingxiao Yang, Jian-Huang Lai, Xiaohua Xie

Abstract: Sparse-view computed tomography (SVCT) reconstruction aims to acquire CT images based on sparsely-sampled measurements. It exposes subjects to less ionizing radiation, reducing the lifetime risk of developing cancers. Recent research employs implicit neural representation (INR) techniques to reconstruct CT images from a single SV sinogram. However, due to ill-posedness, these INR-based methods may leave considerable "holes" (i.e., unmodeled spaces) in their fields, leading to sub-optimal results. In this paper, we propose the Coordinate-based Continuous Projection Field (CoCPF), which aims to build hole-free representation fields for SVCT reconstruction, achieving better reconstruction quality. Specifically, to fill the holes, CoCPF first employs the stripe-based volume sampling module to broaden the sampling regions of Radon transformation from rays (1D space) to stripes (2D space), which can well cover the internal regions between SV projections. Then, by feeding the sampling regions into the proposed differentiable rendering modules, the holes can be jointly optimized during training, reducing the degree of ill-posedness. As a result, CoCPF can accurately estimate the internal measurements between SV projections (i.e., DV sinograms), producing high-quality CT images after re-projection. Extensive experiments on simulated and real projection datasets demonstrate that CoCPF outperforms state-of-the-art methods for 2D and 3D SVCT reconstructions under various projection numbers and geometries, yielding fine-grained details and fewer artifacts. Our code will be publicly available.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2405.07037 [pdf, other]

Robust Online Convex Optimization for Disturbance Rejection

Authors: Joyce Lai, Peter Seiler

Abstract: Online convex optimization (OCO) is a powerful tool for learning sequential data, making it ideal for high precision control applications where the disturbances are arbitrary and unknown in advance. However, the ability of OCO-based controllers to accurately learn the disturbance while maintaining closed-loop stability relies on having an accurate model of the plant. This paper studies the performance of OCO-based controllers for linear time-invariant (LTI) systems subject to disturbance and model uncertainty. The model uncertainty can cause the closed-loop to become unstable. We provide a sufficient condition for robust stability based on the small gain theorem. This condition is easily incorporated as an on-line constraint in the OCO controller. Finally, we verify via numerical simulations that imposing the robust stability condition on the OCO controller ensures closed-loop stability.

Submitted 11 May, 2024; originally announced May 2024.

arXiv:2401.01755 [pdf, other]

Incremental FastPitch: Chunk-based High Quality Text to Speech

Authors: Muyang Du, Chuan Liu, Junjie Lai

Abstract: Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they are naturally unfit for incremental synthesis due to their fully parallel architectures, such as the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and inference with fixed size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with significantly lower latency that allows even lower response time for real-time speech applications.

Submitted 3 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, 1 table

arXiv:2312.17508 [pdf, ps, other]

doi 10.21437/Interspeech.2023-39

Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

Authors: Yun Chen, Lingxiao Yang, Qi Chen, Jian-Huang Lai, Xiaohua Xie

Abstract: Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components. Existing approaches cannot express fine-grained emotional attributes well. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to effectively train our network: Stage I utilizes inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion and content. In Stage II, we propose to regularize the conversion with a multi-view consistency mechanism. This technique helps us transfer fine-grained emotion and maintain speech content. Extensive experiments show that our AINN outperforms state-of-the-art methods in both objective and subjective metrics.

Submitted 29 December, 2023; originally announced December 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2310.13259 [pdf]

Domain-specific optimization and diverse evaluation of self-supervised models for histopathology

Authors: Jeremy Lai, Faruk Ahmed, Supriya Vijay, Tiam Jaroensri, Jessica Loo, Saurabh Vyawahare, Saloni Agarwal, Fayaz Jamil, Yossi Matias, Greg S. Corrado, Dale R. Webster, Jonathan Krause, Yun Liu, Po-Hsuan Cameron Chen, Ellery Wulczyn, David F. Steiner

Abstract: Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high quality foundation models to enable further research across diverse applications.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 4 main tables, 3 main figures, additional supplemental tables and figures

arXiv:2310.07654 [pdf, other]

Audio-Visual Neural Syntax Acquisition

Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2309.09843 [pdf, other]

Instruction-Following Speech Recognition

Authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang

Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2307.05270 [pdf, other]

APRF: Anti-Aliasing Projection Representation Field for Inverse Problem in Imaging

Authors: Zixuan Chen, Lingxiao Yang, Jianhuang Lai, Xiaohua Xie

Abstract: Sparse-view Computed Tomography (SVCT) reconstruction is an ill-posed inverse problem in imaging that aims to acquire high-quality CT images based on sparsely-sampled measurements. Recent works use Implicit Neural Representations (INRs) to build the coordinate-based mapping between sinograms and CT images. However, these methods have not considered the correlation between adjacent projection views, resulting in aliasing artifacts on SV sinograms. To address this issue, we propose a self-supervised SVCT reconstruction method -- Anti-Aliasing Projection Representation Field (APRF), which can build the continuous representation between adjacent projection views via spatial constraints. Specifically, APRF only needs SV sinograms for training. It first employs a line-segment sampling module to estimate the distribution of projection views in a local region, and then synthesizes the corresponding sinogram values using a center-based line integral module. After training on a single SV sinogram, APRF can synthesize the corresponding dense-view (DV) sinogram with consistent continuity. High-quality CT images can be obtained by applying re-projection techniques on the predicted DV sinograms. Extensive experiments on CT images demonstrate that APRF outperforms state-of-the-art methods, yielding more accurate details and fewer artifacts. Our code will be publicly available soon.

Submitted 11 July, 2023; originally announced July 2023.

arXiv:2305.11686 [pdf, other]

Domain Adaptive Sim-to-Real Segmentation of Oropharyngeal Organs Towards Robot-assisted Intubation

Authors: Guankun Wang, Tian-Ao Ren, Jiewen Lai, Long Bai, Hongliang Ren

Abstract: Robotic-assisted tracheal intubation requires the robot to distinguish anatomical features like an experienced physician using deep-learning techniques. However, real datasets of oropharyngeal organs are limited due to patient privacy issues, making it challenging to train deep-learning models for accurate image segmentation. We hereby consider generating a new data modality through a virtual environment to assist the training process. Specifically, this work introduces a virtual dataset generated by the Simulation Open Framework Architecture (SOFA) framework to overcome the limited availability of actual endoscopic images. We also propose a domain adaptive Sim-to-Real method for oropharyngeal organ image segmentation, which employs an image blending strategy called IoU-Ranking Blend (IRB) and style-transfer techniques to address discrepancies between datasets. Experimental results demonstrate the superior performance of the proposed approach with domain adaptive models, improving segmentation accuracy and training stability. In the practical application, the trained segmentation model holds great promise for robot-assisted intubation surgery and intelligent surgical navigation.

Submitted 27 June, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Extended abstract in IEEE ICRA 2023 Workshop (New Evolutions in Surgical Robotics: Embracing Multimodal Imaging Guidance, Intelligence, and Bio-inspired Mechanisms). arXiv admin note: text overlap with arXiv:2305.10883

arXiv:2305.10883 [pdf, other]

Domain Adaptive Sim-to-Real Segmentation of Oropharyngeal Organs

Authors: Guankun Wang, Tian-Ao Ren, Jiewen Lai, Long Bai, Hongliang Ren

Abstract: Video-assisted transoral tracheal intubation (TI) necessitates using an endoscope that helps the physician insert a tracheal tube into the glottis instead of the esophagus. The growing trend of robotic-assisted TI would require a medical robot to distinguish anatomical features like an experienced physician, which can be imitated by utilizing supervised deep-learning techniques. However, real datasets of oropharyngeal organs are often inaccessible due to limited open-source data and patient privacy. In this work, we propose a domain adaptive Sim-to-Real framework called IoU-Ranking Blend-ArtFlow (IRB-AF) for image segmentation of oropharyngeal organs. The framework includes an image blending strategy called IoU-Ranking Blend (IRB) and the style-transfer method ArtFlow. Here, IRB alleviates the problem of poor segmentation performance caused by significant domain differences between datasets, while ArtFlow is introduced to further reduce the discrepancies between datasets. A virtual oropharynx image dataset generated by the SOFA framework is used as the learning subject for semantic segmentation to deal with the limited availability of actual endoscopic images. We adapted IRB-AF with the state-of-the-art domain adaptive segmentation models. The results demonstrate the superior performance of our approach in further improving the segmentation accuracy and training stability.

Submitted 27 July, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: The manuscript is accepted by Medical & Biological Engineering & Computing. Code and dataset: https://github.com/gkw0010/EISOST-Sim2Real-Dataset-Release

arXiv:2303.16242 [pdf, other]

CuNeRF: Cube-Based Neural Radiance Field for Zero-Shot Medical Image Arbitrary-Scale Super Resolution

Authors: Zixuan Chen, Jian-Huang Lai, Lingxiao Yang, Xiaohua Xie

Abstract: Medical image arbitrary-scale super-resolution (MIASSR) has recently gained widespread attention, aiming to super sample medical volumes at arbitrary scales via a single model. However, existing MIASSR methods face two major limitations: (i) reliance on high-resolution (HR) volumes and (ii) limited generalization ability, which restricts their application in various scenarios. To overcome these limitations, we propose Cube-based Neural Radiance Field (CuNeRF), a zero-shot MIASSR framework that can yield medical images at arbitrary scales and viewpoints in a continuous domain. Unlike existing MIASSR methods that fit the mapping between low-resolution (LR) and HR volumes, CuNeRF focuses on building a coordinate-intensity continuous representation from LR volumes without the need for HR references. This is achieved by the proposed differentiable modules: cube-based sampling, isotropic volume rendering, and cube-based hierarchical rendering. Through extensive experiments on magnetic resonance imaging (MRI) and computed tomography (CT) modalities, we demonstrate that CuNeRF outperforms state-of-the-art MIASSR methods. CuNeRF yields better visual verisimilitude and reduces aliasing artifacts at various upsampling factors. Moreover, CuNeRF does not need any LR-HR training pairs, which makes it more flexible and easier to use than other methods. Our code is released at https://github.com/NarcissusEx/CuNeRF.

Submitted 16 April, 2024; v1 submitted 28 March, 2023; originally announced March 2023.

Comments: This paper is accepted by the International Conference on Computer Vision (ICCV) 2023

arXiv:2303.14133 [pdf, other]

Adversarial Attack and Defense for Medical Image Analysis: Methods and Applications

Authors: Junhao Dong, Junxi Chen, Xiaohua Xie, Jianhuang Lai, Hao Chen

Abstract: Deep learning techniques have achieved superior performance in computer-aided medical image analysis, yet they are still vulnerable to imperceptible adversarial attacks, resulting in potential misdiagnosis in clinical practice. Conversely, recent years have also witnessed remarkable progress in defense against these tailored adversarial examples in deep medical diagnosis systems. In this exposition, we present a comprehensive survey on recent advances in adversarial attack and defense for medical image analysis with a novel taxonomy in terms of the application scenario. We also provide a unified theoretical framework for different types of adversarial attack and defense methods for medical image analysis. For a fair comparison, we establish a new benchmark for adversarially robust medical diagnosis models obtained by adversarial training under various scenarios. To the best of our knowledge, this is the first survey paper that provides a thorough evaluation of adversarially robust medical diagnosis models. By analyzing qualitative and quantitative results, we conclude this survey with a detailed discussion of current challenges for adversarial attack and defense in medical image analysis systems to shed light on future research directions.

Submitted 24 March, 2023; originally announced March 2023.

arXiv:2211.13939 [pdf, other]

Efficient Incremental Text-to-Speech on GPUs

Authors: Muyang Du, Chuan Liu, Jiaxing Qi, Junjie Lai

Abstract: Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most of the existing speech synthesis pipelines deployed on GPU are still non-incremental, which uncovers limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural network models. To address this issue, we present a highly efficient approach to perform real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method is capable of producing high-quality speech with a first-chunk latency lower than 80ms under 100 QPS on a single NVIDIA A10 GPU and significantly outperforms the non-incremental twin in both concurrency and latency. Our work reveals the effectiveness of high-performance incremental TTS on GPUs.

Submitted 5 December, 2022; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: 5 pages, 4 figures

arXiv:2211.04717 [pdf, other]

Improving Noisy Student Training on Non-target Domain Data for Automatic Speech Recognition

Authors: Yu Chen, Wen Ding, Junjie Lai

Abstract: Noisy Student Training (NST) has recently demonstrated extremely strong performance in Automatic Speech Recognition (ASR). In this paper, we propose a data selection strategy named LM Filter to improve the performance of NST on non-target domain data in ASR tasks. Hypotheses with and without a Language Model are generated and the CER differences between them are utilized as a filter threshold. Results reveal significant improvements of 10.4% compared with baselines without data filtering. We achieve 3.31% CER on the AISHELL-1 test set, which is, to our knowledge, the best result without any other supervised data. We also perform evaluations on the supervised 1000 hour AISHELL-2 dataset, where a competitive result of 4.73% CER can be achieved.

Submitted 1 March, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: This paper is accepted by the ICASSP 2023 conference
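
The data selection idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the "CER difference" is the character error rate of the hypothesis decoded without a language model, measured against the hypothesis decoded with one, and that an utterance is kept when this value is at or below a threshold. The function names and the threshold value of 0.1 are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp relative to ref."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def lm_filter(utterances, threshold=0.1):
    """Keep pseudo-labeled utterances whose two hypotheses agree closely.

    `utterances` is an iterable of (utt_id, hyp_with_lm, hyp_without_lm).
    The agreement measure and the threshold are assumptions for illustration.
    """
    kept = []
    for utt_id, hyp_with_lm, hyp_without_lm in utterances:
        if cer(hyp_with_lm, hyp_without_lm) <= threshold:
            kept.append((utt_id, hyp_with_lm))
    return kept
```

Under these assumptions, an utterance whose two decodings are identical is always kept, while one whose decodings diverge substantially is dropped from the pseudo-labeled training set.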

arXiv:2207.00001 [pdf]

MultiEarth 2022 -- The Champion Solution for Image-to-Image Translation Challenge via Generation Models

Authors: Yuchuan Gou, Bo Peng, Hongchen Liu, Hang Zhou, Jui-Hsin Lai

Abstract: The MultiEarth 2022 Image-to-Image Translation challenge provides a well-constrained test bed for generating the corresponding RGB Sentinel-2 imagery with the given Sentinel-1 VV & VH imagery. In this challenge, we designed various generation models and found that the SPADE [1] and pix2pixHD [2] models produced our best results. In our self-evaluation, the SPADE-2 model with L1-loss achieves an MAE score of 0.02194 and a PSNR of 31.092 dB. In our final submission, the best model achieves an MAE score of 0.02795, ranking No. 1 on the leaderboard.

Submitted 17 June, 2022; originally announced July 2022.

Comments: CVPR 2022, MultiEarth 2022, Image-to-Image translation, competition

arXiv:2204.02524 [pdf, other]

Simple and Effective Unsupervised Speech Synthesis

Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.

Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: preprint, equal contribution from first two authors

arXiv:2203.06849 [pdf, other]

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

Authors: Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing pre-trained model parameters, only using simple task-specific trainable heads. The goal is to be inclusive of all researchers, and encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.

Submitted 14 March, 2022; originally announced March 2022.

Comments: ACL 2022 main conference

arXiv:2110.09784 [pdf, other]

SSAST: Self-Supervised Audio Spectrogram Transformer

Authors: Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

Abstract: Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to reduce the need for large amounts of labeled data for AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.

Submitted 10 February, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: Accepted at AAAI2022. Code at https://github.com/YuanGongND/ssast

arXiv:2110.01147 [pdf, other]

On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.

Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

arXiv:2107.07873 [pdf]

Metasurface-Enabled On-Chip Multiplexed Diffractive Neural Networks in the Visible

Authors: Xuhao Luo, Yueqiang Hu, Xin Li, Xiangnian Ou, Jiajie Lai, Na Liu, Huigao Duan

Abstract: Replacing electrons with photons is a compelling route towards light-speed, highly parallel, and low-power artificial intelligence computing. Recently, all-optical diffractive deep neural networks have been demonstrated. However, the existing architectures often comprise bulky components and, most critically, they cannot mimic the human brain for multitasking. Here, we demonstrate a multi-skilled diffractive neural network based on a metasurface device, which can perform on-chip multi-channel sensing and multitasking at the speed of light in the visible. The metasurface is integrated with a complementary metal oxide semiconductor imaging sensor. A polarization multiplexing scheme of the subwavelength nanostructures is applied to construct a multi-channel classifier framework for simultaneous recognition of digital and fashionable items. The areal density of the artificial neurons can reach up to 6.25×10^6/mm^2 multiplied by the number of channels. Our platform provides an integrated solution with all-optical on-chip sensing and computing for applications in machine vision, autonomous driving, and precision medicine.

Submitted 13 July, 2021; originally announced July 2021.

arXiv:2106.05933 [pdf, other]

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

Authors: Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass

Abstract: Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted pruning methods such as the Lottery Ticket Hypothesis (LTH) is suboptimal in the computational cost needed. Moreover, we show that the discovered subnetworks yield minimal performance gain compared to the original dense network. We present Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better performance, while only requiring a single downstream ASR finetuning run. PARP is inspired by our surprising observation that subnetworks pruned for pre-training tasks need merely a slight adjustment to achieve a sizeable performance boost in downstream ASR tasks. Extensive experiments on low-resource ASR verify (1) sparse subnetworks exist in mono-lingual/multi-lingual pre-trained speech SSL, and (2) the computational advantage and performance gain of PARP over baseline pruning methods. In particular, on the 10min Librispeech split without LM decoding, PARP discovers subnetworks from wav2vec 2.0 with an absolute 10.9%/12.6% WER decrease compared to the full model. We further demonstrate the effectiveness of PARP via: cross-lingual pruning without any phone recognition degradation, the discovery of a multi-lingual subnetwork for 10 spoken languages in 1 finetuning run, and its applicability to pre-trained BERT/XLNet for natural language tasks.

Submitted 26 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

arXiv:2105.01051 [pdf, ps, other]

SUPERB: Speech processing Universal PERformance Benchmark

Authors: Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.

Submitted 15 October, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

Comments: To appear in Interspeech 2021

arXiv:2104.07200 [pdf, other]

A Novel Unified Framework for Solving Reachability, Viability and Invariance Problems

Authors: Wei Liao, Taotao Liang, Xiaohui Wei, Jizhou Lai

Abstract: The level set method is a widely used tool for solving reachability and invariance problems. However, some shortcomings, such as the difficulties of handling the dissipation function and constructing terminal conditions for solving the Hamilton-Jacobi partial differential equation, limit the application of the level set method in some problems with non-affine nonlinear systems and irregular target sets. This paper proposes a method that can effectively avoid the above tricky issues and thus has better generality. In the proposed method, the reachable or invariant sets with different time horizons are characterized by some non-zero sublevel sets of a value function. This value function is not obtained by solving a viscosity solution of the partial differential equation but by recursion and interpolation approximation. At the end of this paper, some examples are given to illustrate the accuracy and generality of the proposed method.

Submitted 29 November, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

Comments: arXiv admin note: text overlap with arXiv:2101.09646

arXiv:2103.05576 [pdf, other]

Distributed Frequency Restoration and SoC Balancing Control for AC Microgrids

Authors: Chang Yu, Xiaoqing Lu, **gang Lai, Li Chai

Abstract: This paper develops an improved distributed finite-time control algorithm for multiagent-based ac microgrids with battery energy storage systems (BESSs) utilizing a low-width communication network. The proposed control algorithm can simultaneously coordinate BESSs to eliminate any deviation from the nominal frequency and solve the state of charge (SoC) balancing problem. The stability of the proposed control algorithm is established using the Lyapunov method and homogeneous approximation theory, which guarantees an accelerated convergence within a settling time that does not depend on initial conditions. Based on this, to significantly reduce the communication burden, an event-triggered communication mechanism is designed which can also avoid Zeno behavior. Then sufficient conditions on the event-triggered boundary are derived to guarantee the stability and reliability of the whole system. Practical local constraints are imposed to implement the control protocol, and the theoretical results are applied to a test system consisting of five DGs and five BESSs, which verifies the effectiveness of the proposed strategy.

Submitted 9 March, 2021; originally announced March 2021.

arXiv:2012.09131 [pdf, other]

Personal Mental Health Navigator: Harnessing the Power of Data, Personal Models, and Health Cybernetics to Promote Psychological Well-being

Authors: Amir M. Rahmani, Jocelyn Lai, Salar Jafarlou, Asal Yunusova, Alex. P. Rivera, Sina Labbaf, Sirui Hu, Arman Anzanpour, Nikil Dutt, Ramesh Jain, Jessica L. Borelli

Abstract: Traditionally, the regime of mental healthcare has followed an episodic psychotherapy model wherein patients seek care from a provider through a prescribed treatment plan developed over multiple provider visits. Recent advances in wearable and mobile technology have generated increased interest in digital mental healthcare that enables individuals to address episodic mental health symptoms. However, these efforts are typically reactive and symptom-focused and do not provide comprehensive, wrap-around, customized treatments that capture an individual's holistic mental health model as it unfolds over time. Recognizing that each individual is unique, we present the notion of Personalized Mental Health Navigation (MHN): a therapist-in-the-loop, cybernetic goal-based system that deploys a continuous cyclic loop of measurement, estimation, and guidance to steer the individual's mental health state towards a healthy zone. We outline the major components of MHN, which is premised on the development of an individual's personal mental health state, holistically represented by a high-dimensional cover of multiple knowledge layers such as emotion, biological patterns, sociology, behavior, and cognition. We demonstrate the feasibility of the personalized MHN approach via a 12-month pilot case study for holistic stress management in college students and highlight an instance of a therapist-in-the-loop intervention using MHN for monitoring, estimating, and proactively addressing moderately severe depression over a sustained period of time. We believe MHN paves the way to transform mental healthcare from the current passive, episodic, reactive process (where individuals seek help to address symptoms that have already manifested) to a continuous and navigational paradigm that leverages a personalized model of the individual, promising to deliver timely interventions to individuals in a holistic manner.

Submitted 15 December, 2020; originally announced December 2020.

arXiv:2011.06209 [pdf, other]

Recursive Regret Matching: A General Method for Solving Time-invariant Nonlinear Zero-sum Differential Games

Authors: Wei Liao, Xiaohui Wei, Jizhou Lai

Abstract: In this paper, a new method is proposed to compute the rolling Nash equilibrium of time-invariant nonlinear two-person zero-sum differential games. The idea is to discretize the time to transform a differential game into a sequential game with several steps and, by introducing a state-value function, to transform the sequential game into a recursion consisting of several normal-form games. Finally, each normal-form game is solved with action abstraction and regret matching. To improve the real-time property of the proposed method, the state-value function can be kept in memory. This method can deal with situations in which the saddle point exists or does not exist, and the analysis of the existence of the saddle point can be avoided. If the saddle point does not exist, the mixed optimal control pair can be obtained. At the end of this paper, some examples are given to illustrate the validity of the proposed method.

Submitted 12 November, 2020; originally announced November 2020.

Comments: 18 pages, 9 figures

MSC Class: 91-08; 93-08

arXiv:2010.06236 [pdf, other]

Average Cost Optimal Control of Stochastic Systems Using Reinforcement Learning

Authors: **g Lai, Junlin Xiong

Abstract: This paper addresses the average cost minimization problem for discrete-time systems with multiplicative and additive noises via reinforcement learning. Using the Q-function, we propose an online learning scheme to estimate the kernel matrix of the Q-function and to update the control gain using the data along the system trajectories. The obtained control gain and kernel matrix are proved to converge to the optimal ones. To implement the proposed learning scheme, an online model-free reinforcement learning algorithm is given, where the recursive least squares method is used to estimate the kernel matrix of the Q-function. A numerical example is presented to illustrate the proposed approach.

Submitted 13 October, 2020; originally announced October 2020.

Comments: 6 pages, 2 figures

arXiv:2008.08734 [pdf, ps, other]

Model-free optimal control of discrete-time systems with additive and multiplicative noises

Authors: **g Lai, Junlin Xiong, Zhan Shu

Abstract: This paper investigates the optimal control problem for a class of discrete-time stochastic systems subject to additive and multiplicative noises. A stochastic Lyapunov equation and a stochastic algebraic Riccati equation are established for the existence of the optimal admissible control policy. A model-free reinforcement learning algorithm is proposed to learn the optimal admissible control policy using the data of the system states and inputs without requiring any knowledge of the system matrices. It is proven that the learning algorithm converges to the optimal admissible control policy. The implementation of the model-free algorithm is based on batch least squares and numerical average. The proposed algorithm is illustrated through a numerical example, which shows that our algorithm outperforms other policy iteration algorithms.

Submitted 19 August, 2020; originally announced August 2020.

Comments: 8 pages, 3 figures

Showing 1–28 of 28 results for author: Lai, J