-
A Plug-and-Play Untrained Neural Network for Full Waveform Inversion in Reconstructing Sound Speed Images of Ultrasound Computed Tomography
Authors:
Weicheng Yan,
Qiude Zhang,
Yun Wu,
Zhaohui Liu,
Liang Zhou,
Mingyue Ding,
Ming Yuchi,
Wu Qiu
Abstract:
Ultrasound computed tomography (USCT), as an emerging technology, can provide multiple quantitative parametric images of human tissue, such as sound speed and attenuation images, distinguishing it from conventional B-mode (reflection) ultrasound imaging. Full waveform inversion (FWI) is acknowledged as a technique with the greatest potential for reconstructing high-resolution sound speed images in USCT. However, traditional FWI for sound speed image reconstruction suffers from high sensitivity to the initial model caused by its strong non-convex nonlinearity, resulting in poor performance when ultrasound signals are at high frequencies. This limitation significantly restricts the application of FWI in the USCT imaging field. In this paper, we propose an untrained neural network (UNN) that can be integrated into the traditional iteration-based FWI framework as an implicit regularization prior. This integration allows for seamless deployment as a plug-and-play module within existing FWI algorithms or their variants. Notably, the proposed UNN method can be trained in an unsupervised fashion, a vital aspect in medical imaging where ground truth data is often unavailable. Evaluations of the numerical simulation and phantom experiment of the breast demonstrate that the proposed UNN improves the robustness of image reconstruction, reduces image artifacts, and achieves great image contrast. To the best of our knowledge, this study represents the first attempt to propose an implicit UNN for FWI in reconstructing sound speed images for USCT.
Submitted 13 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Authors:
Bing Han,
Long Zhou,
Shujie Liu,
Sanyuan Chen,
Lingwei Meng,
Yanming Qian,
Yanqing Liu,
Sheng Zhao,
Jinyu Li,
Furu Wei
Abstract:
With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling-based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings huge computational overhead to the inference process of autoregression. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system, building upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in the shallow quantization layer, thereby accelerating the decoding speed while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllability over phonemes and demonstrates its strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: https://aka.ms/valler.
Submitted 12 June, 2024;
originally announced June 2024.
-
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Authors:
Sanyuan Chen,
Shujie Liu,
Long Zhou,
Yanqing Liu,
Xu Tan,
Jinyu Li,
Sheng Zhao,
Yao Qian,
Furu Wei
Abstract:
This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. See https://aka.ms/valle2 for demos of VALL-E 2.
Submitted 17 June, 2024; v1 submitted 8 June, 2024;
originally announced June 2024.
-
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
Authors:
Chenyang Le,
Yao Qian,
Dongmei Wang,
Long Zhou,
Shujie Liu,
Xiaofei Wang,
Midia Yousefi,
Yanmin Qian,
Jinyu Li,
Sheng Zhao,
Michael Zeng
Abstract:
There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
Submitted 28 May, 2024;
originally announced May 2024.
-
Physics-informed Score-based Diffusion Model for Limited-angle Reconstruction of Cardiac Computed Tomography
Authors:
Shuo Han,
Yongshun Xu,
Dayang Wang,
Bahareh Morovati,
Li Zhou,
Jonathan S. Maltz,
Ge Wang,
Hengyong Yu
Abstract:
Cardiac computed tomography (CT) has emerged as a major imaging modality for the diagnosis and monitoring of cardiovascular diseases. High temporal resolution is essential to ensure diagnostic accuracy. Limited-angle data acquisition can reduce scan time and improve temporal resolution, but typically leads to severe image degradation and motivates improved reconstruction techniques. In this paper, we propose a novel physics-informed score-based diffusion model (PSDM) for limited-angle reconstruction of cardiac CT. At the sampling time, we combine a data prior from a diffusion model and a model prior obtained via an iterative algorithm and Fourier fusion to further enhance the image quality. Specifically, our approach integrates the primal-dual hybrid gradient (PDHG) algorithm with score-based diffusion models, thereby enabling us to reconstruct high-quality cardiac CT images from limited-angle data. The numerical simulations and real data experiments confirm the effectiveness of our proposed approach.
Submitted 23 May, 2024;
originally announced May 2024.
-
LighTDiff: Surgical Endoscopic Image Low-Light Enhancement with T-Diffusion
Authors:
Tong Chen,
Qingcheng Lyu,
Long Bai,
Erjian Guo,
Huxin Gao,
Xiaoxiao Yang,
Hongliang Ren,
Luping Zhou
Abstract:
Advances in endoscopy use in surgeries face challenges like inadequate lighting. Deep learning, notably the Denoising Diffusion Probabilistic Model (DDPM), holds promise for low-light image enhancement in the medical field. However, DDPMs are computationally demanding and slow, limiting their practical medical applications. To bridge this gap, we propose a lightweight DDPM, dubbed LighTDiff. It adopts a T-shape model architecture to capture global structural information using low-resolution images and gradually recover the details in subsequent denoising steps. We further prune the model to significantly reduce the model size while retaining performance. While discarding certain downsampling operations to save parameters leads to instability and low efficiency in convergence during the training, we introduce a Temporal Light Unit (TLU), a plug-and-play module, for more stable training and better performance. TLU associates time steps with denoised image features, establishing temporal dependencies of the denoising steps and improving denoising outcomes. Moreover, while recovering images using the diffusion model, potential spectral shifts were noted. We further introduce a Chroma Balancer (CB) to mitigate this issue. Our LighTDiff outperforms many competitive LLIE methods with exceptional computational efficiency.
Submitted 17 May, 2024;
originally announced May 2024.
-
Underdetermined DOA Estimation of Off-Grid Sources Based on the Generalized Double Pareto Prior
Authors:
Yongfeng Huang,
Zhendong Chen,
Kun Ye,
Lang Zhou,
Haixin Sun
Abstract:
In this letter, we investigate a new generalized double Pareto based on off-grid sparse Bayesian learning (GDPOGSBL) approach to improve the performance of direction of arrival (DOA) estimation in underdetermined scenarios. The method aims to enhance the sparsity of source signal by utilizing the generalized double Pareto (GDP) prior. Firstly, we employ a first-order linear Taylor expansion to model the real array manifold matrix, and Bayesian inference is utilized to calculate the off-grid error, which mitigates the grid dictionary mismatch problem in underdetermined scenarios. Secondly, an innovative grid refinement method is introduced, treating grid points as iterative parameters to minimize the modeling error between the source and grid points. The numerical simulation results verify the superiority of the proposed strategy, especially when dealing with a coarse grid and few snapshots.
Submitted 17 May, 2024; v1 submitted 18 April, 2024;
originally announced May 2024.
-
Group-aware Parameter-efficient Updating for Content-Adaptive Neural Video Compression
Authors:
Zhenghao Chen,
Luping Zhou,
Zhihao Hu,
Dong Xu
Abstract:
Content-adaptive compression is crucial for enhancing the adaptability of the pre-trained neural codec for various contents. Although these methods have been very practical in neural image compression (NIC), their application in neural video compression (NVC) is still limited due to two main aspects: 1), video compression relies heavily on temporal redundancy, therefore updating just one or a few frames can lead to significant errors accumulating over time; 2), NVC frameworks are generally more complex, with many large components that are not easy to update quickly during encoding. To address the previously mentioned challenges, we have developed a content-adaptive NVC technique called Group-aware Parameter-Efficient Updating (GPU). Initially, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters. This involves adopting a patch-based Group of Pictures (GoP) training strategy to segment a video into patch-based GoPs, which will be updated to facilitate a globally optimized domain-transferable solution. Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several light-weight adapters into each coding component of the encoding process by both serial and parallel configuration. Such architecture-agnostic modules stimulate the components with large parameters, thereby reducing both the update cost and the encoding time. We incorporate our GPU into the latest NVC framework and conduct comprehensive experiments, whose results showcase outstanding video compression efficiency across four video benchmarks and adaptability of one medical image benchmark.
Submitted 7 May, 2024;
originally announced May 2024.
-
Exponentially Consistent Outlier Hypothesis Testing for Continuous Sequences
Authors:
Lina Zhu,
Lin Zhou
Abstract:
In outlier hypothesis testing, one aims to detect outlying sequences among a given set of sequences, where most sequences are generated i.i.d. from a nominal distribution while outlying sequences (outliers) are generated i.i.d. from a different anomalous distribution. Most existing studies focus on discrete-valued sequences, where each data sample takes values in a finite set. To account for practical scenarios where data sequences usually take real values, we study outlier hypothesis testing for continuous sequences when both the nominal and anomalous distributions are unknown. Specifically, we propose distribution-free tests and prove that the probabilities of misclassification error, false reject and false alarm decay exponentially fast for three different test designs: fixed-length test, sequential test, and two-phase test. In a fixed-length test, one fixes the sample size of each observed sequence; in a sequential test, one takes a sample sequentially from each sequence per unit time until a reliable decision can be made; in a two-phase test, one adapts the sample size from two different fixed values. Remarkably, the two-phase test achieves a good balance between test design complexity and theoretical performance. We first consider the case of at most one outlier, and then generalize our results to the case with multiple outliers where the number of outliers is unknown.
Submitted 2 May, 2024;
originally announced May 2024.
-
Domain-Transferred Synthetic Data Generation for Improving Monocular Depth Estimation
Authors:
Seungyeop Lee,
Knut Peterson,
Solmaz Arezoomandan,
Bill Cai,
Peihan Li,
Lifeng Zhou,
David Han
Abstract:
A major obstacle to the development of effective monocular depth estimation algorithms is the difficulty in obtaining high-quality depth data that corresponds to collected RGB images. Collecting this data is time-consuming and costly, and even data collected by modern sensors has limited range or resolution, and is subject to inconsistencies and noise. To combat this, we propose a method of data generation in simulation using 3D synthetic environments and CycleGAN domain transfer. We compare this method of data generation to the popular NYUDepth V2 dataset by training a depth estimation model based on the DenseDepth structure using different training sets of real and simulated data. We evaluate the performance of the models on newly collected images and LiDAR depth data from a Husky robot to verify the generalizability of the approach and show that GAN-transformed data can serve as an effective alternative to real-world data, particularly in depth estimation.
Submitted 2 May, 2024;
originally announced May 2024.
-
Distributed Federated Learning-Based Deep Learning Model for Privacy MRI Brain Tumor Detection
Authors:
Lisang Zhou,
Meng Wang,
Ning Zhou
Abstract:
Distributed training can facilitate the processing of large medical image datasets, and improve the accuracy and efficiency of disease diagnosis while protecting patient privacy, which is crucial for achieving efficient medical image analysis and accelerating medical research progress. This paper presents an innovative approach to medical image classification, leveraging Federated Learning (FL) to address the dual challenges of data privacy and efficient disease diagnosis. Traditional Centralized Machine Learning models, despite their widespread use in medical imaging for tasks such as disease diagnosis, raise significant privacy concerns due to the sensitive nature of patient data. As an alternative, FL emerges as a promising solution by allowing the training of a collective global model across local clients without centralizing the data, thus preserving privacy. Focusing on the application of FL in Magnetic Resonance Imaging (MRI) brain tumor detection, this study demonstrates the effectiveness of the Federated Learning framework coupled with EfficientNet-B0 and the FedAvg algorithm in enhancing both privacy and diagnostic accuracy. Through a meticulous selection of preprocessing methods, algorithms, and hyperparameters, and a comparative analysis of various Convolutional Neural Network (CNN) architectures, the research uncovers optimal strategies for image classification. The experimental results reveal that EfficientNet-B0 outperforms other models like ResNet in handling data heterogeneity and achieving higher accuracy and lower loss, highlighting the potential of FL in overcoming the limitations of traditional models. The study underscores the significance of addressing data heterogeneity and proposes further research directions for broadening the applicability of FL in medical image analysis.
Submitted 15 April, 2024;
originally announced April 2024.
-
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Authors:
Leying Zhang,
Yao Qian,
Long Zhou,
Shujie Liu,
Dongmei Wang,
Xiaofei Wang,
Midia Yousefi,
Yanmin Qian,
Jinyu Li,
Lei He,
Sheng Zhao,
Michael Zeng
Abstract:
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.
Submitted 29 May, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
WavLLM: Towards Robust and Adaptive Speech Large Language Model
Authors:
Shujie Hu,
Long Zhou,
Shujie Liu,
Sanyuan Chen,
Hongkun Hao,
Jing Pan,
Xunying Liu,
Jinyu Li,
Sunit Sivasankaran,
Linquan Liu,
Furu Wei
Abstract:
The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech, and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks such as combinations of the elementary tasks. To enhance the flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second advanced multi-task training stage. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks on the same model size, exhibiting robust generalization capabilities in executing complex tasks using CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The codes, models, audio, and Gaokao evaluation set can be accessed at aka.ms/wavllm.
Submitted 31 March, 2024;
originally announced April 2024.
-
Exploring Fairness for FAS-assisted Communication Systems: from NOMA to OMA
Authors:
Junteng Yao,
Liaoshi Zhou,
Tuo Wu,
Ming Jin,
Cunhua Pan,
Maged Elkashlan,
Kai-Kit Wong
Abstract:
This paper addresses the fairness issue within fluid antenna system (FAS)-assisted non-orthogonal multiple access (NOMA) and orthogonal multiple access (OMA) systems, where a single fixed-antenna base station (BS) transmits superposition-coded signals to two users, each with a single fluid antenna. We define fairness through the minimization of the maximum outage probability for the two users, under total resource constraints for both FAS-assisted NOMA and OMA systems. Specifically, in the FAS-assisted NOMA systems, we study both a special case and the general case, deriving a closed-form solution for the former and applying a bisection search method to find the optimal solution for the latter. Moreover, for the general case, we derive a locally optimal closed-form solution to achieve fairness. In the FAS-assisted OMA systems, to deal with the non-convex optimization problem with coupling of the variables in the objective function, we employ an approximation strategy to facilitate a successive convex approximation (SCA)-based algorithm, achieving locally optimal solutions for both cases. Empirical analysis validates that our proposed solutions outperform conventional NOMA and OMA benchmarks in terms of fairness.
Submitted 1 March, 2024;
originally announced March 2024.
-
GarchingSim: An Autonomous Driving Simulator with Photorealistic Scenes and Minimalist Workflow
Authors:
Liguo Zhou,
Yinglei Song,
Yichao Gao,
Zhou Yu,
Michael Sodamin,
Hongshen Liu,
Liang Ma,
Lian Liu,
Hao Liu,
Yang Liu,
Haichuan Li,
Guang Chen,
Alois Knoll
Abstract:
Conducting real road testing for autonomous driving algorithms can be expensive and sometimes impractical, particularly for small startups and research institutes. Thus, simulation becomes an important method for evaluating these algorithms. However, the availability of free and open-source simulators is limited, and the installation and configuration process can be daunting for beginners and interdisciplinary researchers. We introduce an autonomous driving simulator with photorealistic scenes, while keeping a user-friendly workflow. The simulator is able to communicate with external algorithms through ROS2 or Socket.IO, making it compatible with existing software stacks. Furthermore, we implement a highly accurate vehicle dynamics model within the simulator to enhance the realism of the vehicle's physical effects. The simulator is able to serve various functions, including generating synthetic data and driving with machine learning-based algorithms. Moreover, we prioritize simplicity in the deployment process, ensuring that beginners find it approachable and user-friendly.
Submitted 30 January, 2024; v1 submitted 28 January, 2024;
originally announced January 2024.
-
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Authors:
Hongkun Hao,
Long Zhou,
Shujie Liu,
Jinyu Li,
Shujie Hu,
Rui Wang,
Furu Wei
Abstract:
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision. Nevertheless, most of the previous work focuses on prompting LLMs with perception abilities like auditory comprehension, and the effective approach for augmenting LLMs with speech synthesis capabilities remains ambiguous. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder. Experimental results show that, using LoRA method to fine-tune LLMs directly to boost the speech synthesis capability does not work well, and superposed LLMs and VALL-E can improve the quality of generated speech both in speaker similarity and word error rate (WER). Among these three methods, coupled methods leveraging LLMs as the text encoder can achieve the best performance, making it outperform original speech synthesis models with a consistently better speaker similarity and a significant (10.9%) WER reduction.
Submitted 30 December, 2023;
originally announced January 2024.
-
From OTFS to DD-ISAC: Integrating Sensing and Communications in the Delay Doppler Domain
Authors:
Weijie Yuan,
Lin Zhou,
Saeid K. Dehkordi,
Shuangyang Li,
Pingzhi Fan,
Giuseppe Caire,
H. Vincent Poor
Abstract:
Next-generation vehicular networks are expected to provide the capability of robust environmental sensing in addition to reliable communications to meet intelligence requirements. A promising solution is the integrated sensing and communication (ISAC) technology, which performs both functionalities using the same spectrum and hardware resources. Most existing works on ISAC consider the Orthogonal Frequency Division Multiplexing (OFDM) waveform. Nevertheless, vehicle motion introduces Doppler shift, which breaks the subcarrier orthogonality and leads to performance degradation. The recently proposed Orthogonal Time Frequency Space (OTFS) modulation, which exploits various advantages of Delay Doppler (DD) channels, has been shown to support reliable communication in high-mobility scenarios. Moreover, the DD waveform can directly interact with radar sensing parameters, which are actually delay and Doppler shifts. This paper investigates the advantages of applying the DD communication waveform to ISAC. Specifically, we first provide a comprehensive overview of implementing DD communications, based on which several advantages of DD-ISAC over OFDM-based ISAC are revealed, including transceiver designs and the ambiguity function. Furthermore, a detailed performance comparison is presented, where the target detection probability and the mean squared error (MSE) performance are also studied. Finally, some challenges and opportunities of DD-ISAC are also provided.
Submitted 26 November, 2023;
originally announced November 2023.
-
Meta-DSP: A Meta-Learning Approach for Data-Driven Nonlinear Compensation in High-Speed Optical Fiber Systems
Authors:
Xinyu Xiao,
Zhennan Zhou,
Bin Dong,
Dingjiong Ma,
Li Zhou,
Jie Sun
Abstract:
Non-linear effects in long-haul, high-speed optical fiber systems significantly hinder channel capacity. While the Digital Backward Propagation algorithm (DBP) with adaptive filter (ADF) can mitigate these effects, it suffers from an overwhelming computational complexity. Recent solutions have incorporated deep neural networks in a data-driven strategy to alleviate this complexity in the DBP model. However, these models are often limited to a specific symbol rate and channel number, necessitating retraining for different settings, and their performance declines significantly under high-speed and high-power conditions. We introduce Meta-DSP, a novel data-driven nonlinear compensation model based on meta-learning that processes multi-modal data across diverse transmission rates, power levels, and channel numbers. This not only enhances signal quality but also substantially reduces the complexity of the nonlinear processing algorithm. Our model delivers a 0.7 dB increase in the Q-factor over Electronic Dispersion Compensation (EDC), and compared to DBP, it curtails computational complexity by a factor of ten while retaining comparable performance. From the perspective of the entire signal processing system, the core idea of Meta-DSP can be employed in any segment of the overall communication system to enhance the model's scalability and generalization performance. Our research substantiates Meta-DSP's proficiency in addressing the critical parameters defining optical communication networks.
Submitted 17 November, 2023;
originally announced November 2023.
-
Nonsmooth-Optimization-Based Bandwidth Optimal Control for Precision Motion Systems
Authors:
Jingjie Wu,
Lei Zhou
Abstract:
Precision motion systems are at the core of various manufacturing equipment. The rapidly increasing demand for higher productivity necessitates higher control bandwidth in the motion systems to effectively reject disturbances while maintaining excellent positioning accuracy. However, most existing optimal control methods do not explicitly optimize for control bandwidth, and the classic loop-shaping method suffers from conservative designs and fails to address cross-couplings, which motivates the development of new control solutions for bandwidth optimization. This paper proposes a novel bandwidth optimal control formulation based on nonsmooth optimization for precision motion systems. Our proposed method explicitly optimizes the system's MIMO control bandwidth while constraining the H-infinity norm of the closed-loop sensitivity function for robustness. A nonsmooth optimization solver, GRANSO, is used to solve the proposed program, and an augmented quadratic programming (QP)--based descent direction search is proposed to facilitate convergence. Simulation evaluations show that the bandwidth optimal control method can achieve a 23% higher control bandwidth than conventional loop-shaping design, and the QP-based descent direction search can reduce iteration number by 60%, which illustrates the effectiveness and efficiency of the proposed approach.
Submitted 11 November, 2023;
originally announced November 2023.
-
Real-time Neonatal Chest Sound Separation using Deep Learning
Authors:
Yang Yi Poh,
Ethan Grooby,
Kenneth Tan,
Lindsay Zhou,
Arrabella King,
Ashwin Ramanathan,
Atul Malhotra,
Mehrtash Harandi,
Faezeh Marzbanrad
Abstract:
Auscultation for neonates is a simple and non-invasive method of providing diagnosis for cardiovascular and respiratory disease. Such diagnosis often requires high-quality heart and lung sounds to be captured during auscultation. However, in most cases, obtaining such high-quality sounds is non-trivial due to the chest sounds containing a mixture of heart, lung, and noise sounds. As such, additional preprocessing is needed to separate the chest sounds into heart and lung sounds. This paper proposes a novel deep-learning approach to separate such chest sounds into heart and lung sounds. Inspired by the Conv-TasNet model, the proposed model has an encoder, decoder, and mask generator. The encoder consists of a 1D convolution model and the decoder consists of a transposed 1D convolution. The mask generator is constructed using stacked 1D convolutions and transformers. The proposed model outperforms previous methods in terms of objective distortion measures by 2.01 dB to 5.06 dB in the artificial dataset, as well as computation time, with at least a 17-time improvement. Therefore, our proposed model could be a suitable preprocessing step for any phonocardiogram-based health monitoring system.
Submitted 25 October, 2023;
originally announced October 2023.
-
LoMAE: Low-level Vision Masked Autoencoders for Low-dose CT Denoising
Authors:
Dayang Wang,
Yongshun Xu,
Shuo Han,
Zhan Wu,
Li Zhou,
Bahareh Morovati,
Hengyong Yu
Abstract:
Low-dose computed tomography (LDCT) offers reduced X-ray radiation exposure but at the cost of compromised image quality, characterized by increased noise and artifacts. Recently, transformer models emerged as a promising avenue to enhance LDCT image quality. However, the success of such models relies on a large amount of paired noisy and clean images, which are often scarce in clinical settings. In the fields of computer vision and natural language processing, masked autoencoders (MAE) have been recognized as an effective label-free self-pretraining method for transformers, due to their exceptional feature representation ability. However, the original pretraining and fine-tuning design fails to work in low-level vision tasks like denoising. In response to this challenge, we redesign the classical encoder-decoder learning model and facilitate a simple yet effective low-level vision MAE, referred to as LoMAE, tailored to address the LDCT denoising problem. Moreover, we introduce an MAE-GradCAM method to shed light on the latent learning mechanisms of the MAE/LoMAE. Additionally, we explore the LoMAE's robustness and generalizability across a variety of noise levels. Experimental results show that the proposed LoMAE can enhance the transformer's denoising performance and greatly relieve the dependence on the ground truth clean data. It also demonstrates remarkable robustness and generalizability over a spectrum of noise levels.
Submitted 18 October, 2023;
originally announced October 2023.
-
Energy-Aware Routing Algorithm for Mobile Ground-to-Air Charging
Authors:
Bill Cai,
Fei Lu,
Lifeng Zhou
Abstract:
We investigate the problem of energy-constrained planning for a cooperative system of an Unmanned Ground Vehicle (UGV) and an Unmanned Aerial Vehicle (UAV). In scenarios where the UGV serves as a mobile base to ferry the UAV and as a charging station to recharge the UAV, we formulate a novel energy-constrained routing problem. To tackle this problem, we design an energy-aware routing algorithm, aiming to minimize the overall mission duration under the energy limitations of both vehicles. The algorithm first solves a Traveling Salesman Problem (TSP) to generate a guided tour. Then, it employs the Monte-Carlo Tree Search (MCTS) algorithm to refine the tour and generate paths for the two vehicles. We evaluate the performance of our algorithm through extensive simulations and a proof-of-concept experiment. The results show that our algorithm consistently achieves near-optimal mission time and maintains fast running time across a wide range of problem instances.
Submitted 29 September, 2023;
originally announced October 2023.
-
The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning
Authors:
Lillian Zhou,
Yuxin Ding,
Mingqing Chen,
Harry Zhang,
Rohit Prabhavalkar,
Dhruv Guliani,
Giovanni Motta,
Rajiv Mathews
Abstract:
Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech. As language evolves and new terms come into use, these models can become outdated and stale. In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting. In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.
Submitted 30 November, 2023; v1 submitted 29 September, 2023;
originally announced October 2023.
-
Transcending the Acceleration-Bandwidth Trade-off: Lightweight Precision Stages with Active Control of Flexible Dynamics
Authors:
Jingjie Wu,
Lei Zhou
Abstract:
Micro/Nano-positioning stages are of great importance in a wide range of manufacturing machines and instruments. In recent years, the drastically growing demand for higher throughput and reduced power consumption in various IC manufacturing equipment calls for the development of next-generation precision positioning systems with unprecedented acceleration capability while maintaining exceptional positioning accuracy and high control bandwidth. Reducing the stage's weight is an effective approach to achieving this goal. However, reducing a stage's weight tends to decrease its structural resonance frequency, which limits the closed-loop control bandwidth and can even cause stability issues. Aiming to overcome the aforementioned challenge and thus create new lightweight precision stages with substantially improved acceleration capability without sacrificing stage control performance, this research presents a novel sequential structure and control design framework for lightweight stages with low-frequency flexible modes of the stage being actively controlled. Additional actuators and sensors are placed to actively control the flexible structural dynamics of the lightweight stage to attain high control bandwidth. A case study is simulated to evaluate the effectiveness of the proposed approach, where a stage weight reduction of 24% is demonstrated compared to a baseline case, which demonstrates the potential of the proposed design framework. Experimental evaluation of the designed stage's motion performance will be performed on a magnetically levitated linear motor platform for performance demonstration.
Submitted 25 September, 2023;
originally announced September 2023.
-
Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction
Authors:
Leying Zhang,
Yao Qian,
Linfeng Yu,
Heming Wang,
Xinkai Wang,
Hemin Yang,
Long Zhou,
Shujie Liu,
Yanmin Qian,
Michael Zeng
Abstract:
Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion in terms of speech perception quality. On the other hand, generative approaches, particularly diffusion-based methods, can enhance speech quality perceptually but suffer from slower inference speed. We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM) that can regenerate and optimize speech quality based on pre-processed speech from a discriminative model. Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks. Audio examples are available online (https://vivian556123.github.io/dcem).
Submitted 25 September, 2023;
originally announced September 2023.
-
FleXstage: Lightweight Magnetically Levitated Precision Stage with Over-Actuation towards High-Throughput IC Manufacturing
Authors:
Jingjie Wu,
Lei Zhou
Abstract:
Precision motion stages play a critical role in various manufacturing and inspection equipment, for example, the wafer/reticle scanning in photolithography scanners and positioning stages in wafer inspection systems. To meet the growing demand for higher throughput in chip manufacturing and inspection, it is critical to create new precision motion stages with higher acceleration capability with high control bandwidth, which calls for the development of lightweight precision stages. However, in today's precision motion systems, only the rigid body motion of the system is under control, and the flexible dynamic systems are in open loop. For these systems, the motion control bandwidth is limited by the first structural resonance frequency of the stage, which enforces a fundamental trade-off between the stage's bandwidth and acceleration capability. Aiming to overcome this trade-off, we have introduced a sequential structure and control design framework for lightweight stages with the low-frequency flexible modes of the stage under active control. To facilitate the controller design, we further propose to minimize the resonance frequency of the stage's mode being controlled and to maximize the resonance frequency of the uncontrolled mode. The system's control bandwidth is placed in between the resonance frequencies. This paper presents the design, optimization, building, and experimental evaluations for a lightweight magnetically levitated planar stage, which we call FleXstage, with first flexible mode actively controlled via over-actuation. Simulations show the proposed design is highly promising in enabling lightweight stages without sacrificing control bandwidth. We have some preliminary results now and are still working on the experimental evaluations for the closed-loop system, and will present the results in the oral presentation.
Submitted 20 September, 2023;
originally announced September 2023.
-
PINN-based viscosity solution of HJB equation
Authors:
Tianyu Liu,
Steven Ding,
Jiarui Zhang,
Liutao Zhou
Abstract:
This paper proposes a novel PINN-based viscosity solution for HJB equations. Although existing work uses PINNs to solve HJB equations, none of it gives the solution in the viscosity sense. This paper shows that, by using a convex neural network, one can guarantee the viscosity solution, so that the neural network can easily converge to the true solution of the HJB equation regardless of the starting point.
Submitted 18 September, 2023;
originally announced September 2023.
-
Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction
Authors:
Zeyu Han,
Yuhan Wang,
Luping Zhou,
Peng Wang,
Binyu Yan,
Jiliu Zhou,
Yan Wang,
Dinggang Shen
Abstract:
To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is the generative adversarial network (GAN), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternative due to their improved sample quality and higher log-likelihood scores compared to GANs. Despite this, DPMs suffer from two major drawbacks in real clinical settings, i.e., the computationally expensive sampling process and the insufficient preservation of correspondence between the conditioning LPET image and the reconstructed PET (RPET) image. To address the above limitations, this paper presents a coarse-to-fine PET reconstruction framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM). The CPM generates a coarse PET image via a deterministic process, and the IRM samples the residual iteratively. By delegating most of the computational overhead to the CPM, the overall sampling speed of our method can be significantly improved. Furthermore, two additional strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion strategy, are proposed and integrated into the reconstruction process, which can enhance the correspondence between the LPET image and the RPET image, further improving clinical reliability. Extensive experiments on two human brain PET datasets demonstrate that our method outperforms the state-of-the-art PET reconstruction methods. The source code is available at https://github.com/Show-han/PET-Reconstruction.
Submitted 20 August, 2023;
originally announced August 2023.
-
DiVa: An Iterative Framework to Harvest More Diverse and Valid Labels from User Comments for Music
Authors:
Hongru Liang,
Jingyao Liu,
Yuanxin Xiang,
Jiachen Du,
Lanjun Zhou,
Shushen Pan,
Wenqiang Lei
Abstract:
Towards sufficient music searching, it is vital to form a complete set of labels for each song. However, current solutions fail to resolve it as they cannot produce diverse enough mappings to make up for the information missed by the gold labels. Based on the observation that such missing information may already be present in user comments, we propose to study the automated music labeling in an essential but under-explored setting, where the model is required to harvest more diverse and valid labels from the users' comments given limited gold labels. To this end, we design an iterative framework (DiVa) to harvest more $\underline{\text{Di}}$verse and $\underline{\text{Va}}$lid labels from user comments for music. The framework makes a classifier able to form complete sets of labels for songs via pseudo-labels inferred from pre-trained classifiers and a novel joint score function. The experiment on a densely annotated testing set reveals the superiority of DiVa over state-of-the-art solutions in producing more diverse labels missed by the gold labels. We hope our work can inspire future research on automated music labeling.
Submitted 9 August, 2023;
originally announced August 2023.
-
Emotion-Guided Music Accompaniment Generation Based on Variational Autoencoder
Authors:
Qi Wang,
Shubing Zhang,
Li Zhou
Abstract:
Music accompaniment generation is a crucial aspect in the composition process. Deep neural networks have made significant strides in this field, but it remains a challenge for AI to effectively incorporate human emotions to create beautiful accompaniments. Existing models struggle to effectively characterize human emotions within neural network models while composing music. To address this issue, we propose the use of an easy-to-represent emotion flow model, the Valence/Arousal Curve, which allows for the compatibility of emotional information within the model through data transformation and enhances interpretability of emotional factors by utilizing a Variational Autoencoder as the model structure. Further, we used relative self-attention to maintain the structure of the music at music phrase level and to generate a richer accompaniment when combined with the rules of music theory.
Submitted 8 July, 2023;
originally announced July 2023.
-
On decoder-only architecture for speech-to-text and large language model integration
Authors:
Jian Wu,
Yashesh Gaur,
Zhuo Chen,
Long Zhou,
Yimeng Zhu,
Tianrui Wang,
Jinyu Li,
Shujie Liu,
Bo Ren,
Linquan Liu,
Yu Wu
Abstract:
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
Submitted 2 October, 2023; v1 submitted 8 July, 2023;
originally announced July 2023.
-
PND-Net: Physics based Non-local Dual-domain Network for Metal Artifact Reduction
Authors:
Jinqiu Xia,
Yiwen Zhou,
Hailong Wang,
Wenxin Deng,
Jing Kang,
Wangjiang Wu,
Mengke Qi,
Linghong Zhou,
Jianhui Ma,
Yuan Xu
Abstract:
Metal artifacts caused by the presence of metallic implants tremendously degrade the reconstructed computed tomography (CT) image quality, affecting clinical diagnosis or reducing the accuracy of organ delineation and dose calculation in radiotherapy. Recently, deep learning methods in sinogram and image domains have been rapidly applied to the metal artifact reduction (MAR) task. The supervised dual-domain methods perform well on synthesized data, while unsupervised methods with unpaired data are more generalized on clinical data. However, most existing methods intend to restore the corrupted sinogram within metal trace, which essentially removes beam hardening artifacts but ignores other components of metal artifacts, such as scatter, non-linear partial volume effect and noise. In this paper, we mathematically derive a physical property of metal artifacts which is verified via Monte Carlo (MC) simulation and propose a novel physics based non-local dual-domain network (PND-Net) for MAR in CT imaging. Specifically, we design a novel non-local sinogram decomposition network (NSD-Net) to acquire the weighted artifact component, and an image restoration network (IR-Net) is proposed to reduce the residual and secondary artifacts in the image domain. To facilitate the generalization and robustness of our method on clinical CT images, we employ a trainable fusion network (F-Net) in the artifact synthesis path to achieve unpaired learning. Furthermore, we design an internal consistency loss to ensure the integrity of anatomical structures in the image domain, and introduce the linear interpolation sinogram as prior knowledge to guide sinogram decomposition. Extensive experiments on simulation and clinical data demonstrate that our method outperforms the state-of-the-art MAR methods.
Submitted 28 May, 2023;
originally announced May 2023.
-
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Authors:
Tianrui Wang,
Long Zhou,
Ziqiang Zhang,
Yu Wu,
Shujie Liu,
Yashesh Gaur,
Zhuo Chen,
Jinyu Li,
Furu Wei
Abstract:
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a con…
▽ More
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language model task via multi-task learning framework. To accomplish this, we first convert all the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In such a way, all these tasks are converted to token-based sequence conversion problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks. Experimental results demonstrate that the proposed VioLA model can support both single-modal and cross-modal tasks well, and the decoder-only model achieves a comparable and even better performance than the strong baselines.
Submitted 25 May, 2023;
originally announced May 2023.
-
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Authors:
Chenyang Le,
Yao Qian,
Long Zhou,
Shujie Liu,
Yanmin Qian,
Michael Zeng,
Xuedong Huang
Abstract:
Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
Submitted 14 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model
Authors:
Juexiao Zhou,
Xiaonan He,
Liyuan Sun,
Jiannan Xu,
Xiuying Chen,
Yuetan Chu,
Longxi Zhou,
Xingyu Liao,
Bin Zhang,
Xin Gao
Abstract:
Skin and subcutaneous diseases rank high among the leading contributors to the global burden of nonfatal diseases, impacting a considerable portion of the population. Nonetheless, the field of dermatology diagnosis faces three significant hurdles. Firstly, there is a shortage of dermatologists accessible to diagnose patients, particularly in rural regions. Secondly, accurately interpreting skin disease images poses a considerable challenge. Lastly, generating patient-friendly diagnostic reports is usually a time-consuming and labor-intensive task for dermatologists. To tackle these challenges, we present SkinGPT-4, which is the world's first interactive dermatology diagnostic system powered by an advanced visual large language model. SkinGPT-4 leverages a fine-tuned version of MiniGPT-4, trained on an extensive collection of skin disease images (comprising 52,929 publicly available and proprietary images) along with clinical concepts and doctors' notes. We designed a two-step training process to allow SkinGPT to express medical features in skin disease images with natural language and make accurate diagnoses of the types of skin diseases. With SkinGPT-4, users could upload their own skin photos for diagnosis, and the system could autonomously evaluate the images, identify the characteristics and categories of the skin conditions, perform in-depth analysis, and provide interactive treatment recommendations. Meanwhile, SkinGPT-4's local deployment capability and commitment to user privacy also render it an appealing choice for patients in search of a dependable and precise diagnosis of their skin ailments. To demonstrate the robustness of SkinGPT-4, we conducted quantitative evaluations on 150 real-life cases, which were independently reviewed by certified dermatologists, and showed that SkinGPT-4 could provide accurate diagnoses of skin diseases.
Submitted 8 June, 2023; v1 submitted 20 April, 2023;
originally announced April 2023.
-
AMPLE: An Adaptive Multiple Path Loss Exponent Radio Propagation Model Considering Environmental Factors
Authors:
Lingyou Zhou,
Jie Zhang,
Jiliang Zhang,
Oktay Cetinkaya,
Steve Jubb
Abstract:
We present AMPLE -- a novel multiple path loss exponent (PLE) radio propagation model that can adapt to different environmental factors. The proposed model aims at accurately predicting path loss with low computational complexity considering environmental factors. In the proposed model, the scenario under consideration is classified into regions from a raster map, and each type of region is assigned a PLE. The path loss is then computed based on a direct path between the transmitter (Tx) and receiver (Rx), which records the intersected regions and the weighted region path loss. To regress the model, the parameters, including PLEs, are extracted via measurement and the region map. We also verify the model in a suburban area. To the best of our knowledge, this is the first time that a multi-slope model precisely maps PLEs and region types. Besides, this model can be integrated into map systems by creating a new path loss attribute for digital maps.
Submitted 20 August, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
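As a rough illustration of the region-based idea in the AMPLE abstract, the sketch below accumulates path loss along a direct Tx-Rx path that crosses several regions, each with its own path loss exponent. The piecewise log-distance form, the function name `multi_ple_path_loss`, the reference loss `pl_ref_db`, and the example PLE values are assumptions made for illustration; the abstract does not state the model's actual equations.

```python
import math


def multi_ple_path_loss(segments, pl_ref_db=40.0, d_ref=1.0):
    """Piecewise log-distance path loss along a direct Tx-Rx path.

    segments:  list of (length_m, ple) pairs in the order the path crosses
               the regions, starting at the transmitter (hypothetical input).
    pl_ref_db: assumed path loss in dB at the reference distance d_ref.
    """
    loss_db = pl_ref_db
    d_start = d_ref
    for length_m, ple in segments:
        d_end = d_start + length_m
        # Each region contributes 10 * n * log10(d_end / d_start) dB,
        # where n is the PLE assigned to that region type.
        loss_db += 10.0 * ple * math.log10(d_end / d_start)
        d_start = d_end
    return loss_db


# Example: 9 m through a region with PLE 2.0, then 90 m with PLE 3.0.
example_loss = multi_ple_path_loss([(9.0, 2.0), (90.0, 3.0)])
```

With a single PLE for every segment this reduces to the ordinary single-slope log-distance model, which is one way to sanity-check the accumulation.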
-
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Authors:
Ziqiang Zhang,
Long Zhou,
Chengyi Wang,
Sanyuan Chen,
Yu Wu,
Shujie Liu,
Zhuo Chen,
Yanqing Liu,
Huaming Wang,
Jinyu Li,
Lei He,
Sheng Zhao,
Furu Wei
Abstract:
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.
Submitted 7 March, 2023;
originally announced March 2023.
-
Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training
Authors:
Eric Sun,
Jinyu Li,
Yuxuan Hu,
Yimeng Zhu,
Long Zhou,
Jian Xue,
Peidong Wang,
Linquan Liu,
Shujie Liu,
Edward Lin,
Yifan Gong
Abstract:
We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts in improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to the upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals similar advantages to those observed in the bilingual models, highlighting its ease of extension to multiple languages.
Submitted 7 July, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Attention Mechanism for Contrastive Learning in GAN-based Image-to-Image Translation
Authors:
Hanzhen Zhang,
Liguo Zhou,
Ruining Wang,
Alois Knoll
Abstract:
Using real road testing to optimize autonomous driving algorithms is time-consuming and capital-intensive. To solve this problem, we propose a GAN-based model that is capable of generating high-quality images across different domains. We further leverage Contrastive Learning to train the model in a self-supervised way using image data acquired in the real world using real sensors and simulated images from 3D games. In this paper, we also apply an Attention Mechanism module to emphasize features that contain more information about the source domain according to their measurement of significance. Finally, the generated images are used as datasets to train neural networks to perform a variety of downstream tasks to verify that the approach can fill in the gaps between the virtual and real worlds.
Submitted 23 February, 2023;
originally announced February 2023.
-
Bridging Synthetic and Real Images: a Transferable and Multiple Consistency aided Fundus Image Enhancement Framework
Authors:
Erjian Guo,
Huazhu Fu,
Luping Zhou,
Dong Xu
Abstract:
Deep learning based image enhancement models have largely improved the readability of fundus images in order to decrease the uncertainty of clinical observations and the risk of misdiagnosis. However, due to the difficulty of acquiring paired real fundus images at different qualities, most existing methods have to adopt synthetic image pairs as training data. The domain shift between the synthetic and the real images inevitably hinders the generalization of such models on clinical data. In this work, we propose an end-to-end optimized teacher-student framework to simultaneously conduct image enhancement and domain adaptation. The student network uses synthetic pairs for supervised enhancement, and regularizes the enhancement model to reduce domain shift by enforcing teacher-student prediction consistency on the real fundus images without relying on enhanced ground truth. Moreover, we also propose a novel multi-stage multi-attention guided enhancement network (MAGE-Net) as the backbone of our teacher and student networks. Our MAGE-Net utilizes a multi-stage enhancement module and a retinal structure preservation module to progressively integrate the multi-scale features and simultaneously preserve the retinal structures for better fundus image quality enhancement. Comprehensive experiments on both real and synthetic datasets demonstrate that our framework outperforms the baseline approaches. Moreover, our method also benefits the downstream clinical tasks.
Submitted 23 February, 2023;
originally announced February 2023.
-
Personalized and privacy-preserving federated heterogeneous medical image analysis with PPPML-HMI
Authors:
Juexiao Zhou,
Longxi Zhou,
Di Wang,
Xiaopeng Xu,
Haoyang Li,
Yuetan Chu,
Wenkai Han,
Xin Gao
Abstract:
Heterogeneous data is endemic due to the use of diverse models and settings of devices by hospitals in the field of medical imaging. However, there are few open-source frameworks for federated heterogeneous medical image analysis that provide personalization and privacy protection simultaneously without requiring changes to existing model structures or the sharing of any private data. In this paper, we proposed PPPML-HMI, an open-source learning paradigm for personalized and privacy-preserving federated heterogeneous medical image analysis. To the best of our knowledge, personalization and privacy protection were achieved simultaneously for the first time under the federated scenario by integrating the PerFedAvg algorithm and designing our novel cyclic secure aggregation with the homomorphic encryption algorithm. To show the utility of PPPML-HMI, we applied it to a simulated classification task, namely the classification of healthy people and patients from the RAD-ChestCT Dataset, and one real-world segmentation task, namely the segmentation of lung infections from COVID-19 CT scans. For the real-world task, PPPML-HMI achieved ~5% higher Dice score on average compared to conventional FL under the heterogeneous scenario. Meanwhile, we applied the improved deep leakage from gradients to simulate adversarial attacks and showed the solid privacy-preserving capability of PPPML-HMI. By applying PPPML-HMI to both tasks with different neural networks, a varied number of users, and sample sizes, we further demonstrated the strong robustness of PPPML-HMI.
Submitted 20 February, 2023;
originally announced February 2023.
-
Sequential Structure and Control Co-design of Lightweight Precision Stages with Active control of flexible modes
Authors:
Jingjie Wu,
Lei Zhou
Abstract:
Precision motion stages are playing a prominent role in various manufacturing equipment. The drastically increasing demand for higher throughput in integrated circuit (IC) manufacturing and inspection calls for the next-generation precision stages that have light weight and high control bandwidth simultaneously. In today's design techniques, the stage's first flexible mode is limiting its achievable control bandwidth, which enforces a trade-off between the stage's acceleration and closed-loop stiffness and thus limits the system's overall performance. To overcome this challenge, this paper proposes a new hardware design and control framework for lightweight precision motion stages with the stage's low-frequency flexible modes actively controlled. Our method proposes to minimize the resonance frequency of the controlled mode to reduce the stage's weight, and to maximize that of the uncontrolled mode to enable high control bandwidth. In addition, the proposed framework determines the placement of the actuators and sensors to maximize the controllability/observability of the stage's controlled flexible mode while minimizing that of the uncontrolled mode, which effectively simplifies the controller designs. Two case studies are used to evaluate the effectiveness of the proposed framework. Simulation results show that the stage designed using the proposed method has a weight reduction of more than 55% compared to a baseline stage design. Improvement in control bandwidth was also achieved. These results demonstrate the effectiveness of the proposed method in achieving lightweight precision positioning stages with high acceleration, bandwidth, and precision.
Submitted 10 January, 2023;
originally announced January 2023.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Authors:
Chengyi Wang,
Sanyuan Chen,
Yu Wu,
Ziqiang Zhang,
Long Zhou,
Shujie Liu,
Zhuo Chen,
Yanqing Liu,
Huaming Wang,
Jinyu Li,
Lei He,
Sheng Zhao,
Furu Wei
Abstract:
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
Submitted 5 January, 2023;
originally announced January 2023.
-
Provably High-Quality Solutions for the Liquid Medical Oxygen Allocation Problem
Authors:
Lejun Zhou,
Lavanya Marla,
Varun Gupta,
Ankur Mani
Abstract:
Oxygen is an essential life-saving medicine used in several indications at all levels of healthcare. During the COVID-19 pandemic, the demand for liquid medical oxygen (LMO) has increased significantly due to the occurrence of lung infections in many patients. However, many countries and regions are not prepared for the emergence of this phenomenon, and the limited supply of LMO has resulted in unsatisfied usage needs in many regions. In this paper, we formulated a linear programming model with the objective of minimizing the unsatisfied demand given the constraints of supply and transportation capacity. The decision variables are how much LMO should be transferred from one place to another at each time interval using a specific number of vehicles. Multiple storage points are added into the network to allow for more flexible allocation strategies. The proposed model is implemented in India with real-world LMO supply and demand data as a case study. Compared to the manually designed allocation strategy, the proposed model successfully reduces the unsatisfied demand.
Submitted 9 May, 2023; v1 submitted 11 December, 2022;
originally announced December 2022.
-
Protocol selection for second-order consensus against disturbance
Authors:
Jiamin Wang,
Liqi Zhou,
Dong Zhang,
Jian Liu,
Yuanshi Zheng
Abstract:
Noticing that both the absolute and relative velocity protocols can solve the second-order consensus of multi-agent systems, this paper aims to investigate which of the two protocols has better anti-disturbance capability, where the anti-disturbance capability is measured by the L2 gain from the disturbance to the consensus error. More specifically, by the orthogonal transformation technique, the analytic expression of the L2 gain of the second-order multi-agent system with the absolute velocity protocol is first derived, followed by its counterpart with the relative velocity protocol. It is shown that the L2 gains for both the absolute and relative velocity protocols are determined only by the minimum non-zero eigenvalue of the Laplacian matrix and the tunable gains of the state and velocity. Then, we establish the graph conditions to tell which protocol has better anti-disturbance capability. Moreover, we propose a two-step scheme to improve the anti-disturbance capability of second-order multi-agent systems. Finally, simulations are given to illustrate the effectiveness of our findings.
Submitted 10 December, 2022;
originally announced December 2022.
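To make the two protocols named in the abstract concrete, the sketch below simulates double-integrator agents under the textbook forms of the absolute velocity protocol (each agent damps its own velocity) and the relative velocity protocol (each agent damps velocity differences to its neighbours). The three-agent path graph, the gains `k_pos` and `k_vel`, the initial conditions, and the forward-Euler integration are all assumptions chosen for illustration. The sketch is disturbance-free, so it shows consensus only and does not compute the L2 gain studied in the paper.

```python
def simulate(protocol, steps=4000, dt=0.01, k_pos=1.0, k_vel=1.0):
    """Forward-Euler simulation of x'' = u for three agents on a path 0-1-2."""
    neighbors = {0: [1], 1: [0, 2], 2: [1]}  # assumed undirected path graph
    x = [0.0, 2.0, 5.0]    # assumed initial positions
    v = [1.0, 0.0, -1.0]   # assumed initial velocities
    for _ in range(steps):
        u = []
        for i in range(3):
            pos_term = sum(x[i] - x[j] for j in neighbors[i])
            if protocol == "absolute":
                vel_term = v[i]  # damping on the agent's own velocity
            else:
                # "relative": damping on velocity differences to neighbours
                vel_term = sum(v[i] - v[j] for j in neighbors[i])
            u.append(-k_pos * pos_term - k_vel * vel_term)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        v = [vi + dt * ui for vi, ui in zip(v, u)]
    return x, v


def spread(values):
    """Largest disagreement between any two agents."""
    return max(values) - min(values)


x_abs, v_abs = simulate("absolute")
x_rel, v_rel = simulate("relative")
```

In both runs the position spread shrinks toward zero. Under the absolute protocol the velocities themselves also decay, whereas the relative protocol only forces the velocities to agree.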
-
A Four-stage Heuristic Algorithm for Solving On-demand Meal Delivery Routing Problem
Authors:
Lejun Zhou,
Anke Ye,
Simon Hu
Abstract:
Meal delivery services provided by platforms with integrated delivery systems are becoming increasingly popular. This paper adopts a rolling horizon approach to solve the meal delivery routing problem (MDRP). To improve delivery efficiency in scenarios with high delivery demand, multiple orders are allowed to be combined into one bundle with orders from different restaurants. Following this strategy, an optimization-based four-stage heuristic algorithm is developed to generate an optimal routing plan at each decision point. First, the algorithm generates bundles according to the orders' spatial and temporal distribution. Second, it finds feasible bundle pairs. Third, routes for delivering single bundles and multiple bundles are each optimized. Finally, the routes are assigned to available couriers. In computational experiments using instances from open datasets, the system's performance is evaluated with respect to average click-to-door time and ready-to-pickup time. We demonstrate that this algorithm can effectively process real-time information and assign optimal routes to the couriers. A comparison of the proposed method with existing state-of-the-art algorithms indicates that our method can generate solutions with higher service quality and shorter distance.
Submitted 9 May, 2023; v1 submitted 7 December, 2022;
originally announced December 2022.
-
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Authors:
Qiushi Zhu,
Long Zhou,
Ziqiang Zhang,
Shujie Liu,
Binxing Jiao,
Jie Zhang,
Lirong Dai,
Daxin Jiang,
Jinyu Li,
Furu Wei
Abstract:
Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR) tasks. Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
Submitted 19 May, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Spatially Exclusive Pasting: A General Data Augmentation for the Polyp Segmentation
Authors:
Lei Zhou
Abstract:
Automated polyp segmentation technology plays an important role in diagnosing intestinal diseases, such as tumors and precancerous lesions. Previous works have typically trained convolution-based U-Net or Transformer-based neural network architectures with labeled data. However, the available public polyp segmentation datasets are too small to train the network sufficiently, suppressing each network's potential performance. To alleviate this issue, we propose a universal data augmentation technology to synthesize more data from the existing datasets. Specifically, we paste the polyp area into the same image's background in a spatial-exclusive manner to obtain a combinatorial number of new images. Extensive experiments on various networks and datasets show that the proposed method enhances the data efficiency and achieves consistent improvements over baselines. Finally, we hit a new state of the art in this task. We will release the code soon.
Submitted 17 November, 2022; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Acoustic Pornography Recognition Using Convolutional Neural Networks and Bag of Refinements
Authors:
Lifeng Zhou,
Kaifeng Wei,
Yuke Li,
Yiya Hao,
Weiqiang Yang,
Haoqi Zhu
Abstract:
A large number of pornographic audio clips publicly available on the Internet seriously threaten the mental and physical health of children, but such audio is rarely detected and filtered. In this paper, we first propose a convolutional neural network (CNN) based model for acoustic pornography recognition. Then, we study a collection of refinements and verify their effectiveness through ablation studies. Finally, we stack all refinements together to verify whether they can further improve the accuracy of the model. Experimental results on our newly collected large dataset consisting of 224127 pornographic audio clips and 274206 normal samples demonstrate the effectiveness of our proposed model and these refinements. Specifically, the proposed model achieves an accuracy of 92.46%, and the accuracy is further improved to 97.19% when all refinements are combined.
Submitted 10 November, 2022;
originally announced November 2022.
-
ESKNet-An enhanced adaptive selection kernel convolution for breast tumors segmentation
Authors:
Gongping Chen,
Lu Zhou,
Jianxun Zhang,
Xiaotao Yin,
Liang Cui,
Yu Dai
Abstract:
Breast cancer is one of the common cancers that endanger the health of women globally. Accurate target lesion segmentation is essential for early clinical intervention and postoperative follow-up. Recently, many convolutional neural networks (CNNs) have been proposed to segment breast tumors from ultrasound images. However, the complex ultrasound pattern and the variable tumor shape and size bring challenges to the accurate segmentation of the breast lesion. Motivated by the selective kernel convolution, we introduce an enhanced selective kernel convolution for breast tumor segmentation, which integrates multiple feature map region representations and adaptively recalibrates the weights of these feature map regions from the channel and spatial dimensions. This region recalibration strategy enables the network to focus more on high-contributing region features and mitigate the perturbation of less useful regions. Finally, the enhanced selective kernel convolution is integrated into U-net with deep supervision constraints to adaptively capture the robust representation of breast tumors. Extensive experiments with twelve state-of-the-art deep learning segmentation methods on three public breast ultrasound datasets demonstrate that our method has a more competitive segmentation performance in breast ultrasound images.
Submitted 20 January, 2024; v1 submitted 5 November, 2022;
originally announced November 2022.