Search | arXiv e-print repository

BAISeg: Boundary Assisted Weakly Supervised Instance Segmentation

Abstract: How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) via learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), which is a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids; it therefore differs from previous DF-based approaches. In particular, we propose the Cascade Fusion Module (CFM) and the Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employ Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch. This further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, and we achieve considerable performance with only pixel-level annotations. The code will be available at https://github.com/wsis-seg/BAISeg.

Submitted 27 May, 2024; originally announced June 2024.

arXiv:2406.04840 [pdf, other]

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these problems, we propose TraceableSpeech, a novel TTS model that directly generates watermarked speech, improving watermark imperceptibility and speech quality. Furthermore, we design the frame-wise imprinting and extraction of watermarks, achieving higher robustness against resplicing attacks and temporal flexibility in operation. Experimental results show that TraceableSpeech outperforms the strong baseline where VALL-E or HiFicodec individually uses WavMark in watermark imperceptibility, speech quality and resilience against resplicing attacks. It can also be applied to speech of various durations.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.18435 [pdf, other]

QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

Authors: Hongwei Bran Li, Fernando Navarro, Ivan Ezhov, Amirhossein Bayat, Dhritiman Das, Florian Kofler, Suprosanna Shit, Diana Waldmannstetter, Johannes C. Paetzold, Xiaobin Hu, Benedikt Wiestler, Lucas Zimmer, Tamaz Amiranashvili, Chinmay Prabhakar, Christoph Berger, Jonas Weidner, Michelle Alonso-Basant, Arif Rashid, Ujjwal Baid, Wesam Adel, Deniz Ali, Bhakti Baheti, Yingbin Bai, Ishaan Bhatt, Sabri Can Cetindag , et al. (55 additional authors not shown)

Abstract: Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions (2D vs. 3D). A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient methods for uncertainty quantification in 3D segmentation tasks.

Submitted 24 June, 2024; v1 submitted 19 March, 2024; originally announced May 2024.

Comments: initial technical report

arXiv:2405.01170 [pdf, other]

doi 10.1109/TCSVT.2024.3395481

GroupedMixer: An Entropy Model with Group-wise Token-Mixers for Learned Image Compression

Authors: Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu, Wen Gao

Abstract: Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which enjoys both faster coding speed and better compression performance than previous transformer-based methods. Specifically, our approach builds upon group-wise autoregression by first partitioning the latent variables into groups along spatial-channel dimensions, and then entropy coding the groups with the proposed transformer-based entropy model. The global causal self-attention is decomposed into more efficient group-wise interactions, implemented using inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group while the cross-group token-mixer interacts with previously decoded groups. Alternate arrangement of two token-mixers enables global contextual reference. To further expedite the network inference, we introduce context cache optimization to GroupedMixer, which caches attention activation values in cross-group token-mixers and avoids complex and duplicated computation. Experimental results demonstrate that the proposed GroupedMixer yields the state-of-the-art rate-distortion performance with fast compression speed.

Submitted 2 May, 2024; originally announced May 2024.

Comments: Accepted by IEEE TCSVT

arXiv:2404.11275 [pdf, other]

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Authors: Ye Bai, Chenxing Li, Hao Li, Yuanyuan Zhao, Xiaorui Wang

Abstract: In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by ICME 2024

arXiv:2404.06393 [pdf, other]

MuPT: A Generative Symbolic Music Pretrained Transformer

Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan , et al. (4 additional authors not shown)

Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.17392 [pdf, other]

Natural-artificial hybrid swarm: Cyborg-insect group navigation in unknown obstructed soft terrain

Authors: Yang Bai, Phuoc Thanh Tran Ngoc, Huu Duoc Nguyen, Duc Long Le, Quang Huy Ha, Kazuki Kai, Yu Xiang See To, Yaosheng Deng, Jie Song, Naoki Wakamiya, Hirotaka Sato, Masaki Ogura

Abstract: Navigating multi-robot systems in complex terrains has always been a challenging task. This is due to the inherent limitations of traditional robots in collision avoidance, adaptation to unknown environments, and sustained energy efficiency. In order to overcome these limitations, this research proposes a solution by integrating living insects with miniature electronic controllers to enable robotic-like programmable control, and proposing a novel control algorithm for swarming. Although these creatures, called cyborg insects, have the ability to instinctively avoid collisions with neighbors and obstacles while adapting to complex terrains, there is a lack of literature on the control of multi-cyborg systems. This research gap is due to the difficulty in coordinating the movements of a cyborg system under the presence of insects' inherent individual variability in their reactions to control input. In response to this issue, we propose a novel swarm navigation algorithm addressing these challenges. The effectiveness of the algorithm is demonstrated through an experimental validation in which a cyborg swarm was successfully navigated through an unknown sandy field with obstacles and hills. This research contributes to the domain of swarm robotics and showcases the potential of integrating biological organisms with robotics and control theory to create more intelligent autonomous systems with real-world applications.

Submitted 27 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.10585 [pdf, other]

Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint

Authors: Haoyue Tang, Tian Xie, Aosong Feng, Hanyu Wang, Chenyang Zhang, Yang Bai

Abstract: Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task-specific model fine-tuning. To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method by viewing the intermediate noisy images as policies and the target image as the states selected by the policy. Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting in higher image restoration quality on FFHQ, ImageNet and LSUN datasets.

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted and to Appear, AISTATS 2024

arXiv:2401.14007 [pdf, other]

Semantic Ensemble Loss and Latent Refinement for High-Fidelity Neural Image Compression

Authors: Daxin Li, Yuanchao Bai, Kai Wang, Junjun Jiang, Xianming Liu

Abstract: Recent advancements in neural compression have surpassed traditional codecs in PSNR and MS-SSIM measurements. However, at low bit-rates, these methods can introduce visually displeasing artifacts, such as blurring, color shifting, and texture loss, thereby compromising perceptual quality of images. To address these issues, this study presents an enhanced neural compression method designed for optimal visual fidelity. We have trained our model with a sophisticated semantic ensemble loss, integrating Charbonnier loss, perceptual loss, style loss, and a non-binary adversarial loss, to enhance the perceptual quality of image reconstructions. Additionally, we have implemented a latent refinement process to generate content-aware latent codes. These codes adhere to bit-rate constraints, balance the trade-off between distortion and fidelity, and prioritize bit allocation to regions of greater importance. Our empirical findings demonstrate that this approach significantly improves the statistical fidelity of neural image compression. On the CLIC2024 validation set, our approach achieves a 62% bitrate saving compared to MS-ILLM under the FID metric.

Submitted 25 January, 2024; originally announced January 2024.

Comments: 7 pages, 4 figures

arXiv:2310.14270 [pdf, other]

Diffusion-Based Adversarial Purification for Speaker Verification

Authors: Yibo Bai, Xiao-Lei Zhang

Abstract: Automatic speaker verification (ASV) based on deep learning is easily compromised by adversarial attacks, a new type of attack that injects imperceptible perturbations into audio signals so as to make ASV produce wrong decisions. This poses a significant threat to the security and reliability of ASV systems. To address this issue, we propose a Diffusion-Based Adversarial Purification (DAP) method that enhances the robustness of ASV systems against such adversarial attacks. Our method leverages a conditional denoising diffusion probabilistic model to effectively purify the adversarial examples and mitigate the impact of perturbations. DAP first introduces controlled noise into adversarial examples, and then performs a reverse denoising process to reconstruct clean audio. Experimental results demonstrate the efficacy of the proposed DAP in enhancing the security of ASV while minimizing the distortion of the purified audio signals.

Submitted 24 October, 2023; v1 submitted 22 October, 2023; originally announced October 2023.

arXiv:2309.10740 [pdf, other]

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Authors: Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

Abstract: Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.

Submitted 24 June, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.05908 [pdf, other]

Reset Controller Synthesis by Reach-avoid Analysis for Delay Hybrid Systems

Authors: Han Su, Jiyu Zhu, Shenghua Feng, Yunjun Bai, Bin Gu, Jiang Liu, Mengfei Yang, Naijun Zhan

Abstract: A reset controller plays a crucial role in designing hybrid systems. It restricts the initial set and redefines the reset map associated with discrete transitions, in order to guarantee that the system achieves its objective. Reset controller synthesis, together with feedback controller synthesis and switching logic controller synthesis, provides a correct-by-construction approach to designing hybrid systems. However, time-delay is an inevitable factor in hybrid systems, which can degrade control performance and render verification certificates obtained by abstracting away time-delay invalid in practice. In this paper, we investigate this issue in a practical manner by taking time-delay into account. We propose an approach that reduces the synthesis of reset controllers to the generation of reach-avoid sets for the hybrid system under consideration, which can be efficiently solved using off-the-shelf convex optimization solvers.

Submitted 27 May, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

Comments: 15 pages, 10 figures

arXiv:2309.05906 [pdf, other]

Correct-by-Construction for Hybrid Systems by Synthesizing Reset Controller

Authors: Jiang Liu, Han Su, Yunjun Bai, Bin Gu, Bai Xue, Mengfei Yang, Naijun Zhan

Abstract: Controller synthesis, including reset controller, feedback controller, and switching logic controller, provides an essential mechanism to guarantee the correctness and reliability of hybrid systems in a correct-by-construction manner. Unfortunately, reset controller synthesis is still in its infancy in the literature, although it is of theoretical and practical significance. In this paper, we propose a convex programming based method to synthesize reset controllers for polynomial hybrid systems subject to safety, possibly together with liveness. Such a problem essentially corresponds to computing an initial set of continuous states in each mode and a reset map associated with each discrete jump such that, through the computed reset maps, any trajectory starting from any computed initial state stays safe if only safety constraints are given, or eventually reaches the target set and stays safe before that if both safety and liveness are given. Both cases can be reduced to reach-avoid and/or differential invariant generation problems, further encoded as convex optimization problems. Finally, several examples are provided to demonstrate the efficiency and effectiveness of our method.

Submitted 11 September, 2023; originally announced September 2023.

Comments: 26 pages, 8 figures

arXiv:2307.15980 [pdf, other]

Initial State Interventions for Deconfounded Imitation Learning

Authors: Samuel Pfrommer, Yatong Bai, Hyunin Lee, Somayeh Sojoudi

Abstract: Imitation learning suffers from causal confusion. This phenomenon occurs when learned policies attend to features that do not causally influence the expert actions but are instead spuriously correlated. Causally confused agents produce low open-loop supervised loss but poor closed-loop performance upon deployment. We consider the problem of masking observed confounders in a disentangled representation of the observation space. Our novel masking algorithm leverages the usual ability to intervene in the initial system state, avoiding any requirement involving expert querying, expert reward functions, or causal graph specification. Under certain assumptions, we theoretically prove that this algorithm is conservative in the sense that it does not incorrectly mask observations that causally influence the expert; furthermore, intervening on the initial state serves to strictly reduce excess conservatism. The masking algorithm is applied to behavior cloning for two illustrative control systems: CartPole and Reacher.

Submitted 11 August, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

Comments: 62nd IEEE Conference on Decision and Control

arXiv:2306.16710 [pdf]

doi 10.4230/OASIcs.SLATE.2023.7

Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications

Authors: Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Abstract: Voicebots have provided a new avenue for supporting the development of language skills, particularly within the context of second language learning. Voicebots, though, have largely been geared towards native adult speakers. We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. We evaluated their performance on read and extemporaneous speech of native and non-native Dutch children. We also investigated the utility of using ASR technology to provide insight into the children's pronunciation and fluency. The results show that recent, pre-trained ASR transformer-based models achieve acceptable performance from which detailed feedback on phoneme pronunciation quality can be extracted, despite the challenging nature of child and non-native speech.

Submitted 29 June, 2023; originally announced June 2023.

Comments: Published on SLATE 2023, Esmad, Politecnico Do Porto, Portugal, 26-28 June, 2023, pp: 11:1-11:8

Journal ref: 12th Symposium on Languages, Applications and Technologies (SLATE 2023) (p. 7:1-7:8)

arXiv:2306.12020 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095084

Visual-Aware Text-to-Speech

Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

Abstract: Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Different from traditional text-to-speech, VA-TTS highlights the impact of visual modality. On this newly-minted task, we devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis. Extensive experiments on multimodal conversation dataset ViCo-X verify our proposal for generating more natural audio with scenario-appropriate rhythm and prosody.

Submitted 21 June, 2023; originally announced June 2023.

Comments: accepted as oral and top 3% paper by ICASSP 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, 1-5

arXiv:2306.04190 [pdf]

doi 10.1007/978-3-030-87802-3_6

An ASR-Based Tutor for Learning to Read: How to Optimize Feedback to First Graders

Authors: Yu Bai, Cristian Tejedor-Garcia, Ferdy Hubers, Catia Cucchiarini, Helmer Strik

Abstract: The interest in employing automatic speech recognition (ASR) in applications for reading practice has been growing in recent years. In a previous study, we presented an ASR-based Dutch reading tutor application that was developed to provide instantaneous feedback to first-graders learning to read. We saw that ASR has potential at this stage of the reading process, as the results suggested that pupils made progress in reading accuracy and fluency by using the software. In the current study, we used children's speech from an existing corpus (JASMIN) to develop two new ASR systems, and compared the results to those of the previous study. We analyze correct/incorrect classification of the ASR systems using human transcripts at word level, by means of evaluation measures such as Cohen's Kappa, Matthews Correlation Coefficient (MCC), precision, recall and F-measures. We observe improvements for the newly developed ASR systems regarding the agreement with human-based judgment and correct rejection (CR). The accuracy of the ASR systems varies for different reading tasks and word types. Our results suggest that, in the current configuration, it is difficult to classify isolated words. We discuss these results, possible ways to improve our systems and avenues for future research.

Submitted 7 June, 2023; originally announced June 2023.

Comments: Published (double-blind peer-reviewed) on SPECOM 2021

Journal ref: In: Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham
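
The evaluation measures named in the abstract above (Cohen's Kappa, MCC, precision, recall and F-measure) can all be derived from a 2x2 confusion matrix of ASR decisions against human judgments. The sketch below is only an illustration of those standard formulas, not the authors' evaluation code; the function name `agreement_metrics`, the choice of which class counts as "positive", and the example counts are all assumptions.

```python
import math


def agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary agreement measures from confusion counts.

    tp/fp/fn/tn are counts of ASR decisions compared with human judgments.
    Which class is treated as "positive" is an assumption of this sketch.
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = agreement_metrics(tp=40, fp=10, fn=5, tn=45)
print(metrics)
```

The sketch omits guards for empty rows or columns (zero denominators), which a real evaluation script would need.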

arXiv:2306.02982 [pdf, other]

PolyVoice: Language Models for Speech to Speech Translation

Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.

Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2306.01232 [pdf, other]

Deep Reinforcement Learning Framework for Thoracic Diseases Classification via Prior Knowledge Guidance

Authors: Weizhi Nie, Chen Zhang, Dan Song, Lina Zhao, Yunpeng Bai, Keliang Xie, Anan Liu

Abstract: The chest X-ray is often utilized for diagnosing common thoracic diseases. In recent years, many approaches have been proposed to handle the problem of automatic diagnosis based on chest X-rays. However, the scarcity of labeled data for related diseases still poses a huge challenge to an accurate diagnosis. In this paper, we focus on the thorax disease diagnostic problem and propose a novel deep reinforcement learning framework, which introduces prior knowledge to direct the learning of diagnostic agents and whose model parameters can be continuously updated as the data increases, like a person's learning process. Specifically, 1) prior knowledge can be learned from the pre-trained model based on old data or other domains' similar data, which can effectively reduce the dependence on target domain data, and 2) the framework of reinforcement learning can make the diagnostic agent as exploratory as a human being and improve the accuracy of diagnosis through continuous exploration. The method can also effectively solve the model learning problem in the case of few-shot data and improve the generalization ability of the model. Finally, our approach's performance was demonstrated using the well-known NIH ChestX-ray 14 and CheXpert datasets, and we achieved competitive results. The source code can be found here: https://github.com/NeaseZ/MARL.

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.12072 [pdf, other]

Chest X-ray Image Classification: A Causal Perspective

Authors: Weizhi Nie, Chen Zhang, Dan Song, Lina Zhao, Yunpeng Bai, Keliang Xie, Anan Liu

Abstract: The chest X-ray (CXR) is one of the most common and easy-to-get medical tests used to diagnose common diseases of the chest. Recently, many deep learning-based methods have been proposed that are capable of effectively classifying CXRs. Even though these techniques have worked quite well, it is difficult to establish whether what these algorithms actually learn is the cause-and-effect link between diseases and their causes or just how to map labels to photos. In this paper, we propose a causal approach to address the CXR classification problem, which constructs a structural causal model (SCM) and uses the backdoor adjustment to select effective visual information for CXR classification. Specifically, we design different probability optimization functions to eliminate the influence of confounders on the learning of real causality. Experimental results demonstrate that our proposed method outperforms the open-source NIH ChestX-ray14 in terms of classification performance.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.12070 [pdf, other]

Instrumental Variable Learning for Chest X-ray Classification

Authors: Weizhi Nie, Chen Zhang, Dan Song, Yunpeng Bai, Keliang Xie, Anan Liu

Abstract: The chest X-ray (CXR) is commonly employed to diagnose thoracic illnesses, but the challenge of achieving accurate automatic diagnosis through this method persists due to the complex relationship between pathologies. In recent years, various deep learning-based approaches have been suggested to tackle this problem, but confounding factors such as image resolution or noise problems often damage model performance. In this paper, we focus on the chest X-ray classification task and propose an interpretable instrumental variable (IV) learning framework to eliminate the spurious association and obtain an accurate causal representation. Specifically, we first construct a structural causal model (SCM) for our task and learn the confounders and the preliminary representations of IV. We then leverage electronic health records (EHR) as auxiliary information and fuse the above features with our transformer-based semantic fusion module, so that the IV has medical semantics. Meanwhile, the reliability of IV is further guaranteed via the constraints of mutual information between related causal variables. Finally, our approach's performance is demonstrated using the MIMIC-CXR, NIH ChestX-ray 14, and CheXpert datasets, and we achieve competitive results.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.07278 [pdf, ps, other]

Deep Learning for Asynchronous Massive Access with Data Frame Length Diversity

Authors: Yanna Bai, Wei Chen, Bo Ai, Petar Popovski

Abstract: Grant-free non-orthogonal multiple access has been regarded as a viable approach to accommodate access for a massive number of machine-type devices with small data packets. The sporadic activation of the devices creates a multiuser setup where it is suitable to use compressed sensing in order to detect the active devices and decode their data. We consider asynchronous access of machine-type devices that send data packets of different frame sizes, leading to data length diversity. We address the composite problem of activity detection, channel estimation, and data recovery by posing it as a structured sparse recovery, having three-level sparsity caused by sporadic activity, symbol delay, and data length diversity. We approach the problem through approximate message passing with a backward propagation algorithm (AMP-BP), tailored to exploit the sparsity, and in particular the data length diversity. Moreover, we unfold the proposed AMP-BP into a model-driven network, termed learned AMP-BP (LAMP-BP), which enhances detection performance. The results show that the proposed LAMP-BP outperforms existing methods in activity detection and data recovery accuracy.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2303.08999 [pdf, other]

doi 10.1109/ICCAD51958.2021.9643472

A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU

Authors: Wenqian Zhao, Qi Sun, Yang Bai, Wenbo Li, Haisheng Zheng, Bei Yu, Martin D. F. Wong

Abstract: Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models was analyzed in detail and accelerated via a novel dictionary selective strategy. Besides, the hardware programming architecture together with the model structure is analyzed to guide the optimal design of computation kernels to minimize the inference latency under the resource constraints. With these novel techniques, the communication and computation bottlenecks in the deep dictionary learning-based SR models are tackled perfectly. The experiments on the edge embedded NVIDIA NX and 2080Ti show that our method outperforms the state-of-the-art NVIDIA TensorRT significantly, and can achieve real-time performance.

Submitted 15 March, 2023; originally announced March 2023.

arXiv:2301.12048 [pdf, other]

Making Reconstruction-based Method Great Again for Video Anomaly Detection

Authors: Yizhou Wang, Can Qin, Yue Bai, Yi Xu, Xu Ma, Yun Fu

Abstract: Anomaly detection in videos is a significant yet challenging problem. Previous approaches based on deep neural networks employ either reconstruction-based or prediction-based approaches. Nevertheless, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependency; 2) are prone to overfit the training samples, leading to indistinguishable reconstruction errors of normal and abnormal frames during the inference phase. To address such issues, firstly, we draw inspiration from the transformer and propose the Spatio-Temporal Auto-Trans-Encoder, dubbed STATE, as a new autoencoder model for enhanced consecutive frame reconstruction. Our STATE is equipped with a specifically designed learnable convolutional attention module for efficient temporal learning and reasoning. Secondly, we put forward a novel reconstruction-based input perturbation technique during testing to further differentiate anomalous frames. With the same perturbation magnitude, the testing reconstruction error of the normal frames lowers more than that of the abnormal frames, which contributes to mitigating the overfitting problem of reconstruction. Owing to the high relevance of the frame abnormality and the objects in the frame, we conduct object-level reconstruction using both the raw frame and the corresponding optical flow patches. Finally, the anomaly score is designed based on the combination of the raw and motion reconstruction errors using perturbed inputs. Extensive experiments on benchmark video anomaly detection datasets demonstrate that our approach outperforms previous reconstruction-based methods by a notable margin, and achieves state-of-the-art anomaly detection performance consistently. The code is available at https://github.com/wyzjack/MRMGA4VAD.

Submitted 27 January, 2023; originally announced January 2023.

Comments: Accepted by ICDM 2022

arXiv:2301.10314 [pdf, other]

WhisperWand: Simultaneous Voice and Gesture Tracking Interface

Authors: Yang Bai, Irtaza Shahid, Harshvardhan Takawale, Nirupam Roy

Abstract: This paper presents the design and implementation of WhisperWand, a comprehensive voice and motion tracking interface for voice assistants. Distinct from prior works, WhisperWand is a precise tracking interface that can co-exist with the voice interface on low sampling rate voice assistants. Taking handwriting as a specific application, it can also capture natural strokes and the individualized style of writing while occupying only a single frequency. The core technique includes an accurate acoustic ranging method called Cross Frequency Continuous Wave (CFCW) sonar, enabling voice assistants to use ultrasound as a ranging signal while using the regular microphone system of voice assistants as a receiver. We also design a new optimization algorithm that only requires a single frequency for time difference of arrival. The WhisperWand prototype achieves 73 µm of median error for 1D ranging and 1.4 mm of median error in 3D tracking of an acoustic beacon using the microphone array used in voice assistants. Our implementation of an in-air handwriting interface achieves 94.1% accuracy with automatic handwriting-to-text software, similar to writing on paper (96.6%). At the same time, the error rate of voice-based user authentication only increases from 6.26% to 8.28%.

Submitted 24 January, 2023; originally announced January 2023.

arXiv:2209.14539 [pdf, other]

Transmission Model for Resonant Beam SWIPT with Telescope Internal Modulator

Authors: Wen Fang, Yunfeng Bai, Qingwen Liu, Shengli Zhou

Abstract: To satisfy the long-range and energy self-sustaining communication needs of electronic devices in the Internet of Things (IoT), we introduce a simultaneous wireless information and power transfer (SWIPT) system using the resonant beam that incorporates a telescope modulator inside a cavity for suppressing diffraction losses. We theoretically analyze power transfer in the resonant beam system with telescope internal modulator (TIM-RBS) considering the electromagnetic field propagation, the end-to-end (E2E) power transfer, and power and information reception. The numerical evaluation demonstrates that the TIM can effectively compress the beam spot, which allows the TIM-RBS to transmit energy twice as far as the RBS without TIM at higher power. Additionally, the largest transmission distance and maximum output power are proportional to the input power, and about 34 m transmission distance, 4 W electric power, and 12 bps/Hz spectral efficiency can be achieved in the TIM-RBS with 200 W input power. Hence, TIM-RBS can be considered a promising option for realizing long-range, high-power, and high-rate SWIPT.

Submitted 29 September, 2022; originally announced September 2022.

arXiv:2209.08326 [pdf, other]

Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition

Authors: Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang

Abstract: While transformers and their variant conformers show promising performance in speech recognition, the parameterized property leads to much memory cost during training and inference. Some works use cross-layer weight-sharing to reduce the parameters of the model. However, the inevitable loss of capacity harms the model performance. To address this issue, this paper proposes a parameter-efficient conformer via sharing sparsely-gated experts. Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a conformer block without increasing computation. Then, the parameters of the grouped conformer blocks are shared so that the number of parameters is reduced. Next, to give the shared blocks the flexibility to adapt representations at different levels, we design the MoE routers and normalization individually. Moreover, we use knowledge distillation to further improve the performance. Experimental results show that the proposed model achieves competitive performance with 1/3 of the parameters of the encoder, compared with the full-parameter model.

Submitted 17 September, 2022; originally announced September 2022.

Comments: accepted in INTERSPEECH 2022

arXiv:2209.05951 [pdf, ps, other]

Data-Driven Compressed Sensing for Massive Wireless Access

Authors: Yanna Bai, Wei Chen, Feifei Sun, Bo Ai, Petar Popovski

Abstract: The central challenge in massive machine-type communications (mMTC) is to connect a large number of uncoordinated devices through a limited spectrum. The typical mMTC communication pattern is sporadic, with short packets. This could be exploited in grant-free random access in which the activity detection, channel estimation, and data recovery are formulated as a sparse recovery problem and solved via compressed sensing algorithms. This approach results in new challenges in terms of high computational complexity and latency. We present how data-driven methods can be applied in grant-free random access and demonstrate the performance gains. Variations of neural networks for the problem are discussed, as well as future challenges and potential directions.

Submitted 28 September, 2022; v1 submitted 13 September, 2022; originally announced September 2022.

Comments: In IEEE Communications Magazine, vol. 60, no. 11, 2022

arXiv:2209.04847 [pdf, other]

Deep Lossy Plus Residual Coding for Lossless and Near-lossless Image Compression

Authors: Yuanchao Bai, Xianming Liu, Kai Wang, Xiangyang Ji, Xiaolin Wu, Wen Gao

Abstract: Lossless and near-lossless image compression is of paramount importance to professional users in many technical fields, such as medicine, remote sensing, precision engineering and scientific research. But despite rapidly growing research interests in learning-based image compression, no published method offers both lossless and near-lossless modes. In this paper, we propose a unified and powerful deep lossy plus residual (DLPR) coding framework for both lossless and near-lossless image compression. In the lossless mode, the DLPR coding system first performs lossy compression and then lossless coding of residuals. We solve the joint lossy and residual compression problem in the approach of VAEs, and add autoregressive context modeling of the residuals to enhance lossless compression performance. In the near-lossless mode, we quantize the original residuals to satisfy a given $\ell_\infty$ error bound, and propose a scalable near-lossless compression scheme that works for variable $\ell_\infty$ bounds instead of training multiple networks. To expedite the DLPR coding, we increase the degree of algorithm parallelization by a novel design of coding context, and accelerate the entropy coding with adaptive residual interval. Experimental results demonstrate that the DLPR coding system achieves both the state-of-the-art lossless and near-lossless image compression performance with competitive coding speed.

Submitted 10 January, 2024; v1 submitted 11 September, 2022; originally announced September 2022.

Comments: manuscript accepted by TPAMI, source code: https://github.com/BYchao100/Deep-Lossy-Plus-Residual-Coding
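
The abstract above mentions quantizing residuals so that the reconstruction meets a given $\ell_\infty$ error bound. A minimal sketch of one common way to do this for integer residuals is a uniform quantizer with step size 2τ+1. Whether this matches the paper's exact scheme is an assumption; the names `quantize_residual` and `tau`, and the pixel values, are hypothetical and used only for illustration.

```python
def quantize_residual(r: int, tau: int) -> int:
    """Uniformly quantize an integer residual r with step 2*tau + 1.

    Every residual is mapped to the nearest multiple of the step, so
    |r - quantize_residual(r, tau)| <= tau. With tau == 0 the quantizer
    is the identity, which corresponds to the lossless case.
    """
    step = 2 * tau + 1
    sign = 1 if r >= 0 else -1
    return sign * step * ((abs(r) + tau) // step)


# Hypothetical pixel value and lossy reconstruction, for illustration only.
original = 137
lossy = 131
tau = 2

residual = original - lossy                 # what lossless mode would code
q_residual = quantize_residual(residual, tau)
near_lossless = lossy + q_residual          # reconstruction in near-lossless mode

# The reconstruction error equals |residual - q_residual| and is bounded by tau.
print(residual, q_residual, near_lossless, abs(original - near_lossless))
```

Because the quantized residuals take fewer distinct values than the original ones, they are cheaper to entropy-code, which is the usual motivation for this trade-off.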

arXiv:2204.08187 [pdf, other]

Securing Signal-free Intersections against Strategic Jamming Attacks: A Macroscopic Approach

Authors: Yumeng Bai, Saurabh Amin, Xudong Wang, Li Jin

Abstract: We consider the security-by-design of a signal-free intersection for connected and autonomous vehicles in the face of strategic jamming attacks. We use a fluid model to characterize macroscopic traffic flow through the intersection, where the saturation rate is derived from a vehicle coordination algorithm. We model jamming attacks as a sudden increase in communication latency induced on vehicle-to-infrastructure connectivity; such latency triggers the safety mode for vehicle coordination and thus reduces the intersection saturation rate. A strategic attacker selects the attacking rate, while a system operator selects key design parameters, either the saturation rate or the recovery rate. Both players' actions induce technological costs and jointly determine the mean travel delay. By analyzing the equilibrium of the security game, we study the preferable level of investment in the intersection's nominal discharging capability or recovery capability.

Submitted 18 September, 2022; v1 submitted 18 April, 2022; originally announced April 2022.

Comments: Accepted by 2022 IEEE Conference on Decision and Control (CDC)

arXiv:2204.03329 [pdf]

Information-driven Path Planning for Hybrid Aerial Underwater Vehicles

Authors: Zheng Zeng, Chengke Xiong, Xinyi Yuan, Yulin Bai, Yufei Jin, Di Lu, Lian Lian

Abstract: This paper presents a novel Rapidly-exploring Adaptive Sampling Tree (RAST) algorithm for the adaptive sampling mission of a hybrid aerial underwater vehicle (HAUV) in an air-sea 3D environment. This algorithm innovatively combines the tournament-based point selection sampling strategy, the information heuristic search process and the framework of the Rapidly-exploring Random Tree (RRT) algorithm. Hence, it can guide the vehicle to the region of interest to scientists for sampling and generate a collision-free path for maximizing information collection by the HAUV under the constraints of environmental effects (currents or wind) and a limited budget. The simulation results show that the proposed RAST algorithm has higher optimization performance, faster solution speed and better stability than the Rapidly-exploring Information Gathering Tree (RIGT) algorithm and the particle swarm optimization (PSO) algorithm.

Submitted 8 April, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

arXiv:2203.02291 [pdf, other]

Freeform Body Motion Generation from Speech

Authors: Jing Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei

Abstract: People naturally conduct spontaneous body motions to enhance their speeches while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompose the co-speech motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation model (FreeMo) by equipping a two-stream architecture, i.e., a pose mode branch for primary posture generation, and a rhythmic motion branch for rhythmic dynamics synthesis. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synced with the speech prosody. Extensive experiments demonstrate the superior performance against several baselines, in terms of motion diversity, quality and syncing with speech. Code and pre-trained models will be publicly available through https://github.com/TheTempAccount/Co-Speech-Motion-Generation.

Submitted 4 March, 2022; originally announced March 2022.

arXiv:2202.08433 [pdf, ps, other]

ADD 2022: the First Audio Deep Synthesis Detection Challenge

Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

Abstract: Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances with various real-world noises etc. The PF track aims to distinguish the partially fake audio from the real. The FG track is a rivalry game, which includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.

Submitted 26 February, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2112.09300 [pdf, other]

Towards End-to-End Image Compression and Analysis with Transformers

Authors: Yuanchao Bai, Xu Yang, Xianming Liu, Junjun Jiang, Yaowei Wang, Xiangyang Ji, Wen Gao

Abstract: We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications. Instead of placing an existing Transformer-based image classification model directly after an image codec, we aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder are injected with convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with the selected intermediate features of the Transformer, and feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features can obtain the long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.

Submitted 16 December, 2021; originally announced December 2021.

Comments: Accepted by AAAI 2022; Code: https://github.com/BYchao100/Towards-Image-Compression-and-Analysis-with-Transformers

arXiv:2109.04960 [pdf, other]

Automatic Displacement and Vibration Measurement in Laboratory Experiments with A Deep Learning Method

Authors: Yongsheng Bai, Ramzi M. Abduallah, Halil Sezen, Alper Yilmaz

Abstract: This paper proposes a pipeline to automatically track and measure displacement and vibration of structural specimens during laboratory experiments. The latest Mask Regional Convolutional Neural Network (Mask R-CNN) can locate the targets and monitor their movement from videos recorded by a stationary camera. To improve precision and remove noise, techniques such as Scale-invariant Feature Transform (SIFT) and various signal-processing filters are included. Experiments on three small-scale reinforced concrete beams and a shaking table test are used to verify the proposed method. Results show that the proposed deep learning method can automatically and precisely measure the motion of tested structural members during laboratory experiments.

Submitted 10 September, 2021; originally announced September 2021.

Journal ref: IEEE Sensors 2021

arXiv:2108.00004 [pdf, ps, other]

Long-Range Optical Wireless Information and Power Transfer

Authors: Yunfeng Bai, Qingwen Liu, Riqing Chen, Qingqing Zhang, Wei Wang

Abstract: Simultaneous wireless information and power transfer (SWIPT) is a remarkable technology to support both data and energy transfer in the era of the Internet of Things (IoT). In this paper, we propose a long-range optical wireless information and power transfer system that uses retro-reflectors, a gain medium, and a telescope internal modulator to form the resonant beam, achieving high-power and high-rate SWIPT. We adopt the transfer matrix, which can describe the beam modulation, resonator stability, transmission loss, and beam distribution. Then, we provide a model for energy harvesting and data receiving, which can evaluate the SWIPT performance. Numerical results illustrate that the proposed system can simultaneously supply 0$\sim$9 W of electrical power and 18 bit/s/Hz spectral efficiency over a 20 m distance.

Submitted 6 July, 2022; v1 submitted 29 July, 2021; originally announced August 2021.

arXiv:2107.14458 [pdf, ps, other]

High-Efficiency Resonant Beam Charging and Communication

Authors: Yunfeng Bai, Qingwen Liu, Xin Wang, Yudan Gou, Bin Zhou, Zhiyong Bu

Abstract: With the development of the Internet of Things (IoT), the power and data demands of IoT devices are increasing drastically. To resolve the supply-demand contradiction, simultaneous wireless information and power transfer (SWIPT) has been envisioned as an enabling technology that provides high-power energy transfer and high-rate data delivery concurrently. In this paper, we introduce a high-efficiency resonant beam (RB) charging and communication scheme. The scheme uses semiconductor materials as the gain medium, which has a better energy absorption capacity than the traditional solid-state one. Moreover, a telescope internal modulator (TIM) is adopted in the scheme, which can concentrate beams to match the gain size, reducing the transmission loss. To evaluate the scheme's SWIPT performance, we establish an analytical model and study the factors that influence its beam transmission, energy conversion, output power, and spectral efficiency. Numerical results show that the proposed RB system can realize 16 W electric power output with 11% end-to-end conversion efficiency, and support 18 bit/s/Hz spectral efficiency for communication.

Submitted 4 January, 2024; v1 submitted 30 July, 2021; originally announced July 2021.

arXiv:2105.13174 [pdf, other]

Charging A Smartphone Over the Air: The Resonant Beam Charging Method

Authors: Qingwen Liu, Mingqing Xiong, Mingqing Liu, Qingwei Jiang, Wen Fang, Yunfeng Bai

Abstract: Wireless charging for mobile Internet of Things (IoT) devices such as smartphones is extremely difficult. To reduce energy dissipation during wireless transmission in mobile scenarios, laser or narrow radio beams with sophisticated tracking control are typically required. However, reaching the necessary tracking accuracy and reliability is very difficult. In this paper, inspired by the features of optical resonators and retroreflectors, we develop an experiment on a self-aligned resonant beam charging system for long-distance mobile power transfer. It exploits light resonances inside a double-retroreflector-based spatially separated laser resonator (SSLR), which eliminates the requirement for any kind of tracking control. Focal telecentric cat's eye retroreflectors are employed here. The SSLR was investigated by both theoretical calculation and experiment. We also assembled the transmitter and the receiver and demonstrated their application in mobile smartphone charging. The results show that more than 5 W of optical power (yielding more than 0.6 W of electrical power) is transferred with negligible diffraction loss to a few-centimeter-size receiver while the receiver moves arbitrarily within a 2-m vertical distance and a 6° field of view from the transmitter. The maximum horizontal moving range is up to 18 cm. This wireless charging system empowers a smartphone in mobile operation with unlimited battery life without the need for a cable.

Submitted 12 January, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

arXiv:2104.07286 [pdf, other]

doi 10.21437/Interspeech.2021-794

Continual Learning for Fake Audio Detection

Authors: Haoxin Ma, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang

Abstract: Fake audio attacks have become a major threat to speaker verification systems. Although current detection approaches have achieved promising results on dataset-specific scenarios, they encounter difficulties on unseen spoofing data. Fine-tuning and retraining from scratch have been applied to incorporate new data. However, fine-tuning leads to performance degradation on previous data, and retraining takes a lot of time and computation resources. Besides, previous data are unavailable in some situations due to privacy. To solve the above problems, this paper proposes detecting fake without forgetting, a continual-learning-based method, to make the model learn new spoofing attacks incrementally. A knowledge distillation loss is introduced into the loss function to preserve the memory of the original model. Supposing the distribution of genuine voice is consistent among different scenarios, an extra embedding similarity loss is used as another constraint to further align positive samples. Experiments are conducted on the ASVspoof2019 dataset. The results show that our proposed method outperforms fine-tuning, with a relative reduction of average equal error rate of up to 81.62%.

Submitted 15 April, 2021; originally announced April 2021.

Comments: 5 pages, conference

Journal ref: Proc. Interspeech 2021, 886-890

arXiv:2104.03617 [pdf, other]

Half-Truth: A Partially Fake Audio Detection Dataset

Authors: Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu

Abstract: Diverse promising datasets, such as the ASVspoof databases, have been designed to advance the development of fake audio detection. However, previous datasets ignore an attacking situation in which the hacker hides some small fake clips in real speech audio. This poses a serious threat since it is difficult to distinguish the small fake clip from the whole speech utterance. Therefore, this paper develops such a dataset for half-truth audio detection (HAD). Partially fake audio in the HAD dataset involves only changing a few words in an utterance. The audio of the words is generated with the very latest state-of-the-art speech synthesis technology. Using this dataset, we can not only detect fake utterances but also localize manipulated regions in a speech. Some benchmark results are presented on this dataset. The results show that partially fake audio is much more challenging than fully fake audio for fake audio detection. The HAD dataset is publicly available: https://zenodo.org/records/10377492.

Submitted 15 December, 2023; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: accepted by Interspeech 2021

arXiv:2104.02882 [pdf, other]

FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization

Authors: Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

Abstract: Transducer-based models, such as RNN-Transducer and transformer-transducer, have achieved great success in speech recognition. A typical transducer model decodes the output sequence step by step, conditioned on the current acoustic state and previously predicted tokens. Statistically, blank tokens account for nearly 90\% of all tokens in the prediction results. It takes a lot of computation and time to predict the blank tokens, but only the non-blank tokens will appear in the final output sequence. Therefore, we propose a method named fast-skip regularization, which tries to align the blank positions predicted by a transducer with those predicted by a CTC model. During inference, the transducer model can predict the blank tokens in advance with a simple CTC projection layer, without many complicated forward calculations of the transducer decoder, and then skip them, which greatly reduces the computation and improves the inference speed. All experiments are conducted on a public Chinese Mandarin dataset, AISHELL-1. The results show that the fast-skip regularization can indeed help the transducer model learn the blank position alignments. Besides, inference with fast-skip can be sped up nearly 4 times with only a little performance degradation.

Submitted 6 April, 2021; originally announced April 2021.

Comments: Submitted to INTERSPEECH2021
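
The skipping step described in the FSR abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `frames_to_decode`, the per-frame blank probabilities, and the 0.5 threshold are all assumptions made for illustration. The idea shown is only that frames a CTC layer marks as blank are dropped before the more expensive transducer decoder runs.

```python
def frames_to_decode(ctc_blank_probs, threshold=0.5):
    """Return indices of frames that still need the full transducer decoder.

    ctc_blank_probs: per-frame probability of the blank token from a CTC
    projection layer (hypothetical input for this sketch).
    threshold: frames with blank probability at or above this value are
    skipped (an assumed decision rule, not taken from the paper).
    """
    return [t for t, p in enumerate(ctc_blank_probs) if p < threshold]


# Example: five frames, three of which the CTC layer considers blank.
probs = [0.99, 0.20, 0.95, 0.40, 0.97]
kept = frames_to_decode(probs)
skipped_fraction = 1 - len(kept) / len(probs)
```

With a blank rate near 90%, as the abstract reports, most frames would be skipped, which is where the speed-up would come from.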

arXiv:2104.01522 [pdf, other]

doi 10.1109/LSP.2022.3152128

TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition

Authors: Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen, Xuefei Liu

Abstract: Autoregressive (AR) models, such as attention-based encoder-decoder models and RNN-Transducer, have achieved great success in speech recognition. They predict the output sequence conditioned on the previous tokens and acoustic encoded states, which is inefficient on GPUs. Non-autoregressive (NAR) models can remove the temporal dependency between the output tokens and predict all output tokens in as few as one step. However, the NAR model still faces two major problems. On the one hand, there is still a great gap in performance between the NAR models and the advanced AR models. On the other hand, most NAR models are difficult to train and slow to converge. To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model. Furthermore, we introduce the two-stage method into the inference process, which improves the model performance greatly. All the experiments are conducted on a public Chinese Mandarin dataset, AISHELL-1. The results show that the TSNAT can achieve performance competitive with the AR model and outperform many complicated NAR models.

Submitted 3 April, 2021; originally announced April 2021.

Comments: Submitted to Interspeech2021

arXiv:2103.17015 [pdf, other]

Learning Scalable $\ell_\infty$-constrained Near-lossless Image Compression via Joint Lossy Image and Residual Compression

Authors: Yuanchao Bai, Xianming Liu, Wangmeng Zuo, Yaowei Wang, Xiangyang Ji

Abstract: We propose a novel joint lossy image and residual compression framework for learning $\ell_\infty$-constrained near-lossless image compression. Specifically, we obtain a lossy reconstruction of the raw image through lossy image compression and uniformly quantize the corresponding residual to satisfy a given tight $\ell_\infty$ error bound. When the error bound is zero, i.e., lossless image compression, we formulate the joint optimization problem of compressing both the lossy image and the original residual in terms of variational auto-encoders and solve it with end-to-end training. To achieve scalable compression with an error bound larger than zero, we derive the probability model of the quantized residual by quantizing the learned probability model of the original residual, instead of training multiple networks. We further correct the bias of the derived probability model caused by the context mismatch between training and inference. Finally, the quantized residual is encoded according to the bias-corrected probability model and is concatenated with the bitstream of the compressed lossy image. Experimental results demonstrate that our near-lossless codec achieves state-of-the-art performance for lossless and near-lossless image compression, and achieves competitive PSNR with a much smaller $\ell_\infty$ error compared with lossy image codecs at high bit rates.

Submitted 31 March, 2021; originally announced March 2021.

Comments: Accepted by CVPR 2021; Code: https://github.com/BYchao100/Scalable-Near-lossless-Image-Compression
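
The uniform residual quantization mentioned in the abstract above can be sketched in a few lines of Python. This uses the standard near-lossless quantizer with bin width $2\tau + 1$ for integer residuals, which is an assumption here; the abstract does not spell out the exact quantizer, and the names `quantize_residual` and `tau` are hypothetical.

```python
def quantize_residual(residual, tau):
    """Uniformly quantize integer residuals so that |r - r_hat| <= tau.

    Each bin has width 2*tau + 1 and is centred on a multiple of that
    width, so the reconstruction error never exceeds tau. With tau = 0
    the residual is returned unchanged (the lossless case).
    """
    step = 2 * tau + 1
    quantized = []
    for r in residual:
        sign = (r > 0) - (r < 0)  # -1, 0, or +1
        quantized.append(sign * step * ((abs(r) + tau) // step))
    return quantized


# Example: residuals between a raw image and its lossy reconstruction.
residual = [-7, -2, 0, 2, 3, 8]
r_hat = quantize_residual(residual, tau=2)
max_error = max(abs(r - q) for r, q in zip(residual, r_hat))
```

Larger `tau` leaves fewer distinct quantized values to encode, which is the trade-off between bit rate and the $\ell_\infty$ error bound that the abstract refers to as scalable compression.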

arXiv:2103.15858 [pdf, other]

CateNorm: Categorical Normalization for Robust Medical Image Segmentation

Authors: Junfei Xiao, Lequan Yu, Zongwei Zhou, Yutong Bai, Lei Xing, Alan Yuille, Yuyin Zhou

Abstract: Batch normalization (BN) uniformly shifts and scales the activations based on the statistics of a batch of images. However, the intensity distribution of the background pixels often dominates the BN statistics because the background accounts for a large proportion of the entire image. This paper focuses on enhancing BN with the intensity distribution of foreground pixels, the one that really matters for image segmentation. We propose a new normalization strategy, named categorical normalization (CateNorm), to normalize the activations according to categorical statistics. The categorical statistics are obtained by dynamically modulating specific regions in an image that belong to the foreground. CateNorm demonstrates both precise and robust segmentation results across five public datasets obtained from different domains, covering complex and variable data distributions. This is attributable to the ability of CateNorm to capture domain-invariant information from multiple domains (institutions) of medical data. Code is available at https://github.com/lambert-x/CateNorm.

Submitted 4 August, 2022; v1 submitted 29 March, 2021; originally announced March 2021.

Comments: Accepted by MICCAI 2022 Workshop on Domain Adaptation and Representation Transfer (DART)

arXiv:2103.11565 [pdf, other]

Switching Controller Synthesis for Delay Hybrid Systems under Perturbations

Authors: Yunjun Bai, Ting Gan, Li Jiao, Bican Xia, Bai Xue, Naijun Zhan

Abstract: Delays are ubiquitous in modern hybrid systems, which exhibit both continuous and discrete dynamical behaviors. Induced by signal transmission, conversion, the nature of plants, and so on, delays may appear either in the continuous evolution of a hybrid system, such that the evolution depends not only on the present state but also on its execution history, or in the discrete switching between its different control modes. In this paper we introduce a new model of hybrid systems, called \emph{delay hybrid automata}, to capture the dynamics of systems with the aforementioned two kinds of delays. Furthermore, based upon this model we study the robust switching controller synthesis problem such that the controlled delay system is able to satisfy the specified safety properties regardless of perturbations. To this end, a novel method is proposed to synthesize switching controllers based on the computation of differential invariants for continuous evolution and backward reachable sets of discrete jumps with delays. Finally, we implement a prototypical tool of our approach and demonstrate it on some case studies.

Submitted 21 March, 2021; originally announced March 2021.

arXiv:2102.07594 [pdf, other]

Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT

Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

Abstract: Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech utterance, which implicitly contains the token-level relationships, we can predict a token without explicit autoregressive language modeling. When the prediction of a token does not rely on other tokens, the parallel prediction of all tokens in the sequence is realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS). The three modules are based on basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. Finally, the probability distribution over the vocabulary is computed for each token position. Therefore, speech recognition is re-formulated as a position-wise classification problem. Further, we propose a cross-modal transfer learning method to refine semantics from the large-scale pre-trained language model BERT to improve the performance.

Submitted 29 August, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 14 pages, 7 figures

arXiv:2012.10533 [pdf, other]

Atlas-ISTN: Joint Segmentation, Registration and Atlas Construction with Image-and-Spatial Transformer Networks

Authors: Matthew Sinclair, Andreas Schuh, Karl Hahn, Kersten Petersen, Ying Bai, James Batten, Michiel Schaap, Ben Glocker

Abstract: Deep learning models for semantic segmentation are able to learn powerful representations for pixel-wise predictions, but are sensitive to noise at test time and do not guarantee a plausible topology. Image registration models, on the other hand, are able to warp known topologies to target images as a means of segmentation, but typically require large amounts of training data, and have not been widely benchmarked against pixel-wise segmentation models. We propose Atlas-ISTN, a framework that jointly learns segmentation and registration on 2D and 3D image data, and constructs a population-derived atlas in the process. Atlas-ISTN learns to segment multiple structures of interest and to register the constructed, topologically consistent atlas labelmap to an intermediate pixel-wise segmentation. Additionally, Atlas-ISTN allows for test time refinement of the model's parameters to optimize the alignment of the atlas labelmap to an intermediate pixel-wise segmentation. This process both mitigates noise in the target image that can result in spurious pixel-wise predictions and improves upon the one-pass prediction of the model. Benefits of the Atlas-ISTN framework are demonstrated qualitatively and quantitatively on 2D synthetic data and 3D cardiac computed tomography and brain magnetic resonance image data, out-performing both segmentation and registration baseline models. Atlas-ISTN also provides inter-subject correspondence of the structures of interest, enabling population-level shape and motion analysis.

Submitted 18 December, 2020; originally announced December 2020.

Comments: 33 pages, 15 figures

arXiv:2011.03098 [pdf]

End-to-end Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales

Authors: Yongsheng Bai, Halil Sezen, Alper Yilmaz

Abstract: Robust Mask R-CNN (Mask Regional Convolutional Neural Network) methods are proposed and tested for automatic detection of cracks on structures or their components that may be damaged during extreme events, such as earthquakes. We curated a new dataset with 2,021 labeled images for training and validation and aimed to find end-to-end deep neural networks for crack detection in the field. With data augmentation and parameter fine-tuning, Path Aggregation Network (PANet) with spatial attention mechanisms and High-resolution Network (HRNet) are introduced into Mask R-CNNs. The tests on three public datasets with low- or high-resolution images demonstrate that the proposed methods can achieve a big improvement over alternative networks, so the proposed method may be sufficient for crack detection at a variety of scales in real applications.

Submitted 5 November, 2020; originally announced November 2020.

arXiv:2010.14798 [pdf, other]

Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition

Authors: Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Zhengqi Wen

Abstract: Despite the recent significant advances in end-to-end (E2E) ASR systems for code-switching, the need for audio-text paired data limits further improvement of the models' performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the problem of code-switching data shortage. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network can learn acoustic pattern scenarios using large-scale monolingual paired data. Meanwhile, it generates multiple phoneme sequence candidates for a single audio sample in real time during the training process. Then the generated phoneme-text paired data is used to train the P2T network. This network can be pre-trained with large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the high dependency of E2E models on code-switching paired training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on the public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative mix error rate reduction.

Submitted 28 October, 2020; originally announced October 2020.

Comments: 5 pages, 1 figure

arXiv:2010.14791 [pdf, other]

One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition

Authors: Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

Abstract: RNN-Transducers and improved attention-based encoder-decoder models are widely applied to streaming speech recognition. Compared with these two end-to-end models, the CTC model is more efficient in training and inference. However, it cannot capture the linguistic dependencies between the output tokens. Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and the two-stage inference method into the streaming CTC model. During inference, the CTC decoder first generates many candidates in a streaming fashion. Then the transformer decoder selects the best candidate based on the corresponding acoustic encoded states. The second-stage transformer decoder can be regarded as a conditional language model. We assume that a large enough number and enough diversity of candidates generated in the first stage can compensate for the CTC model's lack of language modeling ability. All the experiments are conducted on a Chinese Mandarin dataset, AISHELL-1. The results show that our proposed model can implement streaming decoding in a fast and straightforward way. Our model can achieve up to a 20% reduction in the character error rate compared with the baseline CTC model. In addition, our model can also perform non-streaming inference with only a little performance degradation.

Submitted 3 April, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Showing 1–50 of 68 results for author: Bai, Y