-
A Foundation Model for Brain Lesion Segmentation with Mixture of Modality Experts
Authors:
Xinru Zhang,
Ni Ou,
Berke Doga Basaran,
Marco Visentin,
Mengyun Qiao,
Renyang Gu,
Cheng Ouyang,
Yaou Liu,
Paul M. Matthew,
Chuyang Ye,
Wenjia Bai
Abstract:
Brain lesion segmentation plays an essential role in neurological research and diagnosis. As brain lesions can be caused by various pathological alterations, different types of brain lesions tend to manifest with different characteristics on different imaging modalities. Due to this complexity, brain lesion segmentation methods are often developed in a task-specific manner. A specific segmentation model is developed for a particular lesion type and imaging modality. However, the use of task-specific models requires predetermination of the lesion type and imaging modality, which complicates their deployment in real-world scenarios. In this work, we propose a universal foundation model for 3D brain lesion segmentation, which can automatically segment different types of brain lesions for input data of various imaging modalities. We formulate a novel Mixture of Modality Experts (MoME) framework with multiple expert networks attending to different imaging modalities. A hierarchical gating network combines the expert predictions and fosters expertise collaboration. Furthermore, we introduce a curriculum learning strategy during training to avoid the degeneration of each expert network and preserve their specialization. We evaluated the proposed method on nine brain lesion datasets, encompassing five imaging modalities and eight lesion types. The results show that our model outperforms state-of-the-art universal models and provides promising generalization to unseen datasets.
Submitted 16 May, 2024;
originally announced May 2024.
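
The gating idea in the abstract above, where a gating network combines the predictions of several modality experts, can be illustrated with a minimal sketch. This is a flat, single-level softmax gate, not the hierarchical gating network of the paper; the function names and the use of one scalar gate logit per expert are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_preds, gate_logits):
    """Weighted sum of expert predictions.

    expert_preds: one list of per-voxel foreground probabilities per expert.
    gate_logits:  one scalar per expert (hypothetical gate output).
    """
    weights = softmax(gate_logits)
    n = len(expert_preds[0])
    return [
        sum(w * pred[i] for w, pred in zip(weights, expert_preds))
        for i in range(n)
    ]


if __name__ == "__main__":
    preds = [[1.0, 0.0], [0.0, 1.0]]  # two experts, two voxels
    print(combine_experts(preds, [0.0, 0.0]))
```

With equal gate logits the experts are simply averaged; a larger logit for one expert moves the combined prediction toward that expert's output.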
-
Gull: A Generative Multifunctional Audio Codec
Authors:
Yi Luo,
Jianwei Yu,
Hangting Chen,
Rongzhi Gu,
Chao Weng
Abstract:
We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recent progress in audio source separation, (2) gain-shape representations motivated by traditional audio codecs, (3) improved residual vector quantization modules, (4) elastic decoder network that enables user-defined model size and complexity during inference time, (5) built-in ability for audio super-resolution without the increase of bitrate. We compare Gull with existing traditional and neural audio codecs and show that Gull is able to achieve on par or better performance across various sample rates, bitrates and model complexities in both subjective and objective evaluation metrics.
Submitted 7 June, 2024; v1 submitted 7 April, 2024;
originally announced April 2024.
-
AISPACE at SemEval-2024 task 8: A Class-balanced Soft-voting System for Detecting Multi-generator Machine-generated Text
Authors:
Renhua Gu,
Xiangfeng Meng
Abstract:
SemEval-2024 Task 8 poses the challenge of distinguishing human-written from machine-generated text. It has three subtasks covering different detection scenarios. This paper proposes a system that mainly addresses Subtask B, which aims to determine whether a given full text is written by a human or generated by a specific Large Language Model (LLM), making it a multi-class text classification task. Our team, AISPACE, conducted a systematic study of fine-tuning transformer-based models, including encoder-only, decoder-only, and encoder-decoder models. We compared their performance on this task and found that encoder-only models performed exceptionally well. We also applied a weighted cross-entropy loss function to address the imbalance between class samples. Additionally, we employed a soft-voting strategy over a multi-model ensemble to enhance the reliability of our predictions. Our system ranked first in Subtask B, which sets a state-of-the-art benchmark for this new challenge.
Submitted 1 April, 2024;
originally announced April 2024.
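
The two techniques named in the AISPACE abstract above, a class-weighted cross-entropy loss and soft voting over an ensemble, can be sketched as follows. The abstract does not say how the class weights were chosen, so the inverse-frequency scheme here is an assumption, and all function names are hypothetical.

```python
import math


def class_weights(labels, num_classes):
    """Inverse-frequency weights: n_samples / (num_classes * count_c).

    This weighting scheme is an assumption, not taken from the paper.
    """
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p[y], normalised by the sum of sample weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


def soft_vote(model_probs):
    """Average per-class probabilities across models, then take the argmax."""
    num_models = len(model_probs)
    num_classes = len(model_probs[0])
    avg = [
        sum(m[c] for m in model_probs) / num_models
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=lambda c: avg[c]), avg


if __name__ == "__main__":
    print(class_weights([0, 0, 0, 1], num_classes=2))
    print(soft_vote([[0.6, 0.4], [0.2, 0.8], [0.45, 0.55]]))
```

Soft voting averages the probability vectors rather than counting hard votes, so a single confident model can outweigh models that are close to undecided.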
-
Static vs. Dynamic Databases for Indoor Localization based on Wi-Fi Fingerprinting: A Discussion from a Data Perspective
Authors:
Zhe Tang,
Ruocheng Gu,
Sihao Li,
Kyeong Soo Kim,
Jeremy S. Smith
Abstract:
Wi-Fi fingerprinting has emerged as the most popular approach to indoor localization. The use of ML algorithms has greatly improved the localization performance of Wi-Fi fingerprinting, but its success depends on the availability of fingerprint databases composed of a large number of RSSIs, the MAC addresses of access points, and the other measurement information. However, most fingerprint databases do not reflect well the time varying nature of electromagnetic interferences in complicated modern indoor environment. This could result in significant changes in statistical characteristics of training/validation and testing datasets, which are often constructed at different times, and even the characteristics of the testing datasets could be different from those of the data submitted by users during the operation of localization systems after their deployment. In this paper, we consider the implications of time-varying Wi-Fi fingerprints on indoor localization from a data-centric point of view and discuss the differences between static and dynamic databases. As a case study, we have constructed a dynamic database covering three floors of the IR building of XJTLU based on RSSI measurements, over 44 days, and investigated the differences between static and dynamic databases in terms of statistical characteristics and localization performance. The analyses based on variance calculations and Isolation Forest show the temporal shifts in RSSIs, which result in a noticeable trend of the increase in the localization error of a Gaussian process regression model with the maximum error of 6.65 m after 14 days of training without model adjustments. The results of the case study with the XJTLU dynamic database clearly demonstrate the limitations of static databases and the importance of the creation and adoption of dynamic databases for future indoor localization research and real-world deployment.
Submitted 20 February, 2024;
originally announced February 2024.
-
MLCommons Cloud Masking Benchmark with Early Stopping
Authors:
Varshitha Chennamsetti,
Gregor von Laszewski,
Ruochen Gu,
Laiba Mehnaz,
Juri Papay,
Samuel Jackson,
Jeyan Thiyagalingam,
Sergey V. Samsonau,
Geoffrey C. Fox
Abstract:
In this paper, we report on work performed for the MLCommons Science Working Group on the cloud masking benchmark. MLCommons is a consortium that develops and maintains several scientific benchmarks that aim to benefit developments in AI. The benchmarks are conducted on the High Performance Computing (HPC) Clusters of New York University and University of Virginia, as well as a commodity desktop. We provide a description of the cloud masking benchmark, as well as a summary of our submission to MLCommons on the benchmark experiment we conducted. It includes a modification to the reference implementation of the cloud masking benchmark enabling early stopping. This benchmark is executed on the NYU HPC through a custom batch script that runs the various experiments through the batch queuing system while allowing for variation on the number of epochs trained. Our submission includes the modified code, a custom batch script to modify epochs, documentation, and the benchmark results. We report the highest accuracy (scientific metric) and the average time taken (performance metric) for training and inference that was achieved on NYU HPC Greene. We also provide a comparison of the compute capabilities between different systems by running the benchmark for one epoch. Our submission can be found in a Globus repository that is accessible to MLCommons Science Working Group.
Submitted 30 May, 2024; v1 submitted 11 December, 2023;
originally announced January 2024.
-
SECap: Speech Emotion Captioning with Large Language Model
Authors:
Yaoxun Xu,
Hangting Chen,
Jianwei Yu,
Qiaochu Huang,
Zhiyong Wu,
Shixiong Zhang,
Guangzhi Li,
Yi Luo,
Rongzhi Gu
Abstract:
Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately represent speech emotions. On the contrary, describing speech emotions directly by means of natural language may be a more effective approach. Regrettably, there are not many studies available that have focused on this direction. Therefore, this paper proposes a speech emotion captioning framework named SECap, aiming at effectively describing speech emotions using natural language. Owing to the impressive capabilities of large language models in language comprehension and text generation, SECap employs LLaMA as the text decoder to allow the production of coherent speech emotion captions. In addition, SECap leverages HuBERT as the audio encoder to extract general speech features and Q-Former as the Bridge-Net to provide LLaMA with emotion-related speech features. To accomplish this, Q-Former utilizes mutual information learning to disentangle emotion-related speech features and speech contents, while implementing contrastive learning to extract more emotion-related speech features. The results of objective and subjective evaluations demonstrate that: 1) the SECap framework outperforms the HTSAT-BART baseline in all objective evaluations; 2) SECap can generate high-quality speech emotion captions that attain performance on par with human annotators in subjective mean opinion score tests.
Submitted 23 December, 2023; v1 submitted 16 December, 2023;
originally announced December 2023.
-
An Overview of MLCommons Cloud Mask Benchmark: Related Research and Data
Authors:
Gregor von Laszewski,
Ruochen Gu
Abstract:
Cloud masking is a crucial task that is well-motivated for meteorology and its applications in environmental and atmospheric sciences. Its goal is, given satellite images, to accurately generate cloud masks that identify each pixel in image to contain either cloud or clear sky. In this paper, we summarize some of the ongoing research activities in cloud masking, with a focus on the research and benchmark currently conducted in MLCommons Science Working Group. This overview is produced with the hope that others will have an easier time getting started and collaborate on the activities related to MLCommons Cloud Mask Benchmark.
Submitted 7 December, 2023;
originally announced December 2023.
-
TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction
Authors:
Ruiquan Ge,
Xiangyang Hu,
Rungen Huang,
Gangyong Jia,
Yaqi Wang,
Renshu Gu,
Changmiao Wang,
Elazab Ahmed,
Linyan Wang,
Juan Ye,
Ye Li
Abstract:
Survival prediction plays a crucial role in assisting clinicians with the development of cancer treatment protocols. Recent evidence shows that multimodal data can help in the diagnosis of cancer disease and improve survival prediction. Currently, deep learning-based approaches have experienced increasing success in survival prediction by integrating pathological images and gene expression data. However, most existing approaches overlook the intra-modality latent information and the complex inter-modality correlations. Furthermore, existing modalities do not fully exploit the immense representational capabilities of neural networks for feature aggregation and disregard the importance of relationships between features. Therefore, it is highly recommended to address these issues in order to enhance the prediction performance by proposing a novel deep learning-based method. We propose a novel framework named Two-stream Transformer-based Multimodal Fusion Network for survival prediction (TTMFN), which integrates pathological images and gene expression data. In TTMFN, we present a two-stream multimodal co-attention transformer module to take full advantage of the complex relationships between different modalities and the potential connections within the modalities. Additionally, we develop a multi-head attention pooling approach to effectively aggregate the feature representations of the two modalities. The experiment results on four datasets from The Cancer Genome Atlas demonstrate that TTMFN can achieve the best performance or competitive results compared to the state-of-the-art methods in predicting the overall survival of patients.
Submitted 12 November, 2023;
originally announced November 2023.
-
A Parallel Feature-preserving Mesh Variable Offsetting Method with Dynamic Programming
Authors:
Hongyi Cao,
Gang Xu,
Renshu Gu,
Jinlan Xu,
Xiaoyu Zhang,
Timon Rabczuk
Abstract:
Mesh offsetting plays an important role in discrete geometric processing. In this paper, we propose a parallel feature-preserving mesh offsetting framework with variable distance. Different from the traditional method based on distance and normal vector, a new calculation of offset position is proposed by using dynamic programming and quadratic programming, and the sharp feature can be preserved after offsetting. Instead of distance implicit field, a spatial coverage region represented by polyhedral for computing offsets is proposed. Our method can generate an offsetting model with smaller mesh size, and also can achieve high quality without gaps, holes, and self-intersections. Moreover, several acceleration techniques are proposed for the efficient mesh offsetting, such as the parallel computing with grid, AABB tree and rays computing. In order to show the efficiency and robustness of the proposed framework, we have tested our method on the quadmesh dataset, which is available at [https://www.quadmesh.cloud]. The source code of the proposed algorithm is available on GitHub at [https://github.com/iGame-Lab/PFPOffset].
Submitted 13 October, 2023;
originally announced October 2023.
-
Innovative Digital Storytelling with AIGC: Exploration and Discussion of Recent Advances
Authors:
Rongzhang Gu,
Hui Li,
Changyue Su,
Wayne Wu
Abstract:
Digital storytelling, as an art form, has struggled with cost-quality balance. The emergence of AI-generated Content (AIGC) is considered as a potential solution for efficient digital storytelling production. However, the specific form, effects, and impacts of this fusion remain unclear, leaving the boundaries of AIGC combined with storytelling undefined. This work explores the current integration state of AIGC and digital storytelling, investigates the artistic value of their fusion in a sample project, and addresses common issues through interviews. Through our study, we conclude that AIGC, while proficient in image creation, voiceover production, and music composition, falls short of replacing humans due to the irreplaceable elements of human creativity and aesthetic sensibilities at present, especially in complex character animations, facial expressions, and sound effects. The research objective is to increase public awareness of the current state, limitations, and challenges arising from combining AIGC and digital storytelling.
Submitted 28 September, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
UPL-SFDA: Uncertainty-aware Pseudo Label Guided Source-Free Domain Adaptation for Medical Image Segmentation
Authors:
Jianghao Wu,
Guotai Wang,
Ran Gu,
Tao Lu,
Yinan Chen,
Wentao Zhu,
Tom Vercauteren,
Sébastien Ourselin,
Shaoting Zhang
Abstract:
Domain Adaptation (DA) is important for deep learning-based medical image segmentation models to deal with testing images from a new target domain. As the source-domain data are usually unavailable when a trained model is deployed at a new center, Source-Free Domain Adaptation (SFDA) is appealing for data and annotation-efficient adaptation to the target domain. However, existing SFDA methods have a limited performance due to lack of sufficient supervision with source-domain images unavailable and target-domain images unlabeled. We propose a novel Uncertainty-aware Pseudo Label guided (UPL) SFDA method for medical image segmentation. Specifically, we propose Target Domain Growing (TDG) to enhance the diversity of predictions in the target domain by duplicating the pre-trained model's prediction head multiple times with perturbations. The different predictions in these duplicated heads are used to obtain pseudo labels for unlabeled target-domain images and their uncertainty to identify reliable pseudo labels. We also propose a Twice Forward pass Supervision (TFS) strategy that uses reliable pseudo labels obtained in one forward pass to supervise predictions in the next forward pass. The adaptation is further regularized by a mean prediction-based entropy minimization term that encourages confident and consistent results in different prediction heads. UPL-SFDA was validated with a multi-site heart MRI segmentation dataset, a cross-modality fetal brain segmentation dataset, and a 3D fetal tissue segmentation dataset. It improved the average Dice by 5.54, 5.01 and 6.89 percentage points for the three tasks compared with the baseline, respectively, and outperformed several state-of-the-art SFDA methods.
Submitted 18 September, 2023;
originally announced September 2023.
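
The mean prediction-based entropy term mentioned in the UPL-SFDA abstract above can be illustrated for a single voxel: the class probabilities from the duplicated prediction heads are averaged, and the entropy of that mean is the quantity to be minimized. The abstract gives no formula, so this is a sketch of the general idea; the function names and the `eps` smoothing constant are assumptions.

```python
import math


def mean_prediction(head_probs):
    """Average the class-probability vectors produced by each head."""
    num_heads = len(head_probs)
    num_classes = len(head_probs[0])
    return [
        sum(h[c] for h in head_probs) / num_heads
        for c in range(num_classes)
    ]


def entropy(probs, eps=1e-12):
    """Shannon entropy in nats; eps avoids log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)


def mean_prediction_entropy(head_probs):
    """Entropy of the mean prediction across heads (one voxel)."""
    return entropy(mean_prediction(head_probs))


if __name__ == "__main__":
    agree = [[0.95, 0.05], [0.9, 0.1]]
    disagree = [[0.95, 0.05], [0.05, 0.95]]
    print(mean_prediction_entropy(agree), mean_prediction_entropy(disagree))
```

The entropy of the mean is low only when the heads are both confident and in agreement, which is why minimizing it encourages confident and consistent predictions across heads.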
-
ReZero: Region-customizable Sound Extraction
Authors:
Rongzhi Gu,
Yi Luo
Abstract:
We introduce region-customizable sound extraction (ReZero), a general and flexible framework for the multi-channel region-wise sound extraction (R-SE) task. R-SE task aims at extracting all active target sounds (e.g., human speech) within a specific, user-defined spatial region, which is different from conventional and existing tasks where a blind separation or a fixed, predefined spatial region are typically assumed. The spatial region can be defined as an angular window, a sphere, a cone, or other geometric patterns. Being a solution to the R-SE task, the proposed ReZero framework includes (1) definitions of different types of spatial regions, (2) methods for region feature extraction and aggregation, and (3) a multi-channel extension of the band-split RNN (BSRNN) model specified for the R-SE task. We design experiments for different microphone array geometries, different types of spatial regions, and comprehensive ablation studies on different system configurations. Experimental results on both simulated and real-recorded data demonstrate the effectiveness of ReZero. Demos are available at https://innerselfm.github.io/rezero/.
Submitted 31 August, 2023;
originally announced August 2023.
-
Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression
Authors:
Hangting Chen,
Jianwei Yu,
Yi Luo,
Rongzhi Gu,
Weihua Li,
Zhuocheng Lu,
Chao Weng
Abstract:
Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to replace manually designed filters for dimension reduction. For time compression, only using frame skipped prediction causes large performance degradation, which can be alleviated by a post-processing network with full sequence modeling. We have found that under fixed compression ratios, dual-path compression combining both the time and frequency methods will give further performance improvement, covering compression ratios from 4x to 32x with little model size change. Moreover, the proposed models show competitive performance compared with fast FullSubNet and DeepFilterNet.
Submitted 10 October, 2023; v1 submitted 21 August, 2023;
originally announced August 2023.
-
The Sound Demixing Challenge 2023 – Cinematic Demixing Track
Authors:
Stefan Uhlich,
Giorgio Fabbro,
Masato Hirano,
Shusuke Takahashi,
Gordon Wichern,
Jonathan Le Roux,
Dipam Chakraborty,
Sharada Mohanty,
Kai Li,
Yi Luo,
Jianwei Yu,
Rongzhi Gu,
Roman Solovyev,
Alexander Stempkovskiy,
Tatiana Habruseva,
Mikhail Sukhovei,
Yuki Mitsufuji
Abstract:
This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7 dB. A significant source of this improvement was making the simulated data better match real cinematic audio, which we further investigate in detail.
Submitted 18 April, 2024; v1 submitted 14 August, 2023;
originally announced August 2023.
-
Fast Random Approximation of Multi-channel Room Impulse Response
Authors:
Yi Luo,
Rongzhi Gu
Abstract:
Modern neural-network-based speech processing systems are typically required to be robust against reverberation, and the training of such systems thus needs a large amount of reverberant data. During the training of the systems, on-the-fly simulation pipeline is nowadays preferred as it allows the model to train on infinite number of data samples without pre-generating and saving them on harddisk. An RIR simulation method thus needs to not only generate more realistic artificial room impulse response (RIR) filters, but also generate them in a fast way to accelerate the training process. Existing RIR simulation tools have proven effective in a wide range of speech processing tasks and neural network architectures, but their usage in on-the-fly simulation pipeline remains questionable due to their computational complexity or the quality of the generated RIR filters. In this paper, we propose FRAM-RIR, a fast random approximation method of the widely-used image-source method (ISM), to efficiently generate realistic multi-channel RIR filters. FRAM-RIR bypasses the explicit calculation of sound propagation paths in ISM-based algorithms by randomly sampling the location and number of reflections of each virtual sound source based on several heuristic assumptions, while still maintains accurate direction-of-arrival (DOA) information of all sound sources. Visualization of oracle beampatterns and directional features shows that FRAM-RIR can generate more realistic RIR filters than existing widely-used ISM-based tools, and experiment results on multi-channel noisy speech separation and dereverberation tasks with a wide range of neural network architectures show that models trained with FRAM-RIR can also achieve on par or better performance on real RIRs compared to other RIR simulation tools with a significantly accelerated training procedure. A Python implementation of FRAM-RIR is released.
Submitted 17 April, 2023;
originally announced April 2023.
-
3D Neural Beamforming for Multi-channel Speech Separation Against Location Uncertainty
Authors:
Rongzhi Gu,
Shi-Xiong Zhang,
Dong Yu
Abstract:
Multi-channel speech separation using speaker's directional information has demonstrated significant gains over blind speech separation. However, it has two limitations. First, substantial performance degradation is observed when the coming directions of two sounds are close. Second, the result highly relies on the precise estimation of the speaker's direction. To overcome these issues, this paper proposes 3D features and an associated 3D neural beamformer for multi-channel speech separation. Previous works in this area are extended in two important directions. First, the traditional 1D directional beam patterns are generalized to 3D. This enables the model to extract speech from any target region in the 3D space. Thus, speakers with similar directions but different elevations or distances become separable. Second, to handle the speaker location uncertainty, previously proposed spatial feature is extended to a new 3D region feature. The proposed 3D region feature and 3D neural beamformer are evaluated under an in-car scenario. Experimental results demonstrated that the combination of 3D feature and 3D beamformer can achieve comparable performance to the separation model with ground truth speaker location as input.
Submitted 26 February, 2023;
originally announced February 2023.
-
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation
Authors:
Rongzhi Gu,
Shi-Xiong Zhang,
Yuexian Zou,
Dong Yu
Abstract:
Recently, frequency domain all-neural beamforming methods have achieved remarkable progress for multichannel speech separation. In parallel, the integration of time domain network structures and beamforming has also gained significant attention. This study proposes a novel all-neural beamforming method in the time domain and attempts to unify the all-neural beamforming pipelines for time domain and frequency domain multichannel speech separation. The proposed model consists of two modules: separation and beamforming. Both modules perform temporal-spectral-spatial modeling and are trained end-to-end using a joint loss function. The novelty of this study is twofold. First, a time domain directional feature conditioned on the direction of the target speaker is proposed, which can be jointly optimized within the time domain architecture to enhance target signal estimation. Second, an all-neural beamforming network in the time domain is designed to refine the pre-separated results. This module features parametric time-variant beamforming coefficient estimation, without explicitly following the derivation of optimal filters that may lead to an upper bound. The proposed method is evaluated on simulated reverberant overlapped speech data derived from the AISHELL-1 corpus. Experimental results demonstrate significant performance improvements over frequency domain state-of-the-art methods, ideal magnitude masks and existing time domain neural beamforming methods.
Submitted 23 December, 2022; v1 submitted 16 December, 2022;
originally announced December 2022.
-
CDDSA: Contrastive Domain Disentanglement and Style Augmentation for Generalizable Medical Image Segmentation
Authors:
Ran Gu,
Guotai Wang,
Jiangshan Lu,
Jingyang Zhang,
Wenhui Lei,
Yinan Chen,
Wenjun Liao,
Shichuan Zhang,
Kang Li,
Dimitris N. Metaxas,
Shaoting Zhang
Abstract:
Generalization to previously unseen images with potential domain shifts and different styles is essential for clinically applicable medical image segmentation, and the ability to disentangle domain-specific and domain-invariant features is key for achieving Domain Generalization (DG). However, existing DG methods can hardly achieve effective disentanglement to get high generalizability. To deal with this problem, we propose an efficient Contrastive Domain Disentanglement and Style Augmentation (CDDSA) framework for generalizable medical image segmentation. First, a disentangle network is proposed to decompose an image into a domain-invariant anatomical representation and a domain-specific style code, where the former is sent to a segmentation model that is not affected by the domain shift, and the disentangle network is regularized by a decoder that combines the anatomical and style codes to reconstruct the input image. Second, to achieve better disentanglement, a contrastive loss is proposed to encourage the style codes from the same domain and different domains to be compact and divergent, respectively. Thirdly, to further improve generalizability, we propose a style augmentation method based on the disentanglement representation to synthesize images in various unseen styles with shared anatomical structures. Our method was validated on a public multi-site fundus image dataset for optic cup and disc segmentation and an in-house multi-site Nasopharyngeal Carcinoma Magnetic Resonance Image (NPC-MRI) dataset for nasopharynx Gross Tumor Volume (GTVnx) segmentation. Experimental results showed that the proposed CDDSA achieved remarkable generalizability across different domains, and it outperformed several state-of-the-art methods in domain-generalizable segmentation.
Submitted 22 November, 2022;
originally announced November 2022.
-
Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters
Authors:
Junyi Peng,
Themos Stafylakis,
Rongzhi Gu,
Oldřich Plchot,
Ladislav Mošner,
Lukáš Burget,
Jan Černocký
Abstract:
Recently, pre-trained Transformer models have received rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a comprehensive analysis of applying parameter-efficient transfer learning (PETL) methods to reduce the required learnable parameters for adapting to speaker verification tasks. Specifically, during the fine-tuning process, the pre-trained models are frozen, and only lightweight modules inserted in each Transformer block are trainable (a method known as adapters). Moreover, to boost the performance in a cross-language low-resource scenario, the Transformer model is further tuned on a large intermediate dataset before directly fine-tuning it on a small dataset. By updating fewer than 4% of parameters, our proposed PETL-based methods achieve performance comparable to full fine-tuning methods (Vox1-O: 0.55%, Vox1-E: 0.82%, Vox1-H: 1.73%).
Submitted 28 October, 2022;
originally announced October 2022.
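As a rough illustration of the adapter idea summarized in the abstract above (a frozen backbone with small trainable modules inside each Transformer block), the sketch below shows one common bottleneck design in plain Python. The bottleneck layout, the ReLU nonlinearity, the zero initialization of the up-projection, and all names (`make_adapter`, `adapter_forward`, `adapter_param_count`) are assumptions made for illustration; the listing does not describe the authors' actual implementation.

```python
import random


def make_adapter(d_model, bottleneck, seed=0):
    """Create a hypothetical bottleneck adapter (biases omitted for brevity)."""
    rng = random.Random(seed)
    # Down-projection: d_model -> bottleneck, small random weights.
    down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck)]
            for _ in range(d_model)]
    # Up-projection: bottleneck -> d_model, zero-initialized so the adapter
    # starts out as an identity mapping around the frozen block.
    up = [[0.0] * d_model for _ in range(bottleneck)]
    return {"down": down, "up": up}


def adapter_forward(x, adapter):
    """Apply the adapter to one feature vector x with a residual connection."""
    d_model = len(x)
    bottleneck = len(adapter["up"])
    # Project down and apply ReLU.
    hidden = [
        max(0.0, sum(x[i] * adapter["down"][i][j] for i in range(d_model)))
        for j in range(bottleneck)
    ]
    # Project back up.
    delta = [
        sum(hidden[j] * adapter["up"][j][k] for j in range(bottleneck))
        for k in range(d_model)
    ]
    # Residual: frozen features plus the learned correction.
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(d_model, bottleneck):
    """Trainable weights in one adapter (two projection matrices, no biases)."""
    return 2 * d_model * bottleneck
```

Only the `down` and `up` matrices would be updated during fine-tuning in this sketch, which is why the trainable fraction stays small when `bottleneck` is much smaller than `d_model`.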
-
PyMIC: A deep learning toolkit for annotation-efficient medical image segmentation
Authors:
Guotai Wang,
Xiangde Luo,
Ran Gu,
Shuojue Yang,
Yijie Qu,
Shuwei Zhai,
Qianfei Zhao,
Kang Li,
Shaoting Zhang
Abstract:
Background and Objective: Open-source deep learning toolkits are one of the driving forces for developing medical image segmentation models. Existing toolkits mainly focus on fully supervised segmentation and require full and accurate pixel-level annotations that are time-consuming and difficult to acquire for segmentation tasks, which makes learning from imperfect labels highly desired for reducing the annotation cost. We aim to develop a new deep learning toolkit to support annotation-efficient learning for medical image segmentation.
Methods: Our proposed toolkit named PyMIC is a modular deep learning library for medical image segmentation tasks. In addition to basic components that support development of high-performance models for fully supervised segmentation, it contains several advanced components tailored for learning from imperfect annotations, such as loading annotated and unannotated images, loss functions for unannotated, partially or inaccurately annotated images, and training procedures for co-learning between multiple networks, etc. PyMIC supports development of semi-supervised, weakly supervised and noise-robust learning methods for medical image segmentation.
Results: We present several illustrative medical image segmentation tasks based on PyMIC: (1) Achieving competitive performance on fully supervised learning; (2) Semi-supervised cardiac structure segmentation with only 10% training images annotated; (3) Weakly supervised segmentation using scribble annotations; and (4) Learning from noisy labels for chest radiograph segmentation.
Conclusions: The PyMIC toolkit is easy to use and facilitates efficient development of medical image segmentation models with imperfect annotations. It is modular and flexible, which enables researchers to develop high-performance models with low annotation cost. The source code is available at: https://github.com/HiLab-git/PyMIC.
Submitted 4 February, 2023; v1 submitted 19 August, 2022;
originally announced August 2022.
-
Contrastive Semi-supervised Learning for Domain Adaptive Segmentation Across Similar Anatomical Structures
Authors:
Ran Gu,
Jingyang Zhang,
Guotai Wang,
Wenhui Lei,
Tao Song,
Xiaofan Zhang,
Kang Li,
Shaoting Zhang
Abstract:
Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance for medical image segmentation, yet need plenty of manual annotations for training. Semi-Supervised Learning (SSL) methods are promising to reduce the requirement of annotations, but their performance is still limited when the dataset size and the number of annotated images are small. Leveraging existing annotated datasets with similar anatomical structures to assist training has a potential for improving the model's performance. However, it is further challenged by the cross-anatomy domain shift due to the different appearance and even imaging modalities from the target structure. To solve this problem, we propose Contrastive Semi-supervised learning for Cross Anatomy Domain Adaptation (CS-CADA) that adapts a model to segment similar structures in a target domain, which requires only limited annotations in the target domain by leveraging a set of existing annotated images of similar structures in a source domain. We use Domain-Specific Batch Normalization (DSBN) to individually normalize feature maps for the two anatomical domains, and propose a cross-domain contrastive learning strategy to encourage extracting domain invariant features. They are integrated into a Self-Ensembling Mean-Teacher (SE-MT) framework to exploit unlabeled target domain images with a prediction consistency constraint. Extensive experiments show that our CS-CADA is able to solve the challenging cross-anatomy domain shift problem, achieving accurate segmentation of coronary arteries in X-ray images with the help of retinal vessel images and cardiac MR images with the help of fundus images, respectively, given only a small number of annotations in the target domain.
Submitted 17 August, 2022;
originally announced August 2022.
-
Learning towards Synchronous Network Memorizability and Generalizability for Continual Segmentation across Multiple Sites
Authors:
Jingyang Zhang,
Peng Xue,
Ran Gu,
Yuning Gu,
Mianxin Liu,
Yongsheng Pan,
Zhiming Cui,
Jiawei Huang,
Lei Ma,
Dinggang Shen
Abstract:
In clinical practice, a segmentation network is often required to continually learn on a sequential data stream from multiple sites rather than a consolidated set, due to the storage cost and privacy restriction. However, during the continual learning process, existing methods are usually restricted in either network memorizability on previous sites or generalizability on unseen sites. This paper aims to tackle the challenging problem of Synchronous Memorizability and Generalizability (SMG) and to simultaneously improve performance on both previous and unseen sites, with a novel proposed SMG-learning framework. First, we propose a Synchronous Gradient Alignment (SGA) objective, which not only promotes the network memorizability by enforcing coordinated optimization for a small exemplar set from previous sites (called replay buffer), but also enhances the generalizability by facilitating site-invariance under simulated domain shift. Second, to simplify the optimization of SGA objective, we design a Dual-Meta algorithm that approximates the SGA objective as dual meta-objectives for optimization without expensive computation overhead. Third, for efficient rehearsal, we configure the replay buffer comprehensively considering additional inter-site diversity to reduce redundancy. Experiments on prostate MRI data sequentially acquired from six institutes demonstrate that our method can simultaneously achieve higher memorizability and generalizability over state-of-the-art methods. Code is available at https://github.com/Jingyzhang/SMG-Learning.
Submitted 27 June, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning
Authors:
Chengfei Lv,
Chaoyue Niu,
Renjie Gu,
Xiaotang Jiang,
Zhaode Wang,
Bin Liu,
Ziqi Wu,
Qiulin Yao,
Congyu Huang,
Panos Huang,
Tao Huang,
Hui Shu,
Jinde Song,
Bin Zou,
Peng Lan,
Guohuan Xu,
Fei Wu,
Shaojie Tang,
Fan Wu,
Guihai Chen
Abstract:
To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment, while facilitating daily task iteration. Specifically, the compute container is based on Mobile Neural Network (MNN), a tensor compute engine along with the data processing and model execution libraries, which are exposed through a refined Python thread-level virtual machine (VM) to support diverse ML tasks and concurrent task execution. The core of MNN is the novel mechanisms of operator decomposition and semi-auto search, sharply reducing the workload in manually optimizing hundreds of operators for tens of hardware backends and further quickly identifying the best backend with runtime optimization for a computation graph. The data pipeline introduces an on-device stream processing framework to enable processing user behavior data at source. The deployment platform releases ML tasks with an efficient push-then-pull method and supports multi-granularity deployment policies. We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability. Extensive micro-benchmarks also highlight the superior performance of MNN and the Python thread-level VM. Walle has been in large-scale production use in Alibaba, while MNN has been open source with a broad impact in the community.
Submitted 29 May, 2022;
originally announced May 2022.
-
VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Authors:
Yuxing Chen,
Renshu Gu,
Ouhan Huang,
Gangyong Jia
Abstract:
This paper presents Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, the sparse Sinkhorn attention is empowered to reduce the memory cost, which is a major bottleneck for volumetric representations, while also achieving excellent performance. The output of the transformer is again concatenated with 3D convolutional features by a residual design. The proposed VTP framework integrates the high performance of the transformer with volumetric representations, which can be used as a good alternative to the convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be available.
Submitted 25 May, 2022;
originally announced May 2022.
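The abstract above reports results in terms of Mean Per Joint Position Error (MPJPE). As a minimal sketch of how that metric is usually computed (the mean Euclidean distance between predicted and ground-truth 3D joints), the following plain-Python function may help; the function name `mpjpe` and the input layout (one `(x, y, z)` tuple per joint) are assumptions for illustration, and benchmark-specific details such as root alignment are left out.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between matching predicted and true joints.

    Both arguments are sequences of (x, y, z) coordinates in the same unit
    (commonly millimetres); the result is expressed in that unit.
    """
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)
```

For example, two joints that are off by 3 and 4 units respectively give an MPJPE of 3.5 units.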
-
Exploring the stimulative effect on following drivers in a consecutive lane-change using microscopic vehicle trajectory data
Authors:
Ruifeng Gu
Abstract:
Improper lane-changing behaviors may result in breakdown of traffic flow and the occurrence of various types of collisions. This study investigates lane-changing behaviors of multiple vehicles and the stimulative effect on following drivers in a consecutive lane-changing scenario. The microscopic trajectory data from the dataset are used for driving behavior analysis. Two discretionary lane-changing vehicle groups constitute a consecutive lane-changing scenario, and not only distance- and speed-related factors but also driving behaviors are taken into account to examine the impacts on the utility of following lane-changing vehicles. A random parameters logit model is developed to capture the driver psychological heterogeneity in the consecutive lane-changing situation. Furthermore, a lane-changing utility prediction model is established based on three supervised learning algorithms to detect the improper lane-changing decision. Results indicate that (1) the consecutive lane-changing behaviors have a significant negative effect on the following lane-changing vehicles after lane-change; (2) the stimulative effect exists in a consecutive lane-change situation and its influence is heterogeneous due to different psychological activities of drivers; and (3) the utility prediction model can be used to detect an improper lane-changing decision.
Submitted 4 June, 2022; v1 submitted 18 May, 2022;
originally announced May 2022.
-
Contrastive Domain Disentanglement for Generalizable Medical Image Segmentation
Authors:
Ran Gu,
Jiangshan Lu,
Jingyang Zhang,
Wenhui Lei,
Xiaofan Zhang,
Guotai Wang,
Shaoting Zhang
Abstract:
Efficiently utilizing discriminative features is crucial for convolutional neural networks to achieve remarkable performance in medical image segmentation and is also important for model generalization across multiple domains, where letting model recognize domain-specific and domain-invariant information among multi-site datasets is a reasonable strategy for domain generalization. Unfortunately, most of the recent disentangle networks are not directly adaptable to unseen-domain datasets because of the limitations of offered data distribution. To tackle this deficiency, we propose Contrastive Domain Disentangle (CDD) network for generalizable medical image segmentation. We first introduce a disentangle network to decompose medical images into an anatomical representation factor and a modality representation factor. Then, a style contrastive loss is proposed to encourage the modality representations from the same domain to distribute as close as possible while different domains are estranged from each other. Finally, we propose a domain augmentation strategy that can randomly generate new domains for model generalization training. Experimental results on multi-site fundus image datasets for optic cup and disc segmentation show that the CDD has good model generalization. Our proposed CDD outperforms several state-of-the-art methods in domain generalizable segmentation.
Submitted 13 May, 2022;
originally announced May 2022.
-
How to choose features to improve prediction performance in lane-changing intention: A meta-analysis
Authors:
Ruifeng Gu
Abstract:
Lane-change is a fundamental driving behavior and is highly associated with various types of collisions, such as rear-end collisions, sideswipe collisions, and angle collisions, and with an increased risk of a traffic crash. This study investigates the effectiveness of different combinations of feature categories in lane-changing intention prediction. Studies related to lane-changing intention prediction were selected following strict standards. A meta-analysis was then employed not only to evaluate the effectiveness of different combinations of feature categories in lane-changing intention prediction but also to capture heterogeneity, effect size combination, and publication bias. According to the meta-analysis and the reviewed research papers, results indicate that using input features of different types can lead to different performance. The vehicle input type performs better in lane-changing intention prediction than the environment or even the driver combination input type. Finally, some potential future research directions are proposed based on the findings of the paper.
Submitted 3 May, 2022;
originally announced May 2022.
-
Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention
Authors:
Xinmeng Xu,
Rongzhi Gu,
Yuexian Zou
Abstract:
Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in the end-to-end DMSE. In this work, a novel architecture for DMSE using a multi-head cross-attention based convolutional recurrent network (MHCA-CRN) is presented. The proposed MHCA-CRN model includes a channel-wise encoding structure for preserving intra-channel features and a multi-head cross-attention mechanism for fully exploiting cross-channel features. In addition, the proposed approach specifically formulates the decoder with an extra SNR estimator to estimate frame-level SNR under a multi-task learning framework, which is expected to avoid speech distortion led by end-to-end DMSE module. Finally, a spectral gain function is adopted to further suppress the unnatural residual noise. Experiment results demonstrated superior performance of the proposed model against several state-of-the-art models.
Submitted 2 May, 2022;
originally announced May 2022.
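The abstract above refers to two hand-crafted spatial features, inter-channel phase difference (IPD) and inter-channel intensity difference (IID). A minimal sketch of how they are commonly computed for a single time-frequency bin of a two-microphone STFT is given below. The function names `ipd` and `iid_db`, the dB scaling, and the `eps` stabilizer are assumptions for illustration; exact definitions vary between papers, and this is not taken from the work listed here.

```python
import cmath
import math


def ipd(y1, y2):
    """Inter-channel phase difference in radians, wrapped to (-pi, pi].

    y1 and y2 are the complex STFT values of the same time-frequency bin
    observed at microphone 1 and microphone 2.
    """
    return cmath.phase(y1 * y2.conjugate())


def iid_db(y1, y2, eps=1e-8):
    """Inter-channel intensity difference as a magnitude ratio in dB."""
    return 20.0 * math.log10((abs(y1) + eps) / (abs(y2) + eps))
```

In this sketch a bin that is twice as strong at microphone 1 and leads by 0.3 rad yields an IID of about 6 dB and an IPD of 0.3.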
-
Giallar: Push-Button Verification for the Qiskit Quantum Compiler
Authors:
Runzhou Tao,
Yunong Shi,
Jianan Yao,
Xupeng Li,
Ali Javadi-Abhari,
Andrew W. Cross,
Frederic T. Chong,
Ronghui Gu
Abstract:
This paper presents Giallar, a fully-automated verification toolkit for quantum compilers. Giallar requires no manual specifications, invariants, or proofs, and can automatically verify that a compiler pass preserves the semantics of quantum circuits. To deal with unbounded loops in quantum compilers, Giallar abstracts three loop templates, whose loop invariants can be automatically inferred. To efficiently check the equivalence of arbitrary input and output circuits that have complicated matrix semantics representation, Giallar introduces a symbolic representation for quantum circuits and a set of rewrite rules for showing the equivalence of symbolic quantum circuits. With Giallar, we implemented and verified 44 (out of 56) compiler passes in 13 versions of the Qiskit compiler, the open-source quantum compiler standard, during which three bugs were detected in and confirmed by Qiskit. Our evaluation shows that most of Qiskit compiler passes can be automatically verified in seconds and verification imposes only a modest overhead to compilation performance.
Submitted 2 May, 2022;
originally announced May 2022.
-
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
Authors:
Zifeng Zhao,
Rongzhi Gu,
Dongchao Yang,
Jinchuan Tian,
Yuexian Zou
Abstract:
Most existing research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), utilizing the consistency of speaker identity among target source, enrollment utterance and target estimate to weakly supervise the training of a deep speaker extractor. In SAMoM, the input is constructed by mixing up different speaker-aware mixtures (SAMs), each containing multiple speakers with their identities known and enrollment utterances available. Informed by enrollment utterances, target speech is extracted from the input one by one, such that the estimated targets can approximate the original SAMs after a remix in accordance with the identity consistency. Moreover, using SAMoM in a semi-supervised setting with a certain amount of clean sources enables application in noisy scenarios. Extensive experiments on Libri2Mix show that the proposed method achieves promising results without access to any clean sources (11.06dB SI-SDRi). With domain adaptation, our approach even outperformed the supervised framework in a cross-domain evaluation on AISHELL-1.
Submitted 15 April, 2022;
originally announced April 2022.
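The result quoted in the abstract above is given as SI-SDRi, the improvement in scale-invariant signal-to-distortion ratio over the unprocessed mixture. The sketch below shows the standard definition of SI-SDR in plain Python. The function names, the `eps` stabilizer, and the omission of mean removal are simplifications assumed for illustration; this is not the evaluation code used for the listed paper.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    # Optimal scaling of the target so that it best matches the estimate.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / (target_energy + eps)
    # Split the estimate into a scaled-target part and a residual part.
    scaled_target = [alpha * t for t in target]
    residual = [e - s for e, s in zip(estimate, scaled_target)]
    signal_power = sum(s * s for s in scaled_target)
    residual_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (residual_power + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Because the target is rescaled before the residual is measured, multiplying the estimate by any non-zero constant leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.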
-
Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches
Authors:
Zifeng Zhao,
Dongchao Yang,
Rongzhi Gu,
Haoran Zhang,
Yuexian Zou
Abstract:
Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance information may confuse the separation network and hence lead to wrong extraction results, which deteriorates the overall performance. We refer to this as the target confusion problem. In this paper, we conduct an analysis of this issue and solve it in two stages. In the training phase, we propose to integrate metric learning methods to improve the distinguishability of embeddings produced by the speaker encoder. For inference, a novel post-filtering strategy is designed to revise the wrong results. Specifically, we first identify these confusion samples by measuring the similarities between output estimates and enrollment utterances, after which the true target sources are recovered by a subtraction operation. Experiments show that a performance improvement of more than 1dB SI-SDRi can be obtained, which validates the effectiveness of our methods and emphasizes the impact of the target confusion problem.
Submitted 4 April, 2022;
originally announced April 2022.
-
Learning Decoupling Features Through Orthogonality Regularization
Authors:
Li Wang,
Rongzhi Gu,
Weiji Zhuang,
Peng Gao,
Yujun Wang,
Yuexian Zou
Abstract:
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that state-of-the-art KWS and SV models are trained independently using different datasets since they are expected to learn distinctive acoustic features. However, humans can distinguish language content and the speaker identity simultaneously. Motivated by this, we believe it is important to explore a method that can effectively extract common features while decoupling task-specific features. Bearing this in mind, a two-branch deep network (KWS branch and SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected, respectively. Experiments are conducted on Google Speech Commands Dataset (GSCD). The results demonstrate that the orthogonality regularization helps the network to achieve SOTA EER of 1.31% and 1.87% on KWS and SV, respectively.
Submitted 30 March, 2022;
originally announced March 2022.
-
On-Device Learning with Cloud-Coordinated Data Augmentation for Extreme Model Personalization in Recommender Systems
Authors:
Renjie Gu,
Chaoyue Niu,
Yikai Yan,
Fan Wu,
Shaojie Tang,
Rongfeng Jia,
Chengfei Lyu,
Guihai Chen
Abstract:
Data heterogeneity is an intrinsic property of recommender systems, making models trained over the global data on the cloud, which is the mainstream in industry, non-optimal to each individual user's local data distribution. To deal with data heterogeneity, model personalization with on-device learning is a potential solution. However, on-device training using a user's small size of local samples will incur severe overfitting and undermine the model's generalization ability. In this work, we propose a new device-cloud collaborative learning framework, called CoDA, to break the dilemmas of purely cloud-based learning and on-device learning. The key principle of CoDA is to retrieve similar samples from the cloud's global pool to augment each user's local dataset to train the recommendation model. Specifically, after a coarse-grained sample matching on the cloud, a personalized sample classifier is further trained on each device for a fine-grained sample filtering, which can learn the boundary between the local data distribution and the outside data distribution. We also build an end-to-end pipeline to support the flows of data, model, computation, and control between the cloud and each device. We have deployed CoDA in a recommendation scenario of Mobile Taobao. Online A/B testing results show the remarkable performance improvement of CoDA over both cloud-based learning without model personalization and on-device training without data augmentation. Overhead testing on a real device demonstrates the computation, storage, and communication efficiency of the on-device tasks in CoDA.
Submitted 23 January, 2022;
originally announced January 2022.
-
One-shot Weakly-Supervised Segmentation in Medical Images
Authors:
Wenhui Lei,
Qi Su,
Ran Gu,
Na Wang,
Xinglong Liu,
Guotai Wang,
Xiaofan Zhang,
Shaoting Zhang
Abstract:
Deep neural networks usually require a large number of accurate annotations to achieve outstanding performance in medical image segmentation. One-shot segmentation and weakly-supervised learning are promising research directions that lower labeling effort by learning a new class from only one annotated image and by utilizing coarse labels instead, respectively. Previous works usually fail to leverage the anatomical structure and suffer from class imbalance and low contrast problems. Hence, we present an innovative framework for 3D medical image segmentation with one-shot and weakly-supervised settings. First, a propagation-reconstruction network is proposed to project scribbles from an annotated volume to unlabeled 3D images, based on the assumption that anatomical patterns in different human bodies are similar. Then a dual-level feature denoising module is designed to refine the scribbles based on anatomical- and pixel-level features. After expanding the scribbles to pseudo masks, we can train a segmentation model for the new class with the noisy label training strategy. Experiments on one abdomen and one head-and-neck CT dataset show that the proposed method obtains significant improvement over the state-of-the-art methods and performs robustly even under severe class imbalance and low contrast.
Submitted 21 November, 2021;
originally announced November 2021.
-
Resistance-Time Co-Modulated PointNet for Temporal Super-Resolution Simulation of Blood Vessel Flows
Authors:
Zhizheng Jiang,
Fei Gao,
Renshu Gu,
Jinlan Xu,
Gang Xu,
Timon Rabczuk
Abstract:
In this paper, a novel deep learning framework is proposed for temporal super-resolution simulation of blood vessel flows, in which a high-temporal-resolution time-varying blood vessel flow simulation is generated from a low-temporal-resolution flow simulation result. In our framework, a point cloud is used to represent the complex blood vessel model, a resistance-time aided PointNet model is proposed for extracting the time-space features of the time-varying flow field, and finally the high-accuracy and high-resolution flow field is reconstructed through the Decoder module. In particular, the amplitude loss and the orientation loss of the velocity are proposed based on the vector characteristics of the velocity, and the combination of these two metrics constitutes the final loss function for network training. Several examples are given to illustrate the effectiveness and efficiency of the proposed framework for temporal super-resolution simulation of blood vessel flows.
Submitted 19 November, 2021;
originally announced November 2021.
-
Domain Composition and Attention for Unseen-Domain Generalizable Medical Image Segmentation
Authors:
Ran Gu,
Jingyang Zhang,
Rui Huang,
Wenhui Lei,
Guotai Wang,
Shaoting Zhang
Abstract:
Domain-generalizable models are attracting increasing attention in medical image analysis, since data is commonly acquired from different institutes with various imaging protocols and scanners. To tackle this challenging domain generalization problem, we propose a Domain Composition and Attention-based network (DCA-Net) to improve the ability of domain representation and generalization. First, we present a domain composition method that represents a given domain by a linear combination of a set of basis representations (i.e., a representation bank). Second, a novel plug-and-play parallel domain preceptor is proposed to learn these basis representations, and we introduce a divergence constraint function to encourage the basis representations to be as divergent as possible. Then, a domain attention module is proposed to learn the linear combination coefficients of the basis representations. The result of the linear combination is used to calibrate the feature maps of an input image, which enables the model to generalize to different and even unseen domains. We validate our method on a public prostate MRI dataset acquired from six different institutions with apparent domain shift. Experimental results show that our proposed model generalizes well to different and even unseen domains and outperforms state-of-the-art methods on the multi-domain prostate segmentation task.
Submitted 18 September, 2021;
originally announced September 2021.
-
Robust Beamforming Design for Rate Splitting Multiple Access-Aided MISO Visible Light Communications
Authors:
Shuai Ma,
Guanjie Zhang,
Zhi Zhang,
Rongyan Gu
Abstract:
In this paper, we focus on the optimal beamformer design for rate splitting multiple access (RSMA)-aided multiple-input single-output (MISO) visible light communication (VLC) networks. First, we derive closed-form lower bounds on the achievable rate of each user, which are the first theoretical bounds on the achievable rate for RSMA-aided VLC networks. Second, we investigate the optimal beamformer design for RSMA-aided VLC networks to maximize the sum rate under the optical and electrical power constraints. In addition, we show that the proposed RSMA-aided networks can achieve superior performance compared with space-division multiple access (SDMA) and non-orthogonal multiple access (NOMA).
Submitted 1 November, 2022; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Track without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking
Authors:
Gaoang Wang,
Renshu Gu,
Zuozhu Liu,
Weijie Hu,
Mingli Song,
Jenq-Neng Hwang
Abstract:
Vehicle tracking is an essential task in the multi-object tracking (MOT) field. A distinct characteristic in vehicle tracking is that the trajectories of vehicles are fairly smooth in both the world coordinate and the image coordinate. Hence, models that capture motion consistencies are of high necessity. However, tracking with the standalone motion-based trackers is quite challenging because targets could get lost easily due to limited information, detection error and occlusion. Leveraging appearance information to assist object re-identification could resolve this challenge to some extent. However, doing so requires extra computation while appearance information is sensitive to occlusion as well. In this paper, we try to explore the significance of motion patterns for vehicle tracking without appearance information. We propose a novel approach that tackles the association issue for long-term tracking with the exclusive fully-exploited motion information. We address the tracklet embedding issue with the proposed reconstruct-to-embed strategy based on deep graph convolutional neural networks (GCN). Comprehensive experiments on the KITTI-car tracking dataset and UA-Detrac dataset show that the proposed method, though without appearance information, could achieve competitive performance with the state-of-the-art (SOTA) trackers. The source code will be available at https://github.com/GaoangW/LGMTracker.
Submitted 12 August, 2021;
originally announced August 2021.
-
Text Anchor Based Metric Learning for Small-footprint Keyword Spotting
Authors:
Li Wang,
Rongzhi Gu,
Nuo Chen,
Yuexian Zou
Abstract:
For keyword spotting (KWS), achieving a good trade-off between small footprint and high accuracy remains challenging. Recently proposed metric learning approaches improved the generalizability of models for the KWS task, and 1D-CNN based KWS models have achieved the state of the art (SOTA) in terms of model size. However, for metric learning, due to data limitations, the speech anchor is highly susceptible to the acoustic environment and speakers. Also, we note that the 1D-CNN models have limited capability to capture long-term temporal acoustic features. To address the above problems, we propose to utilize text anchors to improve the stability of anchors. Furthermore, a new type of model (LG-Net) is carefully designed to promote long-short term acoustic feature modeling based on 1D-CNN and self-attention. Experiments are conducted on Google Speech Commands Dataset version 1 (GSCDv1) and 2 (GSCDv2). The results demonstrate that the proposed text anchor based metric learning method shows consistent improvements over speech anchor on representative CNN-based models. Moreover, our LG-Net model achieves SOTA accuracy of 97.67% and 96.79% on the two datasets, respectively. It is encouraging to see that our lighter LG-Net with only 74k parameters obtains 96.82% KWS accuracy on GSCDv1 and 95.77% KWS accuracy on GSCDv2.
Submitted 11 August, 2021;
originally announced August 2021.
-
LASOR: Learning Accurate 3D Human Pose and Shape Via Synthetic Occlusion-Aware Data and Neural Mesh Rendering
Authors:
Kaibing Yang,
Renshu Gu,
Maoyu Wang,
Masahiro Toyoura,
Gang Xu
Abstract:
A key challenge in the task of human pose and shape estimation is occlusion, including self-occlusions, object-human occlusions, and inter-person occlusions. The lack of diverse and accurate pose and shape training data becomes a major bottleneck, especially for scenes with occlusions in the wild. In this paper, we focus on the estimation of human pose and shape in the case of inter-person occlusions, while also handling object-human occlusions and self-occlusions. We propose a novel framework that synthesizes occlusion-aware silhouette and 2D keypoint data and directly regresses to the SMPL pose and shape parameters. A neural 3D mesh renderer is exploited to enable silhouette supervision on the fly, which contributes to great improvements in shape estimation. In addition, keypoints-and-silhouette-driven training data in panoramic viewpoints are synthesized to compensate for the lack of viewpoint diversity in any existing dataset. Experimental results show that our method is among the state of the art on the 3DPW and 3DPW-Crowd datasets in terms of pose estimation accuracy. The proposed method evidently outperforms Mesh Transformer, 3DCrowdNet and ROMP in terms of shape estimation. Top performance is also achieved on SSP-3D in terms of shape prediction accuracy. Demo and code will be available at https://igame-lab.github.io/LASOR/.
Submitted 29 January, 2022; v1 submitted 31 July, 2021;
originally announced August 2021.
-
HRegNet: A Hierarchical Network for Large-scale Outdoor LiDAR Point Cloud Registration
Authors:
Fan Lu,
Guang Chen,
Yinlong Liu,
Lijun Zhang,
Sanqing Qu,
Shu Liu,
Rongqi Gu
Abstract:
Point cloud registration is a fundamental problem in 3D computer vision. Outdoor LiDAR point clouds are typically large-scale and complexly distributed, which makes the registration challenging. In this paper, we propose an efficient hierarchical network named HRegNet for large-scale outdoor LiDAR point cloud registration. Instead of using all points in the point clouds, HRegNet performs registration on hierarchically extracted keypoints and descriptors. The overall framework combines the reliable features in deeper layers and the precise position information in shallower layers to achieve robust and precise registration. We present a correspondence network to generate correct and accurate keypoint correspondences. Moreover, bilateral consensus and neighborhood consensus are introduced for keypoint matching, and novel similarity features are designed to incorporate them into the correspondence network, which significantly improves the registration performance. In addition, the whole network is highly efficient since only a small number of keypoints are used for registration. Extensive experiments are conducted on two large-scale outdoor LiDAR point cloud datasets to demonstrate the high accuracy and efficiency of the proposed HRegNet. The project website is https://ispc-group.github.io/hregnet.
Submitted 26 July, 2021;
originally announced July 2021.
-
Ladder Polynomial Neural Networks
Authors:
Li-Ping Liu,
Ruiyuan Gu,
Xiaozhe Hu
Abstract:
Polynomial functions have plenty of useful analytical properties, but they are rarely used as learning models because their function class is considered to be restricted. This work shows that when trained properly polynomial functions can be strong learning models. Particularly this work constructs polynomial feedforward neural networks using the product activation, a new activation function constructed from multiplications. The new neural network is a polynomial function and provides accurate control of its polynomial order. It can be trained by standard training techniques such as batch normalization and dropout. This new feedforward network covers several previous polynomial models as special cases. Compared with common feedforward neural networks, the polynomial feedforward network has closed-form calculations of a few interesting quantities, which are very useful in Bayesian learning. In a series of regression and classification tasks in the empirical study, the proposed model outperforms previous polynomial models.
Submitted 29 June, 2021; v1 submitted 25 June, 2021;
originally announced June 2021.
-
Hash Adaptive Bloom Filter
Authors:
Rongbiao Xie,
Meng Li,
Zheyu Miao,
Rong Gu,
He Huang,
Haipeng Dai,
Guihai Chen
Abstract:
A Bloom filter is a compact, memory-efficient probabilistic data structure supporting membership testing, i.e., checking whether an element is in a given set. However, as a Bloom filter maps each element with uniformly random hash functions, little flexibility is provided even if information about negative keys (elements that are not in the set) is available. The problem gets worse when the misidentification of negative keys incurs different costs. To address the above problems, we propose a new Hash Adaptive Bloom Filter (HABF) that supports the customization of hash functions for keys. The key idea of HABF is to customize the hash functions for positive keys (elements that are in the set) to avoid negative keys with high cost, and to pack the customized hash functions into a lightweight data structure named HashExpressor. Then, given an element at query time, HABF follows a two-round pattern to check whether the element is in the set. Further, we theoretically analyze the performance of HABF and bound the expected false positive rate. We conduct extensive experiments on representative datasets, and the results show that HABF outperforms the standard Bloom filter and its cutting-edge variants overall in terms of accuracy, construction time, query time, and memory space consumption (note that source code is available in [1]).
Submitted 13 June, 2021;
originally announced June 2021.
-
SS-CADA: A Semi-Supervised Cross-Anatomy Domain Adaptation for Coronary Artery Segmentation
Authors:
Jingyang Zhang,
Ran Gu,
Guotai Wang,
Hongzhi Xie,
Lixu Gu
Abstract:
The segmentation of coronary arteries by convolutional neural network is promising yet requires a large amount of labor-intensive manual annotations. Transferring knowledge from retinal vessels in widely-available public labeled fundus images (FIs) has a potential to reduce the annotation requirement for coronary artery segmentation in X-ray angiograms (XAs) due to their common tubular structures. However, it is challenged by the cross-anatomy domain shift due to the intrinsically different vesselness characteristics in different anatomical regions under even different imaging protocols. To solve this problem, we propose a Semi-Supervised Cross-Anatomy Domain Adaptation (SS-CADA) which requires only limited annotations for coronary arteries in XAs. With the supervision from a small number of labeled XAs and publicly available labeled FIs, we propose a vesselness-specific batch normalization (VSBN) to individually normalize feature maps for them considering their different cross-anatomic vesselness characteristics. In addition, to further facilitate the annotation efficiency, we employ a self-ensembling mean-teacher (SEMT) to exploit abundant unlabeled XAs by imposing a prediction consistency constraint. Extensive experiments show that our SS-CADA is able to solve the challenging cross-anatomy domain shift, achieving accurate segmentation for coronary arteries given only a small number of labeled XAs.
Submitted 6 May, 2021;
originally announced May 2021.
-
Split and Connect: A Universal Tracklet Booster for Multi-Object Tracking
Authors:
Gaoang Wang,
Yizhou Wang,
Renshu Gu,
Weijie Hu,
Jenq-Neng Hwang
Abstract:
Multi-object tracking (MOT) is an essential task in the computer vision field. With the fast development of deep learning technology in recent years, MOT has achieved great improvement. However, some challenges still remain, such as sensitivity to occlusion, instability under different lighting conditions, non-robustness to deformable objects, etc. To address such common challenges in most of the existing trackers, in this paper, a tracklet booster algorithm is proposed, which can be built upon any other tracker. The motivation is simple and straightforward: split tracklets on potential ID-switch positions and then connect multiple tracklets into one if they are from the same object. In other words, the tracklet booster consists of two parts, i.e., Splitter and Connector. First, an architecture with stacked temporal dilated convolution blocks is employed for the splitting position prediction via a label smoothing strategy with adaptive Gaussian kernels. Then, a multi-head self-attention based encoder is exploited for the tracklet embedding, which is further used to connect tracklets into larger groups. We conduct sufficient experiments on the MOT17 and MOT20 benchmark datasets, which demonstrate promising results. Combined with the proposed tracklet booster, existing trackers can usually achieve large improvements on the IDF1 score, which shows the effectiveness of the proposed method.
Submitted 5 May, 2021;
originally announced May 2021.
-
Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency
Authors:
Jinchuan Tian,
Rongzhi Gu,
Helin Wang,
Yuexian Zou
Abstract:
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance. However, both the training and inference process of these models may encounter prohibitively high computational cost and large parameter budget. Although Parameter Sharing Strategy (PSS) proposed in ALBERT paves the way for parameter reduction, the computation required remains the same. Interestingly, we found in experiments that distributions of feature embeddings from different Transformer layers are similar when PSS is integrated: a property termed as Layer Consistency (LC) in this paper. Given this similarity of feature distributions, we assume that feature embeddings from different layers would have similar representing power. In this work, Layer Consistency enables us to adopt Transformer-based models in a more efficient manner: the number of Conformer layers in each training iteration could be uniformly sampled and Shallow Layer Inference (SLI) could be applied to reduce the number of layers in inference stage. In experiments, our models are trained with LibriSpeech dataset and then evaluated on both phone classification and Speech Recognition tasks. We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
Submitted 8 April, 2021;
originally announced May 2021.
-
Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain
Authors:
Rongzhi Gu,
Shi-Xiong Zhang,
Yuexian Zou,
Dong Yu
Abstract:
To date, mainstream target speech separation (TSS) approaches are formulated to estimate the complex ratio mask (cRM) of the target speech in time-frequency domain under supervised deep learning framework. However, the existing deep models for estimating cRM are designed in the way that the real and imaginary parts of the cRM are separately modeled using real-valued training data pairs. The research motivation of this study is to design a deep model that fully exploits the temporal-spectral-spatial information of multi-channel signals for estimating cRM directly and efficiently in complex domain. As a result, a novel TSS network is designed consisting of two modules, a complex neural spatial filter (cNSF) and an MVDR. Essentially, cNSF is a cRM estimation model and an MVDR module is cascaded to the cNSF module to reduce the nonlinear speech distortions introduced by neural network. Specifically, to fit the cRM target, all input features of cNSF are reformulated into complex-valued representations following the supervised learning paradigm. Then, to achieve good hierarchical feature abstraction, a complex deep neural network (cDNN) is delicately designed with U-Net structure. Experiments conducted on simulated multi-channel speech data demonstrate the proposed cNSF outperforms the baseline NSF by 12.1% scale-invariant signal-to-distortion ratio and 33.1% word error rate.
Submitted 26 April, 2021;
originally announced April 2021.
-
Gleipnir: Toward Practical Error Analysis for Quantum Programs (Extended Version)
Authors:
Runzhou Tao,
Yunong Shi,
Jianan Yao,
John Hui,
Frederic T. Chong,
Ronghui Gu
Abstract:
Practical error analysis is essential for the design, optimization, and evaluation of Noisy Intermediate-Scale Quantum (NISQ) computing. However, bounding errors in quantum programs is a grand challenge, because the effects of quantum errors depend on exponentially large quantum states. In this work, we present Gleipnir, a novel methodology toward practically computing verified error bounds in quantum programs. Gleipnir introduces the $(\hat{\rho},\delta)$-diamond norm, an error metric constrained by a quantum predicate consisting of the approximate state $\hat{\rho}$ and its distance $\delta$ to the ideal state $\rho$. This predicate $(\hat{\rho},\delta)$ can be computed adaptively using tensor networks based on the Matrix Product States. Gleipnir features a lightweight logic for reasoning about error bounds in noisy quantum programs, based on the $(\hat{\rho},\delta)$-diamond norm metric. Our experimental results show that Gleipnir is able to efficiently generate tight error bounds for real-world quantum programs with 10 to 100 qubits, and can be used to evaluate the error mitigation performance of quantum compiler transformations.
Submitted 19 April, 2021; v1 submitted 13 April, 2021;
originally announced April 2021.
-
SciviK: A Versatile Framework for Specifying and Verifying Smart Contracts
Authors:
Shaokai Lin,
Xinyuan Sun,
Jianan Yao,
Ronghui Gu
Abstract:
The growing adoption of smart contracts on blockchains poses new security risks that can lead to significant monetary loss, while existing approaches either provide no (or partial) security guarantees for smart contracts or require huge proof effort. To address this challenge, we present SciviK, a versatile framework for specifying and verifying industrial-grade smart contracts. SciviK's versatile approach extends previous efforts with three key contributions: (i) an expressive annotation system enabling built-in directives for vulnerability pattern checking, neural-based loop invariant inference, and the verification of rich properties of real-world smart contracts, (ii) a fine-grained model for the Ethereum Virtual Machine (EVM) that provides low-level execution semantics, and (iii) an IR-level verification framework integrating both SMT solvers and the Coq proof assistant.
We use SciviK to specify and verify security properties for 12 benchmark contracts and a real-world Decentralized Finance (DeFi) smart contract. Among all 158 specified security properties (in six types), 151 properties can be automatically verified within 2 seconds, five properties can be automatically verified after moderate modifications, and two properties are manually proved with around 200 lines of Coq code.
Submitted 3 March, 2021;
originally announced March 2021.
-
Automatic Segmentation of Organs-at-Risk from Head-and-Neck CT using Separable Convolutional Neural Network with Hard-Region-Weighted Loss
Authors:
Wenhui Lei,
Haochen Mei,
Zhengwentai Sun,
Shan Ye,
Ran Gu,
Huan Wang,
Rui Huang,
Shichuan Zhang,
Shaoting Zhang,
Guotai Wang
Abstract:
Nasopharyngeal Carcinoma (NPC) is a leading form of Head-and-Neck (HAN) cancer in the Arctic, China, Southeast Asia, and the Middle East/North Africa. Accurate segmentation of Organs-at-Risk (OAR) from Computed Tomography (CT) images with uncertainty information is critical for effective planning of radiation therapy for NPC treatment. Despite the state-of-the-art performance achieved by Convolutional Neural Networks (CNNs) for automatic segmentation of OARs, existing methods do not provide uncertainty estimation of the segmentation results for treatment planning, and their accuracy is still limited by several factors, including the low contrast of soft tissues in CT, highly imbalanced sizes of OARs, and large inter-slice spacing. To address these problems, we propose a novel framework for accurate OAR segmentation with reliable uncertainty estimation. First, we propose a Segmental Linear Function (SLF) to transform the intensity of CT images, which makes multiple organs more distinguishable than existing methods based on a simple window width/level, which often give better visibility of one organ while hiding the others. Second, to deal with the large inter-slice spacing, we introduce a novel 2.5D network (named 3D-SepNet) specially designed for clinical HAN CT scans with anisotropic spacing. Third, existing hardness-aware loss functions often deal with class-level hardness, whereas our proposed attention to hard voxels (ATH) uses a voxel-level hardness strategy, which is better suited to hard regions even when the corresponding class is easy. Our code is now available at https://github.com/HiLab-git/SepNet.
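As a rough illustration of the intensity transform idea in this abstract, a piecewise (segmental) linear mapping of CT intensities can be sketched as below. This is only a minimal sketch under stated assumptions: the function name, the breakpoints, and the output levels are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np

# Hypothetical breakpoints in Hounsfield units and their target outputs.
# These values are illustrative assumptions, not the paper's settings.
KNOTS_IN = [-500.0, -200.0, 200.0, 1500.0]
KNOTS_OUT = [0.0, 0.2, 0.8, 1.0]


def segmental_linear_transform(hu, knots_in=KNOTS_IN, knots_out=KNOTS_OUT):
    """Map CT intensities through a piecewise linear function.

    Each segment between two consecutive breakpoints gets its own slope,
    so the soft-tissue range can be stretched while air and bone are
    compressed rather than clipped away by a single window.
    Values outside the breakpoint range are clamped to the end outputs.
    """
    hu = np.asarray(hu, dtype=np.float64)
    return np.interp(hu, knots_in, knots_out)


if __name__ == "__main__":
    sample = [-1000.0, -200.0, 0.0, 200.0, 2000.0]
    print(segmental_linear_transform(sample))
```

Compared with a single window width/level, the middle segment here takes up most of the output range, while intensities below and above it still keep some contrast.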
Submitted 3 February, 2021;
originally announced February 2021.