-
A Robust Semantic Communication System for Image
Authors:
Xiang Peng,
Zhi** Qin,
Xiaoming Tao,
Jianhua Lu,
Khaled B. Letaief
Abstract:
Semantic communications have gained significant attention as a promising approach to address the transmission bottleneck, especially with the continuous development of 6G techniques. Distinct from the well investigated physical channel impairments, this paper focuses on semantic impairments in image, particularly those arising from adversarial perturbations. Specifically, we propose a novel metric…
▽ More
Semantic communications have gained significant attention as a promising approach to address the transmission bottleneck, especially with the continuous development of 6G techniques. Distinct from the well investigated physical channel impairments, this paper focuses on semantic impairments in image, particularly those arising from adversarial perturbations. Specifically, we propose a novel metric for quantifying the intensity of semantic impairment and develop a semantic impairment dataset. Furthermore, we introduce a deep learning enabled semantic communication system, termed as DeepSC-RI, to enhance the robustness of image transmission, which incorporates a multi-scale semantic extractor with a dual-branch architecture for extracting semantics with varying granularity, thereby improving the robustness of the system. The fine-grained branch incorporates a semantic importance evaluation module to identify and prioritize crucial semantics, while the coarse-grained branch adopts a hierarchical approach for capturing the robust semantics. These two streams of semantics are seamlessly integrated via an advanced cross-attention-based semantic fusion module. Experimental results demonstrate the superior performance of DeepSC-RI under various levels of semantic impairment intensity.
△ Less
Submitted 14 March, 2024;
originally announced March 2024.
-
Learning to See Through Dazzle
Authors:
Xiaopeng Peng,
Erin F. Fleet,
Abbie T. Watnik,
Grover A. Swartzlander
Abstract:
Machine vision is susceptible to laser dazzle, where intense laser light can blind and distort its perception of the environment through oversaturation or permanent damage to sensor pixels. Here we employ a wavefront-coded phase mask to diffuse the energy of laser light and introduce a sandwich generative adversarial network (SGAN) to restore images from complex image degradations, such as varying…
▽ More
Machine vision is susceptible to laser dazzle, where intense laser light can blind and distort its perception of the environment through oversaturation or permanent damage to sensor pixels. Here we employ a wavefront-coded phase mask to diffuse the energy of laser light and introduce a sandwich generative adversarial network (SGAN) to restore images from complex image degradations, such as varying laser-induced image saturation, mask-induced image blurring, unknown lighting conditions, and various noise corruptions. The SGAN architecture combines discriminative and generative methods by wrap** two GANs around a learnable image deconvolution module. In addition, we make use of Fourier feature representations to reduce the spectral bias of neural networks and improve its learning of high-frequency image details. End-to-end training includes the realistic physics-based synthesis of a large set of training data from publicly available images. We trained the SGAN to suppress the peak laser irradiance as high as $10^6$ times the sensor saturation threshold - the point at which camera sensors may experience damage without the mask. The trained model was evaluated on both a synthetic data set and data collected from the laboratory. The proposed image restoration model quantitatively and qualitatively outperforms state-of-the-art methods for a wide range of scene contents, laser powers, incident laser angles, ambient illumination strengths, and noise characteristics.
△ Less
Submitted 4 March, 2024; v1 submitted 24 February, 2024;
originally announced February 2024.
-
Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control
Authors:
Zhongyu Li,
Xue Bin Peng,
Pieter Abbeel,
Sergey Levine,
Glen Berseth,
Koushil Sreenath
Abstract:
This paper presents a comprehensive study on using deep reinforcement learning (RL) to create dynamic locomotion controllers for bipedal robots. Going beyond focusing on a single locomotion skill, we develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jum** and standing. Our RL-based controller incorporates a n…
▽ More
This paper presents a comprehensive study on using deep reinforcement learning (RL) to create dynamic locomotion controllers for bipedal robots. Going beyond focusing on a single locomotion skill, we develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jum** and standing. Our RL-based controller incorporates a novel dual-history architecture, utilizing both a long-term and short-term input/output (I/O) history of the robot. This control architecture, when trained through the proposed end-to-end RL approach, consistently outperforms other methods across a diverse range of skills in both simulation and the real world.The study also delves into the adaptivity and robustness introduced by the proposed RL system in develo** locomotion controllers. We demonstrate that the proposed architecture can adapt to both time-invariant dynamics shifts and time-variant changes, such as contact events, by effectively using the robot's I/O history. Additionally, we identify task randomization as another key source of robustness, fostering better task generalization and compliance to disturbances. The resulting control policies can be successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work pushes the limits of agility for bipedal robots through extensive real-world experiments. We demonstrate a diverse range of locomotion skills, including: robust standing, versatile walking, fast running with a demonstration of a 400-meter dash, and a diverse set of jum** skills, such as standing long jumps and high jumps.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
Masked Audio Modeling with CLAP and Multi-Objective Learning
Authors:
Yifei Xin,
Xiulian Peng,
Yan Lu
Abstract:
Most existing masked audio modeling (MAM) methods learn audio representations by masking and reconstructing local spectrogram patches. However, the reconstruction loss mainly accounts for the signal-level quality of the reconstructed spectrogram and is still limited in extracting high-level audio semantics. In this paper, we propose to enhance the semantic modeling of MAM by distilling cross-modal…
▽ More
Most existing masked audio modeling (MAM) methods learn audio representations by masking and reconstructing local spectrogram patches. However, the reconstruction loss mainly accounts for the signal-level quality of the reconstructed spectrogram and is still limited in extracting high-level audio semantics. In this paper, we propose to enhance the semantic modeling of MAM by distilling cross-modality knowledge from contrastive language-audio pretraining (CLAP) representations for both masked and unmasked regions (MAM-CLAP) and leveraging a multi-objective learning strategy with a supervised classification branch (SupMAM), thereby providing more semantic knowledge for MAM and enabling it to effectively learn global features from labels. Experiments show that our methods significantly improve the performance on multiple downstream tasks. Furthermore, by combining our MAM-CLAP with SupMAM, we can achieve new state-of-the-art results on various audio and speech classification tasks, exceeding previous self-supervised learning and supervised pretraining methods.
△ Less
Submitted 29 January, 2024;
originally announced January 2024.
-
CycLight: learning traffic signal cooperation with a cycle-level strategy
Authors:
Gengyue Han,
Xiaohan Liu,
Xianyue Peng,
Hao Wang,
Yu Han
Abstract:
This study introduces CycLight, a novel cycle-level deep reinforcement learning (RL) approach for network-level adaptive traffic signal control (NATSC) systems. Unlike most traditional RL-based traffic controllers that focus on step-by-step decision making, CycLight adopts a cycle-level strategy, optimizing cycle length and splits simultaneously using Parameterized Deep Q-Networks (PDQN) algorithm…
▽ More
This study introduces CycLight, a novel cycle-level deep reinforcement learning (RL) approach for network-level adaptive traffic signal control (NATSC) systems. Unlike most traditional RL-based traffic controllers that focus on step-by-step decision making, CycLight adopts a cycle-level strategy, optimizing cycle length and splits simultaneously using Parameterized Deep Q-Networks (PDQN) algorithm. This cycle-level approach effectively reduces the computational burden associated with frequent data communication, meanwhile enhancing the practicality and safety of real-world applications. A decentralized framework is formulated for multi-agent cooperation, while attention mechanism is integrated to accurately assess the impact of the surroundings on the current intersection. CycLight is tested in a large synthetic traffic grid using the microscopic traffic simulation tool, SUMO. Experimental results not only demonstrate the superiority of CycLight over other state-of-the-art approaches but also showcase its robustness against information transmission delays.
△ Less
Submitted 16 January, 2024;
originally announced January 2024.
-
Selective Run-Length Encoding
Authors:
Xutan Peng,
Yi Zhang,
Dejia Peng,
Jiafa Zhu
Abstract:
Run-Length Encoding (RLE) is one of the most fundamental tools in data compression. However, its compression power drops significantly if there lacks consecutive elements in the sequence. In extreme cases, the output of the encoder may require more space than the input (aka size inflation). To alleviate this issue, using combinatorics, we quantify RLE's space savings for a given input distribution…
▽ More
Run-Length Encoding (RLE) is one of the most fundamental tools in data compression. However, its compression power drops significantly if there lacks consecutive elements in the sequence. In extreme cases, the output of the encoder may require more space than the input (aka size inflation). To alleviate this issue, using combinatorics, we quantify RLE's space savings for a given input distribution. With this insight, we develop the first algorithm that automatically identifies suitable symbols, then selectively encodes these symbols with RLE while directly storing the others without RLE. Through experiments on real-world datasets of various modalities, we empirically validate that our method, which maintains RLE's efficiency advantage, can effectively mitigate the size inflation dilemma.
△ Less
Submitted 28 December, 2023;
originally announced December 2023.
-
Q-learning Based Optimal False Data Injection Attack on Probabilistic Boolean Control Networks
Authors:
Xianlun Peng,
Yang Tang,
Fangfei Li,
Yang Liu
Abstract:
In this paper, we present a reinforcement learning (RL) method for solving optimal false data injection attack problems in probabilistic Boolean control networks (PBCNs) where the attacker lacks knowledge of the system model. Specifically, we employ a Q-learning (QL) algorithm to address this problem. We then propose an improved QL algorithm that not only enhances learning efficiency but also obta…
▽ More
In this paper, we present a reinforcement learning (RL) method for solving optimal false data injection attack problems in probabilistic Boolean control networks (PBCNs) where the attacker lacks knowledge of the system model. Specifically, we employ a Q-learning (QL) algorithm to address this problem. We then propose an improved QL algorithm that not only enhances learning efficiency but also obtains optimal attack strategies for large-scale PBCNs that the standard QL algorithm cannot handle. Finally, we verify the effectiveness of our proposed approach by considering two attacked PBCNs, including a 10-node network and a 28-node network.
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
Joint Optimization of Traffic Signal Control and Vehicle Routing in Signalized Road Networks using Multi-Agent Deep Reinforcement Learning
Authors:
Xianyue Peng,
Hang Gao,
Gengyue Han,
Hao Wang,
Michael Zhang
Abstract:
Urban traffic congestion is a critical predicament that plagues modern road networks. To alleviate this issue and enhance traffic efficiency, traffic signal control and vehicle routing have proven to be effective measures. In this paper, we propose a joint optimization approach for traffic signal control and vehicle routing in signalized road networks. The objective is to enhance network performan…
▽ More
Urban traffic congestion is a critical predicament that plagues modern road networks. To alleviate this issue and enhance traffic efficiency, traffic signal control and vehicle routing have proven to be effective measures. In this paper, we propose a joint optimization approach for traffic signal control and vehicle routing in signalized road networks. The objective is to enhance network performance by simultaneously controlling signal timings and route choices using Multi-Agent Deep Reinforcement Learning (MADRL). Signal control agents (SAs) are employed to establish signal timings at intersections, whereas vehicle routing agents (RAs) are responsible for selecting vehicle routes. By establishing relevance between agents and enabling them to share observations and rewards, interaction and cooperation among agents are fostered, which enhances individual training. The Multi-Agent Advantage Actor-Critic algorithm is used to handle multi-agent environments, and Deep Neural Network (DNN) structures are designed to facilitate the algorithm's convergence. Notably, our work is the first to utilize MADRL in determining the optimal joint policy for signal control and vehicle routing. Numerical experiments conducted on the modified Sioux network demonstrate that our integration of signal control and vehicle routing outperforms controlling signal timings or vehicles' routes alone in enhancing traffic efficiency.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Low-latency Speech Enhancement via Speech Token Generation
Authors:
Huaying Xue,
Xiulian Peng,
Yan Lu
Abstract:
Existing deep learning based speech enhancement mainly employ a data-driven approach, which leverage large amounts of data with a variety of noise types to achieve noise removal from noisy signal. However, the high dependence on the data limits its generalization on the unseen complex noises in real-life environment. In this paper, we focus on the low-latency scenario and regard speech enhancement…
▽ More
Existing deep learning based speech enhancement mainly employ a data-driven approach, which leverage large amounts of data with a variety of noise types to achieve noise removal from noisy signal. However, the high dependence on the data limits its generalization on the unseen complex noises in real-life environment. In this paper, we focus on the low-latency scenario and regard speech enhancement as a speech generation problem conditioned on the noisy signal, where we generate clean speech instead of identifying and removing noises. Specifically, we propose a conditional generative framework for speech enhancement, which models clean speech by acoustic codes of a neural speech codec and generates the speech codes conditioned on past noisy frames in an auto-regressive way. Moreover, we propose an explicit-alignment approach to align noisy frames with the generated speech tokens to improve the robustness and scalability to different input lengths. Different from other methods that leverage multiple stages to generate speech codes, we leverage a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test set show its superiority over data-driven approaches in terms of noise robustness and temporal speech coherence.
△ Less
Submitted 23 January, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
MAD: Meta Adversarial Defense Benchmark
Authors:
X. Peng,
D. Zhou,
G. Sun,
J. Shi,
L. Wu
Abstract:
Adversarial training (AT) is a prominent technique employed by deep learning models to defend against adversarial attacks, and to some extent, enhance model robustness. However, there are three main drawbacks of the existing AT-based defense methods: expensive computational cost, low generalization ability, and the dilemma between the original model and the defense model. To this end, we propose a…
▽ More
Adversarial training (AT) is a prominent technique employed by deep learning models to defend against adversarial attacks, and to some extent, enhance model robustness. However, there are three main drawbacks of the existing AT-based defense methods: expensive computational cost, low generalization ability, and the dilemma between the original model and the defense model. To this end, we propose a novel benchmark called meta adversarial defense (MAD). The MAD benchmark consists of two MAD datasets, along with a MAD evaluation protocol. The two large-scale MAD datasets were generated through experiments using 30 kinds of attacks on MNIST and CIFAR-10 datasets. In addition, we introduce a meta-learning based adversarial training (Meta-AT) algorithm as the baseline, which features high robustness to unseen adversarial attacks through few-shot learning. Experimental results demonstrate the effectiveness of our Meta-AT algorithm compared to the state-of-the-art methods. Furthermore, the model after Meta-AT maintains a relatively high clean-samples classification accuracy (CCA). It is worth noting that Meta-AT addresses all three aforementioned limitations, leading to substantial improvements. This benchmark ultimately achieved breakthroughs in investigating the transferability of adversarial defense methods to new attacks and the ability to learn from a limited number of adversarial examples. Our codes and attacked datasets address will be available at https://github.com/PXX1110/Meta_AT.
△ Less
Submitted 18 September, 2023;
originally announced September 2023.
-
Sensiverse: A dataset for ISAC study
Authors:
Jia** Luo,
Baojian Zhou,
Yang Yu,
** Zhang,
Xiaohui Peng,
Jianglei Ma,
Peiying Zhu,
Jianmin Lu,
Wen Tong
Abstract:
In order to address the lack of applicable channel models for ISAC research and evaluation, we release Sensiverse, a dataset that can be used for ISAC research. In this paper, we present the method of generating Sensiverse, including the acquisition and formatting of the 3D scene models, the generation of the channel data and associations with Tx/Rx deployment. The file structure and usage of the…
▽ More
In order to address the lack of applicable channel models for ISAC research and evaluation, we release Sensiverse, a dataset that can be used for ISAC research. In this paper, we present the method of generating Sensiverse, including the acquisition and formatting of the 3D scene models, the generation of the channel data and associations with Tx/Rx deployment. The file structure and usage of the dataset are also described, and finally the use of the dataset is illustrated with examples through the evaluation of use cases such as 3D environment reconstruction and moving targets.
△ Less
Submitted 26 August, 2023;
originally announced August 2023.
-
Polymerized Feature-based Domain Adaptation for Cervical Cancer Dose Map Prediction
Authors:
Jie Zeng,
Zeyu Han,
Xingchen Peng,
Jianghong Xiao,
Peng Wang,
Yan Wang
Abstract:
Recently, deep learning (DL) has automated and accelerated the clinical radiation therapy (RT) planning significantly by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable for cervical cancer where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from anoth…
▽ More
Recently, deep learning (DL) has automated and accelerated the clinical radiation therapy (RT) planning significantly by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable for cervical cancer where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from another cancer, i.e., rectum cancer, which has the same scanning area and more clinically available data, to improve the dose map prediction performance for cervical cancer through domain adaptation. In order to close the congenital domain gap between the source (i.e., rectum cancer) and the target (i.e., cervical cancer) domains, we develop an effective Transformer-based polymerized feature module (PFM), which can generate an optimal polymerized feature distribution to smoothly align the two input distributions. Experimental results on two in-house clinical datasets demonstrate the superiority of the proposed method compared with state-of-the-art methods.
△ Less
Submitted 19 August, 2023;
originally announced August 2023.
-
High-Dimensional MR Reconstruction Integrating Subspace and Adaptive Generative Models
Authors:
Ruiyang Zhao,
Xi Peng,
Varun A. Kelkar,
Mark A. Anastasio,
Fan Lam
Abstract:
We present a novel method that integrates subspace modeling with an adaptive generative image prior for high-dimensional MR image reconstruction. The subspace model imposes an explicit low-dimensional representation of the high-dimensional images, while the generative image prior serves as a spatial constraint on the "contrast-weighted" images or the spatial coefficients of the subspace model. A f…
▽ More
We present a novel method that integrates subspace modeling with an adaptive generative image prior for high-dimensional MR image reconstruction. The subspace model imposes an explicit low-dimensional representation of the high-dimensional images, while the generative image prior serves as a spatial constraint on the "contrast-weighted" images or the spatial coefficients of the subspace model. A formulation was introduced to synergize these two components with complimentary regularization such as joint sparsity. A special pretraining plus subject-specific network adaptation strategy was proposed to construct an accurate generative-model-based representation for images with varying contrasts, validated by experimental data. An iterative algorithm was introduced to jointly update the subspace coefficients and the multiresolution latent space of the generative image model that leveraged a recently developed intermediate layer optimization technique for network inversion. We evaluated the utility of the proposed method in two high-dimensional imaging applications: accelerated MR parameter map** and high-resolution MRSI. Improved performance over state-of-the-art subspace-based methods was demonstrated in both cases. Our work demonstrated the potential of integrating data-driven and adaptive generative models with low-dimensional representation for high-dimensional imaging problems.
△ Less
Submitted 16 June, 2023; v1 submitted 14 June, 2023;
originally announced June 2023.
-
ABC-KD: Attention-Based-Compression Knowledge Distillation for Deep Learning-Based Noise Suppression
Authors:
Yixin Wan,
Yuan Zhou,
Xiulian Peng,
Kai-Wei Chang,
Yan Lu
Abstract:
Noise suppression (NS) models have been widely applied to enhance speech quality. Recently, Deep Learning-Based NS, which we denote as Deep Noise Suppression (DNS), became the mainstream NS method due to its excelling performance over traditional ones. However, DNS models face 2 major challenges for supporting the real-world applications. First, high-performing DNS models are usually large in size…
▽ More
Noise suppression (NS) models have been widely applied to enhance speech quality. Recently, Deep Learning-Based NS, which we denote as Deep Noise Suppression (DNS), became the mainstream NS method due to its excelling performance over traditional ones. However, DNS models face 2 major challenges for supporting the real-world applications. First, high-performing DNS models are usually large in size, causing deployment difficulties. Second, DNS models require extensive training data, including noisy audios as inputs and clean audios as labels. It is often difficult to obtain clean labels for training DNS models. We propose the use of knowledge distillation (KD) to resolve both challenges. Our study serves 2 main purposes. To begin with, we are among the first to comprehensively investigate mainstream KD techniques on DNS models to resolve the two challenges. Furthermore, we propose a novel Attention-Based-Compression KD method that outperforms all investigated mainstream KD frameworks on DNS task.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Omni-Line-of-Sight Imaging for Holistic Shape Reconstruction
Authors:
Binbin Huang,
Xingyue Peng,
Siyuan Shen,
Suan Xia,
Ruiqian Li,
Yanhua Yu,
Yuehan Wang,
Shenghua Gao,
Wenzheng Chen,
Shiying Li,
**gyi Yu
Abstract:
We introduce Omni-LOS, a neural computational imaging method for conducting holistic shape reconstruction (HSR) of complex objects utilizing a Single-Photon Avalanche Diode (SPAD)-based time-of-flight sensor. As illustrated in Fig. 1, our method enables new capabilities to reconstruct near-$360^\circ$ surrounding geometry of an object from a single scan spot. In such a scenario, traditional line-o…
▽ More
We introduce Omni-LOS, a neural computational imaging method for conducting holistic shape reconstruction (HSR) of complex objects utilizing a Single-Photon Avalanche Diode (SPAD)-based time-of-flight sensor. As illustrated in Fig. 1, our method enables new capabilities to reconstruct near-$360^\circ$ surrounding geometry of an object from a single scan spot. In such a scenario, traditional line-of-sight (LOS) imaging methods only see the front part of the object and typically fail to recover the occluded back regions. Inspired by recent advances of non-line-of-sight (NLOS) imaging techniques which have demonstrated great power to reconstruct occluded objects, Omni-LOS marries LOS and NLOS together, leveraging their complementary advantages to jointly recover the holistic shape of the object from a single scan position. The core of our method is to put the object nearby diffuse walls and augment the LOS scan in the front view with the NLOS scans from the surrounding walls, which serve as virtual ``mirrors'' to trap lights toward the object. Instead of separately recovering the LOS and NLOS signals, we adopt an implicit neural network to represent the object, analogous to NeRF and NeTF. While transients are measured along straight rays in LOS but over the spherical wavefronts in NLOS, we derive differentiable ray propagation models to simultaneously model both types of transient measurements so that the NLOS reconstruction also takes into account the direct LOS measurements and vice versa. We further develop a proof-of-concept Omni-LOS hardware prototype for real-world validation. Comprehensive experiments on various wall settings demonstrate that Omni-LOS successfully resolves shape ambiguities caused by occlusions, achieves high-fidelity 3D scan quality, and manages to recover objects of various scales and complexity.
△ Less
Submitted 21 April, 2023;
originally announced April 2023.
-
Contrast-PLC: Contrastive Learning for Packet Loss Concealment
Authors:
Huaying Xue,
Xiulian Peng,
Yan Lu
Abstract:
Packet loss concealment (PLC) is challenging in concealing missing contents both plausibly and naturally when there are only limited available context to use. Recently deep-learning based PLC algorithms have demonstrated their superiority over traditional counterparts; but their concealment ability is still mostly limited to a maximum of 120ms loss. Even with strong GAN-based generative models, it…
▽ More
Packet loss concealment (PLC) is challenging in concealing missing contents both plausibly and naturally when there are only limited available context to use. Recently deep-learning based PLC algorithms have demonstrated their superiority over traditional counterparts; but their concealment ability is still mostly limited to a maximum of 120ms loss. Even with strong GAN-based generative models, it is still very challenging to predict long burst losses that could happen within/in-between phonemes. In this paper, we propose to use contrastive learning to learn a loss-robust semantic representation for PLC. A hybrid neural PLC architecture combining the semantic prediction and GAN-based generative model is designed to verify its effectiveness. Results on the blind test set of Interspeech2022 PLC Challenge show its superiority over commonly used UNet-style framework and the one without contrastive learning, especially for the longer burst loss at (120, 220] ms.
△ Less
Submitted 26 February, 2023;
originally announced February 2023.
-
Time-Variance Aware Real-Time Speech Enhancement
Authors:
Chengyu Zheng,
Yuan Zhou,
Xiulian Peng,
Yuan Zhang,
Yan Lu
Abstract:
Time-variant factors often occur in real-world full-duplex communication applications. Some of them are caused by the complex environment such as non-stationary environmental noises and varying acoustic path while some are caused by the communication system such as the dynamic delay between the far-end and near-end signals. Current end-to-end deep neural network (DNN) based methods usually model t…
▽ More
Time-variant factors often occur in real-world full-duplex communication applications. Some of them are caused by the complex environment such as non-stationary environmental noises and varying acoustic path while some are caused by the communication system such as the dynamic delay between the far-end and near-end signals. Current end-to-end deep neural network (DNN) based methods usually model the time-variant components implicitly and can hardly handle the unpredictable time-variance in real-time speech enhancement. To explicitly capture the time-variant components, we propose a dynamic kernel generation (DKG) module that can be introduced as a learnable plug-in to a DNN-based end-to-end pipeline. Specifically, the DKG module generates a convolutional kernel regarding to each input audio frame, so that the DNN model is able to dynamically adjust its weights according to the input signal during inference. Experimental results verify that DKG module improves the performance of the model under time-variant scenarios, in the joint acoustic echo cancellation (AEC) and deep noise suppression (DNS) tasks.
△ Less
Submitted 25 February, 2023;
originally announced February 2023.
-
Improving Speech Enhancement via Event-based Query
Authors:
Yifei Xin,
Xiulian Peng,
Yan Lu
Abstract:
Existing deep learning based speech enhancement (SE) methods either use blind end-to-end training or explicitly incorporate speaker embedding or phonetic information into the SE network to enhance speech quality. In this paper, we perceive speech and noises as different types of sound events and propose an event-based query method for SE. Specifically, representative speech embeddings that can dis…
▽ More
Existing deep learning based speech enhancement (SE) methods either use blind end-to-end training or explicitly incorporate speaker embedding or phonetic information into the SE network to enhance speech quality. In this paper, we perceive speech and noises as different types of sound events and propose an event-based query method for SE. Specifically, representative speech embeddings that can discriminate speech with noises are first pre-trained with the sound event detection (SED) task. The embeddings are then clustered into fixed golden speech queries to assist the SE network to enhance the speech from noisy audio. The golden speech queries can be obtained offline and generalizable to different SE datasets and networks. Therefore, little extra complexity is introduced and no enrollment is needed for each speaker. Experimental results show that the proposed method yields significant gains compared with baselines and the golden queries are well generalized to different datasets.
△ Less
Submitted 24 February, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
DasFormer: Deep Alternating Spectrogram Transformer for Multi/Single-Channel Speech Separation
Authors:
Shuo Wang,
Xiangyu Kong,
Xiulian Peng,
Mahmood Movassagh,
Vinod Prakash,
Yan Lu
Abstract:
For the task of speech separation, previous study usually treats multi-channel and single-channel scenarios as two research tracks with specialized solutions developed respectively. Instead, we propose a simple and unified architecture - DasFormer (Deep alternating spectrogram transFormer) to handle both of them in the challenging reverberant environments. Unlike frame-wise sequence modeling, each…
▽ More
For the task of speech separation, previous study usually treats multi-channel and single-channel scenarios as two research tracks with specialized solutions developed respectively. Instead, we propose a simple and unified architecture - DasFormer (Deep alternating spectrogram transFormer) to handle both of them in the challenging reverberant environments. Unlike frame-wise sequence modeling, each TF-bin in the spectrogram is assigned with an embedding encoding spectral and spatial information. With such input, DasFormer is then formed by multiple repetition of simple blocks each of which integrates 1) two multi-head self-attention (MHSA) modules alternately processing within each frequency bin & temporal frame of the spectrogram 2) MBConv before each MHSA for modeling local features on the spectrogram. Experiments show that DasFormer has a powerful ability to model the time-frequency representation, whose performance far exceeds the current SOTA models in multi-channel speech separation, and also achieves single-channel SOTA in the more challenging yet realistic reverberation scenario.
△ Less
Submitted 14 March, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
Real-time speech enhancement with dynamic attention span
Authors:
Chengyu Zheng,
Yuan Zhou,
Xiulian Peng,
Yuan Zhang,
Yan Lu
Abstract:
For real-time speech enhancement (SE) including noise suppression, dereverberation and acoustic echo cancellation, the time-variance of the audio signals becomes a severe challenge. The causality and memory usage limit that only the historical information can be used for the system to capture the time-variant characteristics. We propose to adaptively change the receptive field according to the inp…
▽ More
For real-time speech enhancement (SE) including noise suppression, dereverberation and acoustic echo cancellation, the time-variance of the audio signals becomes a severe challenge. The causality and memory usage limit that only the historical information can be used for the system to capture the time-variant characteristics. We propose to adaptively change the receptive field according to the input signal in deep neural network based SE model. Specifically, in an encoder-decoder framework, a dynamic attention span mechanism is introduced to all the attention modules for controlling the size of historical content used for processing the current frame. Experimental results verify that this dynamic mechanism can better track time-variant factors and capture speech-related characteristics, benefiting to both interference removing and speech quality retaining.
△ Less
Submitted 20 February, 2023;
originally announced February 2023.
-
Robust and Versatile Bipedal Jum** Control through Reinforcement Learning
Authors:
Zhongyu Li,
Xue Bin Peng,
Pieter Abbeel,
Sergey Levine,
Glen Berseth,
Koushil Sreenath
Abstract:
This work aims to push the limits of agility for bipedal robots by enabling a torque-controlled bipedal robot to perform robust and versatile dynamic jumps in the real world. We present a reinforcement learning framework for training a robot to accomplish a large variety of jum** tasks, such as jum** to different locations and directions. To improve performance on these challenging tasks, we d…
▽ More
This work aims to push the limits of agility for bipedal robots by enabling a torque-controlled bipedal robot to perform robust and versatile dynamic jumps in the real world. We present a reinforcement learning framework for training a robot to accomplish a large variety of jum** tasks, such as jum** to different locations and directions. To improve performance on these challenging tasks, we develop a new policy structure that encodes the robot's long-term input/output (I/O) history while also providing direct access to a short-term I/O history. In order to train a versatile jum** policy, we utilize a multi-stage training scheme that includes different training stages for different objectives. After multi-stage training, the policy can be directly transferred to a real bipedal Cassie robot. Training on different tasks and exploring more diverse scenarios lead to highly robust policies that can exploit the diverse set of learned maneuvers to recover from perturbations or poor landings during real-world deployment. Such robustness in the proposed policy enables Cassie to succeed in completing a variety of challenging jump tasks in the real world, such as standing long jumps, jum** onto elevated platforms, and multi-axes jumps.
△ Less
Submitted 31 May, 2023; v1 submitted 18 February, 2023;
originally announced February 2023.
-
Relationship Quantification of Image Degradations
Authors:
Wenxin Wang,
Boyun Li,
Yuanbiao Gou,
Peng Hu,
Wangmeng Zuo,
Xi Peng
Abstract:
In this paper, we study two challenging but less-touched problems in image restoration, namely, i) how to quantify the relationship between image degradations and ii) how to improve the performance of a specific restoration task using the quantified relationship. To tackle the first challenge, we proposed a Degradation Relationship Index (DRI) which is defined as the mean drop rate difference in t…
▽ More
In this paper, we study two challenging but less-touched problems in image restoration, namely, i) how to quantify the relationship between image degradations and ii) how to improve the performance of a specific restoration task using the quantified relationship. To tackle the first challenge, we proposed a Degradation Relationship Index (DRI) which is defined as the mean drop rate difference in the validation loss between two models which are respectively trained using the anchor degradation and the mixture of the anchor and the auxiliary degradations. Through quantifying the degradation relationship using DRI, we reveal that i) a positive DRI always predicts performance improvement by using the specific degradation as an auxiliary to train models; ii) the degradation proportion is crucial to the image restoration performance. In other words, the restoration performance is improved only if the anchor and the auxiliary degradations are mixed with an appropriate proportion. Based on the observations, we further propose a simple but effective method (dubbed DPD) to estimate whether the given degradation combinations could improve the performance on the anchor degradation with the assistance of the auxiliary degradation. Extensive experimental results verify the effectiveness of our method in dehazing, denoising, deraining, and desnowing. The code will be released after acceptance.
△ Less
Submitted 5 August, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Disentangled Feature Learning for Real-Time Neural Speech Coding
Authors:
Xue Jiang,
Xiulian Peng,
Yuan Zhang,
Yan Lu
Abstract:
Recently end-to-end neural audio/speech coding has shown its great potential to outperform traditional signal analysis based audio codecs. This is mostly achieved by following the VQ-VAE paradigm where blind features are learned, vector-quantized and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specificall…
▽ More
Recently end-to-end neural audio/speech coding has shown its great potential to outperform traditional signal analysis based audio codecs. This is mostly achieved by following the VQ-VAE paradigm where blind features are learned, vector-quantized and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, more global-like speaker identity and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among different features but also provides the flexibility to do audio editing in embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency and we find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models with far less parameters and low latency, showing the potential of our neural coding framework.
△ Less
Submitted 24 February, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Creating a Dynamic Quadrupedal Robotic Goalkeeper with Reinforcement Learning
Authors:
Xiaoyu Huang,
Zhongyu Li,
Yanzhen Xiang,
Yiming Ni,
Yufeng Chi,
Yunhao Li,
Lizhi Yang,
Xue Bin Peng,
Koushil Sreenath
Abstract:
We present a reinforcement learning (RL) framework that enables quadrupedal robots to perform soccer goalkee** tasks in the real world. Soccer goalkee** using quadrupeds is a challenging problem, that combines highly dynamic locomotion with precise and fast non-prehensile object (ball) manipulation. The robot needs to react to and intercept a potentially flying ball using dynamic locomotion ma…
▽ More
We present a reinforcement learning (RL) framework that enables quadrupedal robots to perform soccer goalkee** tasks in the real world. Soccer goalkee** using quadrupeds is a challenging problem, that combines highly dynamic locomotion with precise and fast non-prehensile object (ball) manipulation. The robot needs to react to and intercept a potentially flying ball using dynamic locomotion maneuvers in a very short amount of time, usually less than one second. In this paper, we propose to address this problem using a hierarchical model-free RL framework. The first component of the framework contains multiple control policies for distinct locomotion skills, which can be used to cover different regions of the goal. Each control policy enables the robot to track random parametric end-effector trajectories while performing one specific locomotion skill, such as jump, dive, and sidestep. These skills are then utilized by the second part of the framework which is a high-level planner to determine a desired skill and end-effector trajectory in order to intercept a ball flying to different regions of the goal. We deploy the proposed framework on a Mini Cheetah quadrupedal robot and demonstrate the effectiveness of our framework for various agile interceptions of a fast-moving ball in the real world.
△ Less
Submitted 10 October, 2022;
originally announced October 2022.
-
Hierarchical Reinforcement Learning for Precise Soccer Shooting Skills using a Quadrupedal Robot
Authors:
Yandong Ji,
Zhongyu Li,
Yinan Sun,
Xue Bin Peng,
Sergey Levine,
Glen Berseth,
Koushil Sreenath
Abstract:
We address the problem of enabling quadrupedal robots to perform precise shooting skills in the real world using reinforcement learning. Develo** algorithms to enable a legged robot to shoot a soccer ball to a given target is a challenging problem that combines robot motion control and planning into one task. To solve this problem, we need to consider the dynamics limitation and motion stability…
▽ More
We address the problem of enabling quadrupedal robots to perform precise shooting skills in the real world using reinforcement learning. Develo** algorithms to enable a legged robot to shoot a soccer ball to a given target is a challenging problem that combines robot motion control and planning into one task. To solve this problem, we need to consider the dynamics limitation and motion stability during the control of a dynamic legged robot. Moreover, we need to consider motion planning to shoot the hard-to-model deformable ball rolling on the ground with uncertain friction to a desired location. In this paper, we propose a hierarchical framework that leverages deep reinforcement learning to train (a) a robust motion control policy that can track arbitrary motions and (b) a planning policy to decide the desired kicking motion to shoot a soccer ball to a target. We deploy the proposed framework on an A1 quadrupedal robot and enable it to accurately shoot the ball to random targets in the real world.
△ Less
Submitted 1 August, 2022;
originally announced August 2022.
-
Latent-Domain Predictive Neural Speech Coding
Authors:
Xue Jiang,
Xiulian Peng,
Huaying Xue,
Yuan Zhang,
Yan Lu
Abstract:
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, by which there are still temporal redundancies within encoded features. This paper introduces latent-domai…
▽ More
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, by which there are still temporal redundancies within encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft map** and Gumbel-Softmax is proposed to better model the latent distributions with rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
△ Less
Submitted 25 May, 2023; v1 submitted 17 July, 2022;
originally announced July 2022.
-
Cross-Scale Vector Quantization for Scalable Neural Speech Coding
Authors:
Xue Jiang,
Xiulian Peng,
Huaying Xue,
Yuan Zhang,
Yan Lu
Abstract:
Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so different models need to be trained for each target bitrate, which increases the memory footprint at the sender and the receiver side and transcoding is often needed to support multiple receivers. In this paper, we introduce a…
▽ More
Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so different models need to be trained for each target bitrate, which increases the memory footprint at the sender and the receiver side and transcoding is often needed to support multiple receivers. In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement. In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and progressively improves the quality as more bits are available. The proposed CSVQ scheme can be flexibly applied to any neural audio coding network with a mirrored auto-encoder structure to achieve bitrate scalability. Subjective results show that the proposed scheme outperforms the classical residual VQ (RVQ) with scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps and Lyra at 3kbps and it could provide a graceful quality boost with bitrate increase.
△ Less
Submitted 6 July, 2022;
originally announced July 2022.
-
Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation
Authors:
Xiaoyu Wang,
Xiangyu Kong,
Xiulian Peng,
Yan Lu
Abstract:
In this paper we propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation. Although previous efforts have been extensively put on combining audio and visual modalities, most of them solely adopt a straightforward concatenation of audio and visual features. To exploit the real useful information behind these two modalities, we define two key…
▽ More
In this paper we propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation. Although previous efforts have been extensively put on combining audio and visual modalities, most of them solely adopt a straightforward concatenation of audio and visual features. To exploit the real useful information behind these two modalities, we define two key correlations which are: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation (between phoneme and lip motion). These two correlations together comprise the complete information, which shows a certain superiority in separating target speaker's voice especially in some hard cases, such as the same gender or similar content. For implementation, contrastive learning or adversarial training approach is applied to maximize these two correlations. Both of them work well, while adversarial training shows its advantage by avoiding some limitations of contrastive learning. Compared with previous research, our solution demonstrates clear improvement on experimental metrics without additional complexity. Further analysis reveals the validity of the proposed architecture and its good potential for future extension.
△ Less
Submitted 4 July, 2022;
originally announced July 2022.
-
Towards Error-Resilient Neural Speech Coding
Authors:
Huaying Xue,
Xiulian Peng,
Xue Jiang,
Yan Lu
Abstract:
Neural audio coding has shown very promising results recently in the literature to largely outperform traditional codecs but limited attention has been paid on its error resilience. Neural codecs trained considering only source coding tend to be extremely sensitive to channel noises, especially in wireless channels with high error rate. In this paper, we investigate how to elevate the error resili…
▽ More
Neural audio coding has shown very promising results recently in the literature to largely outperform traditional codecs but limited attention has been paid on its error resilience. Neural codecs trained considering only source coding tend to be extremely sensitive to channel noises, especially in wireless channels with high error rate. In this paper, we investigate how to elevate the error resilience of neural audio codecs for packet losses that often occur during real-time communications. We propose a feature-domain packet loss concealment algorithm (FD-PLC) for real-time neural speech coding. Specifically, we introduce a self-attention-based module on the received latent features to recover lost frames in the feature domain before the decoder. A hybrid segment-level and frame-level frequency-domain discriminator is employed to guide the network to focus on both the generative quality of lost frames and the continuity with neighbouring frames. Experimental results on several error patterns show that the proposed scheme can achieve better robustness compared with the corresponding error-free and error-resilient baselines. We also show that feature-domain concealment is superior to waveform-domain counterpart as post-processing.
△ Less
Submitted 3 July, 2022;
originally announced July 2022.
-
An Indoor Environment Sensing and Localization System via mmWave Phased Array
Authors:
Yifei Sun,
Jie Li,
Tong Zhang,
Rui Wang,
Xiaohui Peng,
Tony Xiao Han,
Haisheng Tan
Abstract:
An indoor layout sensing and localization system in 60GHz millimeter wave (mmWave) band, named mmReality, is elaborated in this paper. The mmReality system consists of one transmitter and one mobile receiver, each with a phased array and a single radio frequency (RF) chain. To reconstruct the room layout, the pilot signal is delivered from the transmitter to the receiver via different pairs of tra…
▽ More
An indoor layout sensing and localization system in 60GHz millimeter wave (mmWave) band, named mmReality, is elaborated in this paper. The mmReality system consists of one transmitter and one mobile receiver, each with a phased array and a single radio frequency (RF) chain. To reconstruct the room layout, the pilot signal is delivered from the transmitter to the receiver via different pairs of transmission and receiving beams, so that the signals at all antenna elements can be resolved. Then, the spatial smoothing and two-dimensional multiple signal classification (MUSIC) algorithm is applied to detect the angle-of-arrival (AoAs) and angle-of-departure (AoDs) of the rays from the transmitter to the receiver. Moreover, the technique of multi-carrier ranging is adopted to measure the distance of each propagation path. Synthesizing the above geometrical parameters, the location of receiver relative to the transmitter can be pinpointed, both line-of-sight (LoS) and non-line-of-sight (NLoS) paths can also be determined. Therefore, the room layout can be reconstructed by moving the receiver and repeating the above measurement in different locations of the room. At the end, we show that the reconstructed room layout can be utilized to locate a mobile device according to its AoA spectrum, even with single access point.
△ Less
Submitted 9 January, 2023; v1 submitted 7 June, 2022;
originally announced June 2022.
-
A Robust Deep Learning Enabled Semantic Communication System for Text
Authors:
Xiang Peng,
Zhi** Qin,
Danlan Huang,
Xiaoming Tao,
Jianhua Lu,
Guangyi Liu,
Chengkang Pan
Abstract:
With the advent of the 6G era, the concept of semantic communication has attracted increasing attention. Compared with conventional communication systems, semantic communication systems are not only affected by physical noise existing in the wireless communication environment, e.g., additional white Gaussian noise, but also by semantic noise due to the source and the nature of deep learning-based…
▽ More
With the advent of the 6G era, the concept of semantic communication has attracted increasing attention. Compared with conventional communication systems, semantic communication systems are not only affected by physical noise existing in the wireless communication environment, e.g., additional white Gaussian noise, but also by semantic noise due to the source and the nature of deep learning-based systems. In this paper, we elaborate on the mechanism of semantic noise. In particular, we categorize semantic noise into two categories: literal semantic noise and adversarial semantic noise. The former is caused by written errors or expression ambiguity, while the latter is caused by perturbations or attacks added to the embedding layer via the semantic channel. To prevent semantic noise from influencing semantic communication systems, we present a robust deep learning enabled semantic communication system (R-DeepSC) that leverages a calibrated self-attention mechanism and adversarial training to tackle semantic noise. Compared with baseline models that only consider physical noise for text transmission, the proposed R-DeepSC achieves remarkable performance in dealing with semantic noise under different signal-to-noise ratios.
△ Less
Submitted 6 June, 2022;
originally announced June 2022.
-
A Pilot Study of Relating MYCN-Gene Amplification with Neuroblastoma-Patient CT Scans
Authors:
Zihan Zhang,
Xiang Xiang,
Xuehua Peng,
Jianbo Shao
Abstract:
Neuroblastoma is one of the most common cancers in infants, and the initial diagnosis of this disease is difficult. At present, the MYCN gene amplification (MNA) status is detected by invasive pathological examination of tumor samples. This is time-consuming and may have a hidden impact on children. To handle this problem, we adopt multiple machine learning (ML) algorithms to predict the presence…
▽ More
Neuroblastoma is one of the most common cancers in infants, and the initial diagnosis of this disease is difficult. At present, the MYCN gene amplification (MNA) status is detected by invasive pathological examination of tumor samples. This is time-consuming and may have a hidden impact on children. To handle this problem, we adopt multiple machine learning (ML) algorithms to predict the presence or absence of MYCN gene amplification. The dataset is composed of retrospective CT images of 23 neuroblastoma patients. Different from previous work, we develop the algorithm without manually-segmented primary tumors which is time-consuming and not practical. Instead, we only need the coordinate of the center point and the number of tumor slices given by a subspecialty-trained pediatric radiologist. Specifically, CNN-based method uses pre-trained convolutional neural network, and radiomics-based method extracts radiomics features. Our results show that CNN-based method outperforms the radiomics-based method.
△ Less
Submitted 21 May, 2022;
originally announced May 2022.
-
Multi-Scale Adaptive Network for Single Image Denoising
Authors:
Yuanbiao Gou,
Peng Hu,
Jiancheng Lv,
Joey Tianyi Zhou,
Xi Peng
Abstract:
Multi-scale architectures have shown effectiveness in a variety of tasks thanks to appealing cross-scale complementarity. However, existing architectures treat different scale features equally without considering the scale-specific characteristics, \textit{i.e.}, the within-scale characteristics are ignored in the architecture design. In this paper, we reveal this missing piece for multi-scale arc…
▽ More
Multi-scale architectures have shown effectiveness in a variety of tasks thanks to appealing cross-scale complementarity. However, existing architectures treat different scale features equally without considering the scale-specific characteristics, \textit{i.e.}, the within-scale characteristics are ignored in the architecture design. In this paper, we reveal this missing piece for multi-scale architecture design and accordingly propose a novel Multi-Scale Adaptive Network (MSANet) for single image denoising. Specifically, MSANet simultaneously embraces the within-scale characteristics and the cross-scale complementarity thanks to three novel neural blocks, \textit{i.e.}, adaptive feature block (AFeB), adaptive multi-scale block (AMB), and adaptive fusion block (AFuB). In brief, AFeB is designed to adaptively preserve image details and filter noises, which is highly expected for the features with mixed details and noises. AMB could enlarge the receptive field and aggregate the multi-scale information, which meets the need of contextually informative features. AFuB devotes to adaptively sampling and transferring the features from one scale to another scale, which fuses the multi-scale features with varying characteristics from coarse to fine. Extensive experiments on both three real and six synthetic noisy image datasets show the superiority of MSANet compared with 12 methods. The code could be accessed from https://github.com/XLearning-SCU/2022-NeurIPS-MSANet.
△ Less
Submitted 29 October, 2022; v1 submitted 8 March, 2022;
originally announced March 2022.
-
End-to-End Neural Speech Coding for Real-Time Communications
Authors:
Xue Jiang,
Xiulian Peng,
Chengyu Zheng,
Huaying Xue,
Yuan Zhang,
Yan Lu
Abstract:
Deep-learning based methods have shown their advantages in audio coding over traditional ones but limited attention has been paid on real-time communications (RTC). This paper proposes the TFNet, an end-to-end neural speech codec with low latency for RTC. It takes an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding. An interleaved structure is proposed…
▽ More
Deep-learning based methods have shown their advantages in audio coding over traditional ones but limited attention has been paid on real-time communications (RTC). This paper proposes the TFNet, an end-to-end neural speech codec with low latency for RTC. It takes an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding. An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies. Furthermore, with end-to-end optimization, the TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks. Both subjective and objective results demonstrate the efficiency of the proposed TFNet.
△ Less
Submitted 15 February, 2022; v1 submitted 23 January, 2022;
originally announced January 2022.
-
Polyphonic Sound Event Detection Using Capsule Neural Network on Multi-Type-Multi-Scale Time-Frequency Representation
Authors:
Wangkai **,
Junyu Liu,
Jianfeng Ren,
Xiangjun Peng
Abstract:
The challenges of polyphonic sound event detection (PSED) stem from the detection of multiple overlap** events in a time series. Recent efforts exploit Deep Neural Networks (DNNs) on Time-Frequency Representations (TFRs) of audio clips as model inputs to mitigate such issues. However, existing solutions often rely on a single type of TFR, which causes under-utilization of input features. To this…
▽ More
The challenges of polyphonic sound event detection (PSED) stem from the detection of multiple overlap** events in a time series. Recent efforts exploit Deep Neural Networks (DNNs) on Time-Frequency Representations (TFRs) of audio clips as model inputs to mitigate such issues. However, existing solutions often rely on a single type of TFR, which causes under-utilization of input features. To this end, we propose a novel PSED framework, which incorporates Multi-Type-Multi-Scale TFRs. Our key insight is that: TFRs, which are of different types or in different scales, can reveal acoustics patterns in a complementary manner, so that the overlapped events can be best extracted by combining different TFRs. Moreover, our framework design applies a novel approach, to adaptively fuse different models and TFRs symbiotically. Hence, the overall performance can be significantly improved. We quantitatively examine the benefits of our framework by using Capsule Neural Networks, a state-of-the-art approach for PSED. The experimental results show that our method achieves a reduction of 7\% in error rate compared with the state-of-the-art solutions on the TUT-SED 2016 dataset.
△ Less
Submitted 24 November, 2021;
originally announced November 2021.
-
Accelerated MRI Reconstruction with Separable and Enhanced Low-Rank Hankel Regularization
Authors:
Xinlin Zhang,
Hengfa Lu,
Di Guo,
Zongying Lai,
Huihui Ye,
Xi Peng,
Bo Zhao,
Xiaobo Qu
Abstract:
The combination of the sparse sampling and the low-rank structured matrix reconstruction has shown promising performance, enabling a significant reduction of the magnetic resonance imaging data acquisition time. However, the low-rank structured approaches demand considerable memory consumption and are time-consuming due to a noticeable number of matrix operations performed on the huge-size block H…
▽ More
The combination of the sparse sampling and the low-rank structured matrix reconstruction has shown promising performance, enabling a significant reduction of the magnetic resonance imaging data acquisition time. However, the low-rank structured approaches demand considerable memory consumption and are time-consuming due to a noticeable number of matrix operations performed on the huge-size block Hankel-like matrix. In this work, we proposed a novel framework to utilize the low-rank property but meanwhile to achieve faster reconstructions and promising results. The framework allows us to enforce the low-rankness of Hankel matrices constructing from 1D vectors instead of 2D matrices from 1D vectors and thus avoid the construction of huge block Hankel matrix for 2D k-space matrices. Moreover, under this framework, we can easily incorporate other information, such as the smooth phase of the image and the low-rankness in the parameter dimension, to further improve the image quality. We built and validated two models for parallel and parameter magnetic resonance imaging experiments, respectively. Our retrospective in-vivo results indicate that the proposed approaches enable faster reconstructions than the state-of-the-art approaches, e.g., about 8x faster than STDLRSPIRiT, and faithful removal of undersampling artifacts.
△ Less
Submitted 24 July, 2021;
originally announced July 2021.
-
Integrated Sensing and Communication from Learning Perspective: An SDP3 Approach
Authors:
Guoliang Li,
Shuai Wang,
Jie Li,
Rui Wang,
Fan Liu,
Xiaohui Peng,
Tony Xiao Han,
Chengzhong Xu
Abstract:
Characterizing the sensing and communication performance tradeoff in integrated sensing and communication (ISAC) systems is challenging in the applications of learning-based human motion recognition. This is because of the large experimental datasets and the black-box nature of deep neural networks. This paper presents SDP3, a Simulation-Driven Performance Predictor and oPtimizer, which consists o…
▽ More
Characterizing the sensing and communication performance tradeoff in integrated sensing and communication (ISAC) systems is challenging in the applications of learning-based human motion recognition. This is because of the large experimental datasets and the black-box nature of deep neural networks. This paper presents SDP3, a Simulation-Driven Performance Predictor and oPtimizer, which consists of SDP3 data simulator, SDP3 performance predictor and SDP3 performance optimizer. Specifically, the SDP3 data simulator generates vivid wireless sensing datasets in a virtual environment, the SDP3 performance predictor predicts the sensing performance based on the function regression method, and the SDP3 performance optimizer investigates the sensing and communication performance tradeoff analytically. It is shown that the simulated sensing dataset matches the experimental dataset very well in the motion recognition accuracy. By leveraging SDP3, it is found that the achievable region of recognition accuracy and communication throughput consists of a communication saturation zone, a sensing saturation zone, and a communication-sensing adversarial zone, of which the desired balanced performance for ISAC systems lies in the third one.
△ Less
Submitted 14 February, 2023; v1 submitted 20 July, 2021;
originally announced July 2021.
-
Spatial and Temporal Networks for Facial Expression Recognition in the Wild Videos
Authors:
Shuyi Mao,
Xinqi Fan,
Xiaojiang Peng
Abstract:
The paper describes our proposed methodology for the seven basic expression classification track of Affective Behavior Analysis in-the-wild (ABAW) Competition 2021. In this task, facial expression recognition (FER) methods aim to classify the correct expression category from a diverse background, but there are several challenges. First, to adapt the model to in-the-wild scenarios, we use the knowl…
▽ More
The paper describes our proposed methodology for the seven basic expression classification track of Affective Behavior Analysis in-the-wild (ABAW) Competition 2021. In this task, facial expression recognition (FER) methods aim to classify the correct expression category from a diverse background, but there are several challenges. First, to adapt the model to in-the-wild scenarios, we use the knowledge from pre-trained large-scale face recognition data. Second, we propose an ensemble model with a convolution neural network (CNN), a CNN-recurrent neural network (CNN-RNN), and a CNN-Transformer (CNN-Transformer), to incorporate both spatial and temporal information. Our ensemble model achieved F1 as 0.4133, accuracy as 0.6216 and final metric as 0.4821 on the validation set.
△ Less
Submitted 11 July, 2021;
originally announced July 2021.
-
Wireless Sensing With Deep Spectrogram Network and Primitive Based Autoregressive Hybrid Channel Model
Authors:
Guoliang Li,
Shuai Wang,
Jie Li,
Rui Wang,
Xiaohui Peng,
Tony Xiao Han
Abstract:
Human motion recognition (HMR) based on wireless sensing is a low-cost technique for scene understanding. Current HMR systems adopt support vector machines (SVMs) and convolutional neural networks (CNNs) to classify radar signals. However, whether a deeper learning model could improve the system performance is currently not known. On the other hand, training a machine learning model requires a lar…
▽ More
Human motion recognition (HMR) based on wireless sensing is a low-cost technique for scene understanding. Current HMR systems adopt support vector machines (SVMs) and convolutional neural networks (CNNs) to classify radar signals. However, whether a deeper learning model could improve the system performance is currently not known. On the other hand, training a machine learning model requires a large dataset, but data gathering from experiment is cost-expensive and time-consuming. Although wireless channel models can be adopted for dataset generation, current channel models are mostly designed for communication rather than sensing. To address the above problems, this paper proposes a deep spectrogram network (DSN) by leveraging the residual map** technique to enhance the HMR performance. Furthermore, a primitive based autoregressive hybrid (PBAH) channel model is developed, which facilitates efficient training and testing dataset generation for HMR in a virtual environment. Experimental results demonstrate that the proposed PBAH channel model matches the actual experimental data very well and the proposed DSN achieves significantly smaller recognition error than that of CNN.
△ Less
Submitted 21 April, 2021;
originally announced April 2021.
-
Phoneme-based Distribution Regularization for Speech Enhancement
Authors:
Ya**g Liu,
Xiulian Peng,
Zhiwei Xiong,
Yan Lu
Abstract:
Existing speech enhancement methods mainly separate speech from noises at the signal level or in the time-frequency domain. They seldom pay attention to the semantic information of a corrupted signal. In this paper, we aim to bridge this gap by extracting phoneme identities to help speech enhancement. Specifically, we propose a phoneme-based distribution regularization (PbDr) for speech enhancemen…
▽ More
Existing speech enhancement methods mainly separate speech from noises at the signal level or in the time-frequency domain. They seldom pay attention to the semantic information of a corrupted signal. In this paper, we aim to bridge this gap by extracting phoneme identities to help speech enhancement. Specifically, we propose a phoneme-based distribution regularization (PbDr) for speech enhancement, which incorporates frame-wise phoneme information into speech enhancement network in a conditional manner. As different phonemes always lead to different feature distributions in frequency, we propose to learn a parameter pair, i.e. scale and bias, through a phoneme classification vector to modulate the speech enhancement network. The modulation parameter pair includes not only frame-wise but also frequency-wise conditions, which effectively map features to phoneme-related distributions. In this way, we explicitly regularize speech enhancement features by recognition vectors. Experiments on public datasets demonstrate that the proposed PbDr module can not only boost the perceptual quality for speech enhancement but also the recognition accuracy of an ASR system on the enhanced speech. This PbDr module could be readily incorporated into other speech enhancement networks as well.
△ Less
Submitted 8 April, 2021;
originally announced April 2021.
-
Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots
Authors:
Zhongyu Li,
Xuxin Cheng,
Xue Bin Peng,
Pieter Abbeel,
Sergey Levine,
Glen Berseth,
Koushil Sreenath
Abstract:
Develo** robust walking controllers for bipedal robots is a challenging endeavor. Traditional model-based locomotion controllers require simplifying assumptions and careful modelling; any small errors can result in unstable control. To address these challenges for bipedal locomotion, we present a model-free reinforcement learning framework for training robust locomotion policies in simulation, w…
▽ More
Develo** robust walking controllers for bipedal robots is a challenging endeavor. Traditional model-based locomotion controllers require simplifying assumptions and careful modelling; any small errors can result in unstable control. To address these challenges for bipedal locomotion, we present a model-free reinforcement learning framework for training robust locomotion policies in simulation, which can then be transferred to a real bipedal Cassie robot. To facilitate sim-to-real transfer, domain randomization is used to encourage the policies to learn behaviors that are robust across variations in system dynamics. The learned policies enable Cassie to perform a set of diverse and dynamic behaviors, while also being more robust than traditional controllers and prior learning-based methods that use residual control. We demonstrate this on versatile walking behaviors such as tracking a target walking velocity, walking height, and turning yaw.
△ Less
Submitted 26 March, 2021;
originally announced March 2021.
-
Interactive Speech and Noise Modeling for Speech Enhancement
Authors:
Chengyu Zheng,
Xiulian Peng,
Yuan Zhang,
Sriram Srinivasan,
Yan Lu
Abstract:
Speech enhancement is challenging because of the diversity of background noise types. Most of the existing methods are focused on modelling the speech rather than the noise. In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net. In SN-Net, the two branches predict speech and noise, respectively. Instead of inform…
▽ More
Speech enhancement is challenging because of the diversity of background noise types. Most of the existing methods are focused on modelling the speech rather than the noise. In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net. In SN-Net, the two branches predict speech and noise, respectively. Instead of information fusion only at the final output layer, interaction modules are introduced at several intermediate feature domains between the two branches to benefit each other. Such an interaction can leverage features learned from one branch to counteract the undesired part and restore the missing component of the other and thus enhance their discrimination capabilities. We also design a feature extraction module, namely residual-convolution-and-attention (RA), to capture the correlations along temporal and frequency dimensions for both the speech and the noises. Evaluations on public datasets show that the interaction module plays a key role in simultaneous modeling and the SN-Net outperforms the state-of-the-art by a large margin on various evaluation metrics. The proposed SN-Net also shows superior performance for speaker separation.
△ Less
Submitted 14 April, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
An Efficient Quantitative Approach for Optimizing Convolutional Neural Networks
Authors:
Yuke Wang,
Boyuan Feng,
Xueqiao Peng,
Yufei Ding
Abstract:
With the increasing popularity of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in various domains, such as image classification and object detection, and achieve stunning success in terms of their high accuracy over the traditional statistical methods. To exploit the potential of CNN models, a huge amount of research and industry efforts have been devoted to optimiz…
▽ More
With the increasing popularity of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in various domains, such as image classification and object detection, and achieve stunning success in terms of their high accuracy over the traditional statistical methods. To exploit the potential of CNN models, a huge amount of research and industry efforts have been devoted to optimizing CNNs. Among these endeavors, CNN architecture design has attracted tremendous attention because of its great potential of improving model accuracy or reducing model complexity. However, existing work either introduces repeated training overhead in the search process or lacks an interpretable metric to guide the design. To clear these hurdles, we propose 3D-Receptive Field (3DRF), an explainable and easy-to-compute metric, to estimate the quality of a CNN architecture and guide the search process of designs. To validate the effectiveness of 3DRF, we build a static optimizer to improve the CNN architectures at both the stage level and the kernel level. Our optimizer not only provides a clear and reproducible procedure but also mitigates unnecessary training efforts in the architecture search process. Extensive experiments and studies show that the models generated by our optimizer can achieve up to 5.47% accuracy improvement and up to 65.38% parameters deduction, compared with state-of-the-art CNN structures like MobileNet and ResNet.
△ Less
Submitted 15 September, 2021; v1 submitted 11 September, 2020;
originally announced September 2020.
-
DSU-net: Dense SegU-net for automatic head-and-neck tumor segmentation in MR images
Authors:
Pin Tang,
Chen Zu,
Mei Hong,
Rui Yan,
Xingchen Peng,
Jianghong Xiao,
Xi Wu,
Jiliu Zhou,
Lu** Zhou,
Yan Wang
Abstract:
Precise and accurate segmentation of the most common head-and-neck tumor, nasopharyngeal carcinoma (NPC), in MRI sheds light on treatment and regulatory decisions making. However, the large variations in the lesion size and shape of NPC, boundary ambiguity, as well as the limited available annotated samples conspire NPC segmentation in MRI towards a challenging task. In this paper, we propose a De…
▽ More
Precise and accurate segmentation of the most common head-and-neck tumor, nasopharyngeal carcinoma (NPC), in MRI sheds light on treatment and regulatory decisions making. However, the large variations in the lesion size and shape of NPC, boundary ambiguity, as well as the limited available annotated samples conspire NPC segmentation in MRI towards a challenging task. In this paper, we propose a Dense SegU-net (DSU-net) framework for automatic NPC segmentation in MRI. Our contribution is threefold. First, different from the traditional decoder in U-net using upconvolution for upsamling, we argue that the restoration from low resolution features to high resolution output should be capable of preserving information significant for precise boundary localization. Hence, we use unpooling to unsample and propose SegU-net. Second, to combat the potential vanishing-gradient problem, we introduce dense blocks which can facilitate feature propagation and reuse. Third, using only cross entropy (CE) as loss function may bring about troubles such as miss-prediction, therefore we propose to use a loss function comprised of both CE loss and Dice loss to train the network. Quantitative and qualitative comparisons are carried out extensively on in-house datasets, the experimental results show that our proposed architecture outperforms the existing state-of-the-art segmentation networks.
△ Less
Submitted 19 December, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Deep Learning-based Modulation Detection for NOMA Systems
Authors:
Wenwu Xie,
Jian Xiao,
**xia Yang,
Xin Peng,
Chao Yu,
Peng Zhu
Abstract:
Since the signal with strong power should be demodulated first for successive interference cancellation (SIC) demodulation in non-orthogonal multiple access (NOMA) systems, the base station (BS) should inform the near user terminal (UT), which has allocated higher power, of modulation mode of the far user terminal. To avoid unnecessary signaling overhead in this process, a blind detection algorith…
▽ More
Since the signal with strong power should be demodulated first for successive interference cancellation (SIC) demodulation in non-orthogonal multiple access (NOMA) systems, the base station (BS) should inform the near user terminal (UT), which has allocated higher power, of modulation mode of the far user terminal. To avoid unnecessary signaling overhead in this process, a blind detection algorithm of NOMA signal modulation mode is designed in this paper. Taking the joint constellation density diagrams of NOMA signal as the detection features, deep residual network is built for classification, so as to detect the modulation mode of NOMA signal. In view of the fact that the joint constellation diagrams are easily polluted by high intensity noise and lose their real distribution pattern, the wavelet denoising method is adopted to improve the quality of constellations. The simulation results represent that the proposed algorithm can achieve satisfactory detection accuracy in NOMA systems. In addition, the factors affecting the recognition performance are also verified and analyzed.
△ Less
Submitted 16 October, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
-
New Security Challenges on Machine Learning Inference Engine: Chip Cloning and Model Reverse Engineering
Authors:
Shanshi Huang,
Xiaochen Peng,
Hongwu Jiang,
Yandong Luo,
Shimeng Yu
Abstract:
Machine learning inference engine is of great interest to smart edge computing. Compute-in-memory (CIM) architecture has shown significant improvements in throughput and energy efficiency for hardware acceleration. Emerging non-volatile memory technologies offer great potential for instant on and off by dynamic power gating. Inference engine is typically pre-trained by the cloud and then being dep…
▽ More
Machine learning inference engine is of great interest to smart edge computing. Compute-in-memory (CIM) architecture has shown significant improvements in throughput and energy efficiency for hardware acceleration. Emerging non-volatile memory technologies offer great potential for instant on and off by dynamic power gating. Inference engine is typically pre-trained by the cloud and then being deployed to the filed. There are new attack models on chip cloning and neural network model reverse engineering. In this paper, we propose countermeasures to the weight cloning and input-output pair attacks. The first strategy is the weight fine-tune to compensate the analog-to-digital converter (ADC) offset for a specific chip instance while inducing significant accuracy drop for cloned chip instances. The second strategy is the weight shuffle and fake rows insertion to allow accurate propagation of the activations of the neural network only with a key.
△ Less
Submitted 21 March, 2020;
originally announced March 2020.
-
Deep Anomaly Detection in Packet Payload
Authors:
Jiaxin Liu,
Xucheng Song,
Yingjie Zhou,
Xi Peng,
Yanru Zhang,
Pei Liu,
Dapeng Wu
Abstract:
With the widespread adoption of cloud services, especially the extensive deployment of plenty of Web applications, it is important and challenging to detect anomalies from the packet payload. For example, the anomalies in the packet payload can be expressed as a number of specific strings which may cause attacks. Although some approaches have achieved remarkable progress, they are with limited app…
▽ More
With the widespread adoption of cloud services, especially the extensive deployment of plenty of Web applications, it is important and challenging to detect anomalies from the packet payload. For example, the anomalies in the packet payload can be expressed as a number of specific strings which may cause attacks. Although some approaches have achieved remarkable progress, they are with limited applications since they are dependent on in-depth expert knowledge, e.g., signatures describing anomalies or communication protocol at the application level. Moreover, they might fail to detect the payload anomalies that have long-term dependency relationships. To overcome these limitations and adaptively detect anomalies from the packet payload, we propose a deep learning based framework which consists of two steps. First, a novel feature engineering method is proposed to obtain the block-based features via block sequence extraction and block embedding. The block-based features could encapsulate both the high-dimension information and the underlying sequential information which facilitate the anomaly detection. Second, a neural network is designed to learn the representation of packet payload based on Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). Furthermore, we cast the anomaly detection as a classification problem and stack a Multi-Layer Perception (MLP) on the above representation learning network to detect anomalies. Extensive experimental results on three public datasets indicate that our model could achieve a higher detection rate, while kee** a lower false positive rate compared with five state-of-the-art methods.
△ Less
Submitted 5 December, 2019;
originally announced December 2019.
-
Deformable Non-local Network for Video Super-Resolution
Authors:
Hua Wang,
Dewei Su,
Chuangchuang Liu,
Longcun **,
Xianfang Sun,
Xinyi Peng
Abstract:
The video super-resolution (VSR) task aims to restore a high-resolution (HR) video frame by using its corresponding low-resolution (LR) frame and multiple neighboring frames. At present, many deep learning-based VSR methods rely on optical flow to perform frame alignment. The final recovery results will be greatly affected by the accuracy of optical flow. However, optical flow estimation cannot be…
▽ More
The video super-resolution (VSR) task aims to restore a high-resolution (HR) video frame by using its corresponding low-resolution (LR) frame and multiple neighboring frames. At present, many deep learning-based VSR methods rely on optical flow to perform frame alignment. The final recovery results will be greatly affected by the accuracy of optical flow. However, optical flow estimation cannot be completely accurate, and there are always some errors. In this paper, we propose a novel deformable non-local network (DNLN) which is a non-optical-flow-based method. Specifically, we apply the deformable convolution and improve its ability of adaptive alignment at the feature level. Furthermore, we utilize a non-local structure to capture the global correlation between the reference frame and the aligned neighboring frames, and simultaneously enhance desired fine details in the aligned frames. To reconstruct the final high-quality HR video frames, we use residual in residual dense blocks to take full advantage of the hierarchical features. Experimental results on benchmark datasets demonstrate that the proposed DNLN can achieve state-of-the-art performance on VSR task.
△ Less
Submitted 21 December, 2019; v1 submitted 23 September, 2019;
originally announced September 2019.
-
Mobile Formation Coordination and Tracking Control for Multiple Non-holonomic Vehicles
Authors:
Xiuhui Peng,
Zhiyong Sun,
Kexin Guo,
Zhiyong Geng
Abstract:
This paper addresses forward motion control for trajectory tracking and mobile formation coordination for a group of non-holonomic vehicles on SE(2). Firstly, by constructing an intermediate attitude variable which involves vehicles' position information and desired attitude, the translational and rotational control inputs are designed in two stages to solve the trajectory tracking problem. Second…
▽ More
This paper addresses forward motion control for trajectory tracking and mobile formation coordination for a group of non-holonomic vehicles on SE(2). Firstly, by constructing an intermediate attitude variable which involves vehicles' position information and desired attitude, the translational and rotational control inputs are designed in two stages to solve the trajectory tracking problem. Secondly, the coordination relationships of relative positions and headings are explored thoroughly for a group of non-holonomic vehicles to maintain a mobile formation with rigid body motion constraints. We prove that, except for the cases of parallel formation and translational straight line formation, a mobile formation with strict rigid-body motion can be achieved if and only if the ratios of linear speed to angular speed for each individual vehicle are constants. Motion properties for mobile formation with weak rigid-body motion are also demonstrated. Thereafter, based on the proposed trajectory tracking approach, a distributed mobile formation control law is designed under a directed tree graph. The performance of the proposed controllers is validated by both numerical simulations and experiments.
△ Less
Submitted 28 February, 2019;
originally announced February 2019.
-
A Lightweight Music Texture Transfer System
Authors:
Xutan Peng,
Chen Li,
Zhi Cai,
Faqiang Shi,
Yidan Liu,
Jianxin Li
Abstract:
Deep learning researches on the transformation problems for image and text have raised great attention. However, present methods for music feature transfer using neural networks are far from practical application. In this paper, we initiate a novel system for transferring the texture of music, and release it as an open source project. Its core algorithm is composed of a converter which represents…
▽ More
Deep learning researches on the transformation problems for image and text have raised great attention. However, present methods for music feature transfer using neural networks are far from practical application. In this paper, we initiate a novel system for transferring the texture of music, and release it as an open source project. Its core algorithm is composed of a converter which represents sounds as texture spectra, a corresponding reconstructor and a feed-forward transfer network. We evaluate this system from multiple perspectives, and experimental results reveal that it achieves convincing results in both sound effects and computational performance.
△ Less
Submitted 4 August, 2021; v1 submitted 27 September, 2018;
originally announced October 2018.