-
Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer
Authors:
Haoxu Wang,
Ming Cheng,
Qiang Fu,
Ming Li
Abstract:
In recent years, neural network-based Wake Word Spotting achieves good performance on clean audio samples but struggles in noisy environments. Audio-Visual Wake Word Spotting (AVWWS) receives lots of attention because visual lip movement information is not affected by complex acoustic scenes. Previous works usually use simple addition or concatenation for multi-modal fusion. The inter-modal correl…
▽ More
In recent years, neural network-based Wake Word Spotting achieves good performance on clean audio samples but struggles in noisy environments. Audio-Visual Wake Word Spotting (AVWWS) receives lots of attention because visual lip movement information is not affected by complex acoustic scenes. Previous works usually use simple addition or concatenation for multi-modal fusion. The inter-modal correlation remains relatively under-explored. In this paper, we propose a novel module called Frame-Level Cross-Modal Attention (FLCMA) to improve the performance of AVWWS systems. This module can help model multi-modal information at the frame-level through synchronous lip movements and speech signals. We train the end-to-end FLCMA based Audio-Visual Conformer and further improve the performance by fine-tuning pre-trained uni-modal models for the AVWWS task. The proposed system achieves a new state-of-the-art result (4.57% WWS score) on the far-field MISP dataset.
△ Less
Submitted 3 March, 2024;
originally announced March 2024.
-
Limitations of Data-Driven Spectral Reconstruction -- Optics-Aware Analysis and Mitigation
Authors:
Qiang Fu,
Matheus Souza,
Eunsue Choi,
Suhyun Shin,
Seung-Hwan Baek,
Wolfgang Heidrich
Abstract:
Hyperspectral imaging empowers machine vision systems with the distinct capability of identifying materials through recording their spectral signatures. Recent efforts in data-driven spectral reconstruction aim at extracting spectral information from RGB images captured by cost-effective RGB cameras, instead of dedicated hardware.
In this paper we systematically analyze the performance of such m…
▽ More
Hyperspectral imaging empowers machine vision systems with the distinct capability of identifying materials through recording their spectral signatures. Recent efforts in data-driven spectral reconstruction aim at extracting spectral information from RGB images captured by cost-effective RGB cameras, instead of dedicated hardware.
In this paper we systematically analyze the performance of such methods, evaluating both the practical limitations with respect to current datasets and overfitting, as well as fundamental limitations with respect to the nature of the information encoded in the RGB images, and the dependency of this information on the optical system of the camera.
We find that, the current models are not robust under slight variations, e.g., in noise level or compression of the RGB file. Without modeling underrepresented spectral content, existing datasets and the models trained on them are limited in their ability to cope with challenging metameric colors. To mitigate this issue, we propose to exploit the combination of metameric data augmentation and optical lens aberrations to improve the encoding of the metameric information into the RGB image, which paves the road towards higher performing spectral imaging and reconstruction approaches.
△ Less
Submitted 2 April, 2024; v1 submitted 8 January, 2024;
originally announced January 2024.
-
Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving
Authors:
Yizhou Wang,
Jen-Hao Cheng,
Jui-Te Huang,
Sheng-Yao Kuan,
Qiqian Fu,
Chiming Ni,
Shengyu Hao,
Gaoang Wang,
Guanbin Xing,
Hui Liu,
Jenq-Neng Hwang
Abstract:
Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR. However, the collaboration between camera and radar is significantly under-exploited. The incorporation of rich semantic information from the camera, and reliable 3D information from the radar can potentially achieve an eff…
▽ More
Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR. However, the collaboration between camera and radar is significantly under-exploited. The incorporation of rich semantic information from the camera, and reliable 3D information from the radar can potentially achieve an efficient, cheap, and portable solution for 3D object perception tasks. It can also be robust to different lighting or all-weather driving scenarios due to the capability of mmWave radars. In this paper, we introduce the CRUW3D dataset, including 66K synchronized and well-calibrated camera, radar, and LiDAR frames in various driving scenarios. Unlike other large-scale autonomous driving datasets, our radar data is in the format of radio frequency (RF) tensors that contain not only 3D location information but also spatio-temporal semantic information. This kind of radar format can enable machine learning models to generate more reliable object perception results after interacting and fusing the information or features between the camera and radar.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
RT-SRTS: Angle-Agnostic Real-Time Simultaneous 3D Reconstruction and Tumor Segmentation from Single X-Ray Projection
Authors:
Miao Zhu,
Qiming Fu,
Bo Liu,
Mengxi Zhang,
Bojian Li,
Xiaoyan Luo,
Fugen Zhou
Abstract:
Radiotherapy is one of the primary treatment methods for tumors, but the organ movement caused by respiration limits its accuracy. Recently, 3D imaging from a single X-ray projection has received extensive attention as a promising approach to address this issue. However, current methods can only reconstruct 3D images without directly locating the tumor and are only validated for fixed-angle imagin…
▽ More
Radiotherapy is one of the primary treatment methods for tumors, but the organ movement caused by respiration limits its accuracy. Recently, 3D imaging from a single X-ray projection has received extensive attention as a promising approach to address this issue. However, current methods can only reconstruct 3D images without directly locating the tumor and are only validated for fixed-angle imaging, which fails to fully meet the requirements of motion control in radiotherapy. In this study, a novel imaging method RT-SRTS is proposed which integrates 3D imaging and tumor segmentation into one network based on multi-task learning (MTL) and achieves real-time simultaneous 3D reconstruction and tumor segmentation from a single X-ray projection at any angle. Furthermore, the attention enhanced calibrator (AEC) and uncertain-region elaboration (URE) modules have been proposed to aid feature extraction and improve segmentation accuracy. The proposed method was evaluated on fifteen patient cases and compared with three state-of-the-art methods. It not only delivers superior 3D reconstruction but also demonstrates commendable tumor segmentation results. Simultaneous reconstruction and segmentation can be completed in approximately 70 ms, significantly faster than the required time threshold for real-time tumor tracking. The efficacies of both AEC and URE have also been validated in ablation studies. The code of work is available at https://github.com/ZywooSimple/RT-SRTS.
△ Less
Submitted 28 March, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Aberration-Aware Depth-from-Focus
Authors:
Xinge Yang,
Qiang Fu,
Mohammed Elhoseiny,
Wolfgang Heidrich
Abstract:
Computer vision methods for depth estimation usually use simple camera models with idealized optics. For modern machine learning approaches, this creates an issue when attempting to train deep networks with simulated data, especially for focus-sensitive tasks like Depth-from-Focus. In this work, we investigate the domain gap caused by off-axis aberrations that will affect the decision of the best-…
▽ More
Computer vision methods for depth estimation usually use simple camera models with idealized optics. For modern machine learning approaches, this creates an issue when attempting to train deep networks with simulated data, especially for focus-sensitive tasks like Depth-from-Focus. In this work, we investigate the domain gap caused by off-axis aberrations that will affect the decision of the best-focused frame in a focal stack. We then explore bridging this domain gap through aberration-aware training (AAT). Our approach involves a lightweight network that models lens aberrations at different positions and focus distances, which is then integrated into the conventional network training pipeline. We evaluate the generality of pretrained models on both synthetic and real-world data. Our experimental results demonstrate that the proposed AAT scheme can improve depth estimation accuracy without fine-tuning the model or modifying the network architecture.
△ Less
Submitted 17 July, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis
Authors:
Haoxu Wang,
Ming Cheng,
Qiang Fu,
Ming Li
Abstract:
This paper further explores our previous wake word spotting system ranked 2-nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) for our system to improve performance. Second, we explore different combinations of data augmentation methods for better performance. Finally, we study the…
▽ More
This paper further explores our previous wake word spotting system ranked 2-nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) for our system to improve performance. Second, we explore different combinations of data augmentation methods for better performance. Finally, we study the fusion strategies, including score-level, cascaded and neural fusion. Our proposed multimodal system leverages multimodal features and uses the complementary visual information to mitigate the performance degradation of audio-only systems in complex acoustic scenarios. Our system obtains a false reject rate of 2.15% and a false alarm rate of 3.44% in the evaluation set of the competition database, which achieves the new state-of-the-art performance by 21% relative improvement compared to previous systems.
△ Less
Submitted 4 March, 2023;
originally announced March 2023.
-
Snake and Snake Robot Locomotion in Complex, 3-D Terrain
Authors:
Qiyuan Fu
Abstract:
Snakes can traverse almost all types of environments by bending their elongate bodies in 3-D to interact with the terrain. Similarly, a snake robot is a promising platform to perform critical tasks in various environments. Understanding how 3-D body bending effectively interacts with the terrain for propulsion and stability can not only inform how snakes traverse natural environments, but also all…
▽ More
Snakes can traverse almost all types of environments by bending their elongate bodies in 3-D to interact with the terrain. Similarly, a snake robot is a promising platform to perform critical tasks in various environments. Understanding how 3-D body bending effectively interacts with the terrain for propulsion and stability can not only inform how snakes traverse natural environments, but also allow snake robots to achieve similar performance.
How snakes and snake robots move on flat surfaces has been understood well. However, such ideal terrain is rare in natural environments and little was understood about how to generate propulsion and maintain stability in 3-D terrain, except for some studies on arboreal snake locomotion and on robots using geometric planning. To bridge the knowledge gap, we integrated animal experiments and robotic studies in three representative environments: a large smooth step, an uneven arena of blocks of large height variation, and large bumps.
We discovered that vertical body bending induces stability challenges but can generate large propulsion. When traversing a large smooth step, a snake robot is challenged by roll instability that increases with the amplitude of vertical bending. The instability can be reduced by body compliance that statistically improves body-terrain contact. Despite this, vertical body bending can potentially allow snakes to push against terrain for propulsion, as demonstrated by corn snakes traversing an uneven arena. A snake robot can generate large propulsion like this if contact is well maintained. Contact feedback control can help accommodate perturbations such as novel terrain geometry or excessive external forces by improving contact. Our findings provide insights into how snakes and snake robots can use vertical body bending for efficient and versatile traversal of the 3-D world stably.
△ Less
Submitted 22 February, 2023;
originally announced February 2023.
-
Joint Acoustic Echo Cancellation and Speech Dereverberation Using Kalman filters
Authors:
Ziteng Wang,
Yueyue Na,
Biao Tian,
Qiang Fu
Abstract:
This paper proposes a joint acoustic echo cancellation (AEC) and speech dereverberation (DR) algorithm in the short-time Fourier transform domain. The reverberant microphone signals are described using an auto-regressive (AR) model. The AR coefficients and the loudspeaker-to-microphone acoustic transfer functions (ATFs) are considered time-varying and are modeled simultaneously using a first-order…
▽ More
This paper proposes a joint acoustic echo cancellation (AEC) and speech dereverberation (DR) algorithm in the short-time Fourier transform domain. The reverberant microphone signals are described using an auto-regressive (AR) model. The AR coefficients and the loudspeaker-to-microphone acoustic transfer functions (ATFs) are considered time-varying and are modeled simultaneously using a first-order Markov process. This leads to a solution where these parameters can be optimally estimated using Kalman filters. It is shown that the proposed algorithm outperforms vanilla solutions that solve AEC and DR sequentially and one state-of-the-art joint DRAEC algorithm based on semi-blind source separation, in terms of both speech quality and echo reduction performance.
△ Less
Submitted 9 February, 2023;
originally announced February 2023.
-
Curriculum Learning for ab initio Deep Learned Refractive Optics
Authors:
Xinge Yang,
Qiang Fu,
Wolfgang Heidrich
Abstract:
Deep optical optimization has recently emerged as a new paradigm for designing computational imaging systems using only the output image as the objective. However, it has been limited to either simple optical systems consisting of a single element such as a diffractive optical element (DOE) or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a DeepLens des…
▽ More
Deep optical optimization has recently emerged as a new paradigm for designing computational imaging systems using only the output image as the objective. However, it has been limited to either simple optical systems consisting of a single element such as a diffractive optical element (DOE) or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a DeepLens design method based on curriculum learning, which is able to learn optical designs of compound lenses ab initio from randomly initialized surfaces without human intervention, therefore overcoming the need for a good initial design. We demonstrate the effectiveness of our approach by fully automatically designing both classical imaging lenses and a large field-of-view extended depth-of-field computational lens in a cellphone-style form factor, with highly aspheric surfaces and a short back focal length.
△ Less
Submitted 17 March, 2024; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Vector Quantized Semantic Communication System
Authors:
Qifan Fu,
Huiqiang Xie,
Zhi** Qin,
Gregory Slabaugh,
Xiaoming Tao
Abstract:
Although analog semantic communication systems have received considerable attention in the literature, there is less work on digital semantic communication systems. In this paper, we develop a deep learning (DL)-enabled vector quantized (VQ) semantic communication system for image transmission, named VQ-DeepSC. Specifically, we propose a convolutional neural network (CNN)-based transceiver to extr…
▽ More
Although analog semantic communication systems have received considerable attention in the literature, there is less work on digital semantic communication systems. In this paper, we develop a deep learning (DL)-enabled vector quantized (VQ) semantic communication system for image transmission, named VQ-DeepSC. Specifically, we propose a convolutional neural network (CNN)-based transceiver to extract multi-scale semantic features of images and introduce multi-scale semantic embedding spaces to perform semantic feature quantization, rendering the data compatible with digital communication systems. Furthermore, we employ adversarial training to improve the quality of received images by introducing a PatchGAN discriminator. Experimental results demonstrate that the proposed VQ-DeepSC is more robustness than BPG in digital communication systems and has comparable MS-SSIM performance to the DeepJSCC method.
△ Less
Submitted 12 April, 2023; v1 submitted 23 September, 2022;
originally announced September 2022.
-
CM-MLP: Cascade Multi-scale MLP with Axial Context Relation Encoder for Edge Segmentation of Medical Image
Authors:
**kai Lv,
Yuyong Hu,
Quanshui Fu,
Zhiwang Zhang,
Yuqiang Hu,
Lin Lv,
Guoqing Yang,
**peng Li,
Yi Zhao
Abstract:
The convolutional-based methods provide good segmentation performance in the medical image segmentation task. However, those methods have the following challenges when dealing with the edges of the medical images: (1) Previous convolutional-based methods do not focus on the boundary relationship between foreground and background around the segmentation edge, which leads to the degradation of segme…
▽ More
The convolutional-based methods provide good segmentation performance in the medical image segmentation task. However, those methods have the following challenges when dealing with the edges of the medical images: (1) Previous convolutional-based methods do not focus on the boundary relationship between foreground and background around the segmentation edge, which leads to the degradation of segmentation performance when the edge changes complexly. (2) The inductive bias of the convolutional layer cannot be adapted to complex edge changes and the aggregation of multiple-segmented areas, resulting in its performance improvement mostly limited to segmenting the body of segmented areas instead of the edge. To address these challenges, we propose the CM-MLP framework on MFI (Multi-scale Feature Interaction) block and ACRE (Axial Context Relation Encoder) block for accurate segmentation of the edge of medical image. In the MFI block, we propose the cascade multi-scale MLP (Cascade MLP) to process all local information from the deeper layers of the network simultaneously and utilize a cascade multi-scale mechanism to fuse discrete local information gradually. Then, the ACRE block is used to make the deep supervision focus on exploring the boundary relationship between foreground and background to modify the edge of the medical image. The segmentation accuracy (Dice) of our proposed CM-MLP framework reaches 96.96%, 96.76%, and 82.54% on three benchmark datasets: CVC-ClinicDB dataset, sub-Kvasir dataset, and our in-house dataset, respectively, which significantly outperform the state-of-the-art method. The source code and trained models will be available at https://github.com/ProgrammerHyy/CM-MLP.
△ Less
Submitted 22 August, 2022;
originally announced August 2022.
-
Personalized Acoustic Echo Cancellation for Full-duplex Communications
Authors:
Shimin Zhang,
Ziteng Wang,
Yukai Ju,
Yihui Fu,
Yueyue Na,
Qiang Fu,
Lei Xie
Abstract:
Deep neural networks (DNNs) have shown promising results for acoustic echo cancellation (AEC). But the DNN-based AEC models let through all near-end speakers including the interfering speech. In light of recent studies on personalized speech enhancement, we investigate the feasibility of personalized acoustic echo cancellation (PAEC) in this paper for full-duplex communications, where background n…
▽ More
Deep neural networks (DNNs) have shown promising results for acoustic echo cancellation (AEC). But the DNN-based AEC models let through all near-end speakers including the interfering speech. In light of recent studies on personalized speech enhancement, we investigate the feasibility of personalized acoustic echo cancellation (PAEC) in this paper for full-duplex communications, where background noise and interfering speakers may coexist with acoustic echoes. Specifically, we first propose a novel backbone neural network termed as gated temporal convolutional neural network (GTCNN) that outperforms state-of-the-art AEC models in performance. Speaker embeddings like d-vectors are further adopted as auxiliary information to guide the GTCNN to focus on the target speaker. A special case in PAEC is that speech snippets of both parties on the call are enrolled. Experimental results show that auxiliary information from either the near-end speaker or the far-end speaker can improve the DNN-based AEC performance. Nevertheless, there is still much room for improvement in the utilization of the finite-dimensional speaker embeddings.
△ Less
Submitted 29 June, 2022; v1 submitted 30 May, 2022;
originally announced May 2022.
-
Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness
Authors:
Dianwen Ng,
** Hui Pang,
Yang Xiao,
Biao Tian,
Qiang Fu,
Eng Siong Chng
Abstract:
It is critical for a keyword spotting model to have a small footprint as it typically runs on-device with low computational resources. However, maintaining the previous SOTA performance with reduced model size is challenging. In addition, a far-field and noisy environment with multiple signals interference aggravates the problem causing the accuracy to degrade significantly. In this paper, we pres…
▽ More
It is critical for a keyword spotting model to have a small footprint as it typically runs on-device with low computational resources. However, maintaining the previous SOTA performance with reduced model size is challenging. In addition, a far-field and noisy environment with multiple signals interference aggravates the problem causing the accuracy to degrade significantly. In this paper, we present a multi-channel ConvMixer for speech command recognitions. The novel architecture introduces an additional audio channel mixing for channel audio interaction in a multi-channel audio setting to achieve better noise-robust features with more efficient computation. Besides, we proposed a centroid based awareness component to enhance the system by equip** it with additional spatial geometry information in the latent feature projection space. We evaluate our model using the new MISP challenge 2021 dataset. Our model achieves significant improvement against the official baseline with a 55% gain in the competition score (0.152) on raw microphone array input and a 63% (0.126) boost upon front-end speech enhancement.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
SA-SASV: An End-to-End Spoof-Aggregated Spoofing-Aware Speaker Verification System
Authors:
Zhongwei Teng,
Quchen Fu,
Jules White,
Maria E. Powell,
Douglas C. Schmidt
Abstract:
Research in the past several years has boosted the performance of automatic speaker verification systems and countermeasure systems to deliver low Equal Error Rates (EERs) on each system. However, research on joint optimization of both systems is still limited. The Spoofing-Aware Speaker Verification (SASV) 2022 challenge was proposed to encourage the development of integrated SASV systems with ne…
▽ More
Research in the past several years has boosted the performance of automatic speaker verification systems and countermeasure systems to deliver low Equal Error Rates (EERs) on each system. However, research on joint optimization of both systems is still limited. The Spoofing-Aware Speaker Verification (SASV) 2022 challenge was proposed to encourage the development of integrated SASV systems with new metrics to evaluate joint model performance. This paper proposes an ensemble-free end-to-end solution, known as Spoof-Aggregated-SASV (SA-SASV) to build a SASV system with multi-task classifiers, which are optimized by multiple losses and has more flexible requirements in training set. The proposed system is trained on the ASVSpoof 2019 LA dataset, a spoof verification dataset with small number of bonafide speakers. Results of SASV-EER indicate that the model performance can be further improved by training in complete automatic speaker verification and countermeasure datasets.
△ Less
Submitted 24 March, 2022; v1 submitted 12 March, 2022;
originally announced March 2022.
-
Multi-Task Deep Residual Echo Suppression with Echo-aware Loss
Authors:
Shimin Zhang,
Ziteng Wang,
Jiayao Sun,
Yihui Fu,
Biao Tian,
Qiang Fu,
Lei Xie
Abstract:
This paper introduces the NWPU Team's entry to the ICASSP 2022 AEC Challenge. We take a hybrid approach that cascades a linear AEC with a neural post-filter. The former is used to deal with the linear echo components while the latter suppresses the residual non-linear echo components. We use gated convolutional F-T-LSTM neural network (GFTNN) as the backbone and shape the post-filter by a multi-ta…
▽ More
This paper introduces the NWPU Team's entry to the ICASSP 2022 AEC Challenge. We take a hybrid approach that cascades a linear AEC with a neural post-filter. The former is used to deal with the linear echo components while the latter suppresses the residual non-linear echo components. We use gated convolutional F-T-LSTM neural network (GFTNN) as the backbone and shape the post-filter by a multi-task learning (MTL) framework, where a voice activity detection (VAD) module is adopted as an auxiliary task along with echo suppression, with the aim to avoid over suppression that may cause speech distortion. Moreover, we adopt an echo-aware loss function, where the mean square error (MSE) loss can be optimized particularly for every time-frequency bin (TF-bin) according to the signal-to-echo ratio (SER), leading to further suppression on the echo. Extensive ablation study shows that the time delay estimation (TDE) module in neural post-filter leads to better perceptual quality, and an adaptive filter with better convergence will bring consistent performance gain for the post-filter. Besides, we find that using the linear echo as the input of our neural post-filter is a better choice than using the reference signal directly. In the ICASSP 2022 AEC-Challenge, our approach has ranked the 1st place on word accuracy (WAcc) (0.817) and the 3rd place on both mean opinion score (MOS) (4.502) and the final score (0.864).
△ Less
Submitted 20 February, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
ConvMixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-field Keyword Spotting
Authors:
Dianwen Ng,
Yunqi Chen,
Biao Tian,
Qiang Fu,
Eng Siong Chng
Abstract:
Building efficient architecture in neural speech processing is paramount to success in keyword spotting deployment. However, it is very challenging for lightweight models to achieve noise robustness with concise neural operations. In a real-world application, the user environment is typically noisy and may also contain reverberations. We proposed a novel feature interactive convolutional model wit…
▽ More
Building efficient architecture in neural speech processing is paramount to success in keyword spotting deployment. However, it is very challenging for lightweight models to achieve noise robustness with concise neural operations. In a real-world application, the user environment is typically noisy and may also contain reverberations. We proposed a novel feature interactive convolutional model with merely 100K parameters to tackle this under the noisy far-field condition. The interactive unit is proposed in place of the attention module that promotes the flow of information with more efficient computations. Moreover, curriculum-based multi-condition training is adopted to attain better noise robustness. Our model achieves 98.2% top-1 accuracy on Google Speech Command V2-12 and is competitive against large transformer models under the designed noise condition.
△ Less
Submitted 15 January, 2022;
originally announced January 2022.
-
Contact feedback helps snake robots propel against uneven terrain using vertical bending
Authors:
Qiyuan Fu,
Chen Li
Abstract:
Snakes can bend their elongate bodies in various forms to traverse various environments. We understand how snakes use lateral bending to push against asperities on flat ground for propulsion, and snake robots can do so effectively. However, snakes can also use vertical bending to push against terrain of large height variation for propulsion, and they can adjust it to adapt to novel terrain presuma…
▽ More
Snakes can bend their elongate bodies in various forms to traverse various environments. We understand how snakes use lateral bending to push against asperities on flat ground for propulsion, and snake robots can do so effectively. However, snakes can also use vertical bending to push against terrain of large height variation for propulsion, and they can adjust it to adapt to novel terrain presumably using mechano-sensing feedback control. Although some snake robots can traverse uneven terrain, few have used vertical bending for propulsion, and how to control it in novel environments is poorly understood. Here we systematically studied a snake robot with force sensors pushing against large bumps using vertical bending to understand the role of sensory feedback control. We compared a feedforward controller and four feedback controllers that use different senses and generate distinct bending patterns and body-terrain interaction. We challenged the robot with increasing backward load and novel terrain geometry that break its contact with the terrain. We further varied how much the feedback control modulated bending to conform to or push against the terrain to test their effects. Feedforward propagation of vertical bending generated large propulsion when the shape matched terrain geometry. However, when perturbations caused loss of contact, the robot easily lost propulsion or had motor overload. Contact feedback control resolved these by improving contact. Yet excessive conformation interrupted propagation and excessive pushing stalled motors. Unlike that using lateral bending, for propulsion generation using vertical bending, body weight can help maintain contact with the environment but may also overload motors. Our results will help snake robots better traverse terrain with large height variation and can inform how snakes use sensory feedback to control vertical bending for propulsion.
△ Less
Submitted 1 August, 2023; v1 submitted 14 December, 2021;
originally announced December 2021.
-
Snapshot HDR Video Construction Using Coded Mask
Authors:
Masheal Alghamdi,
Qiang Fu,
Ali Thabet,
Wolfgang Heidrich
Abstract:
This paper study the reconstruction of High Dynamic Range (HDR) video from snapshot-coded LDR video. Constructing an HDR video requires restoring the HDR values for each frame and maintaining the consistency between successive frames. HDR image acquisition from single image capture, also known as snapshot HDR imaging, can be achieved in several ways. For example, the reconfigurable snapshot HDR ca…
▽ More
This paper study the reconstruction of High Dynamic Range (HDR) video from snapshot-coded LDR video. Constructing an HDR video requires restoring the HDR values for each frame and maintaining the consistency between successive frames. HDR image acquisition from single image capture, also known as snapshot HDR imaging, can be achieved in several ways. For example, the reconfigurable snapshot HDR camera is realized by introducing an optical element into the optical stack of the camera; by placing a coded mask at a small standoff distance in front of the sensor. High-quality HDR image can be recovered from the captured coded image using deep learning methods. This study utilizes 3D-CNNs to perform a joint demosaicking, denoising, and HDR video reconstruction from coded LDR video. We enforce more temporally consistent HDR video reconstruction by introducing a temporal loss function that considers the short-term and long-term consistency. The obtained results are promising and could lead to affordable HDR video capture using conventional cameras.
△ Less
Submitted 5 December, 2021;
originally announced December 2021.
-
Controllable Multichannel Speech Dereverberation based on Deep Neural Networks
Authors:
Ziteng Wang,
Yueyue Na,
Biao Tian,
Qiang Fu
Abstract:
Neural network based speech dereverberation has achieved promising results in recent studies. Nevertheless, many are focused on recovery of only the direct path sound and early reflections, which could be beneficial to speech perception, are discarded. The performance of a model trained to recover clean speech degrades when evaluated on early reverberation targets, and vice versa. This paper propo…
▽ More
Neural network based speech dereverberation has achieved promising results in recent studies. Nevertheless, many are focused on recovery of only the direct path sound and early reflections, which could be beneficial to speech perception, are discarded. The performance of a model trained to recover clean speech degrades when evaluated on early reverberation targets, and vice versa. This paper proposes a novel deep neural network based multichannel speech dereverberation algorithm, in which the dereverberation level is controllable. This is realized by adding a simple floating-point number as target controller of the model. Experiments are conducted using spatially distributed microphones, and the efficacy of the proposed algorithm is confirmed in various simulated conditions.
△ Less
Submitted 15 October, 2021;
originally announced October 2021.
-
NN3A: Neural Network supported Acoustic Echo Cancellation, Noise Suppression and Automatic Gain Control for Real-Time Communications
Authors:
Ziteng Wang,
Yueyue Na,
Biao Tian,
Qiang Fu
Abstract:
Acoustic echo cancellation (AEC), noise suppression (NS) and automatic gain control (AGC) are three often required modules for real-time communications (RTC). This paper proposes a neural network supported algorithm for RTC, namely NN3A, which incorporates an adaptive filter and a multi-task model for residual echo suppression, noise reduction and near-end speech activity detection. The proposed a…
▽ More
Acoustic echo cancellation (AEC), noise suppression (NS) and automatic gain control (AGC) are three often required modules for real-time communications (RTC). This paper proposes a neural network supported algorithm for RTC, namely NN3A, which incorporates an adaptive filter and a multi-task model for residual echo suppression, noise reduction and near-end speech activity detection. The proposed algorithm is shown to outperform both a method using separate models and an end-to-end alternative. It is further shown that there exists a trade-off in the model between residual suppression and near-end speech distortion, which could be balanced by a novel loss weighting function. Several practical aspects of training the joint model are also investigated to push its performance to limit.
△ Less
Submitted 15 October, 2021;
originally announced October 2021.
-
Neural Étendue Expander for Ultra-Wide-Angle High-Fidelity Holographic Display
Authors:
Ethan Tseng,
Grace Kuo,
Seung-Hwan Baek,
Nathan Matsuda,
Andrew Maimone,
Florian Schiffers,
Praneeth Chakravarthula,
Qiang Fu,
Wolfgang Heidrich,
Douglas Lanman,
Felix Heide
Abstract:
Holographic displays can generate light fields by dynamically modulating the wavefront of a coherent beam of light using a spatial light modulator, promising rich virtual and augmented reality applications. However, the limited spatial resolution of existing dynamic spatial light modulators imposes a tight bound on the diffraction angle. As a result, modern holographic displays possess low étendue…
▽ More
Holographic displays can generate light fields by dynamically modulating the wavefront of a coherent beam of light using a spatial light modulator, promising rich virtual and augmented reality applications. However, the limited spatial resolution of existing dynamic spatial light modulators imposes a tight bound on the diffraction angle. As a result, modern holographic displays possess low étendue, which is the product of the display area and the maximum solid angle of diffracted light. The low étendue forces a sacrifice of either the field-of-view (FOV) or the display size. In this work, we lift this limitation by presenting neural étendue expanders. This new breed of optical elements, which is learned from a natural image dataset, enables higher diffraction angles for ultra-wide FOV while maintaining both a compact form factor and the fidelity of displayed contents to human viewers. With neural étendue expanders, we experimentally achieve 64$\times$ étendue expansion of natural images in full color, expanding the FOV by an order of magnitude horizontally and vertically, with high-fidelity reconstruction quality (measured in PSNR) over 29 dB on retinal-resolution images.
△ Less
Submitted 26 April, 2024; v1 submitted 16 September, 2021;
originally announced September 2021.
-
FastAudio: A Learnable Audio Front-End for Spoof Speech Detection
Authors:
Quchen Fu,
Zhongwei Teng,
Jules White,
Maria Powell,
Douglas C. Schmidt
Abstract:
Voice assistants, such as smart speakers, have exploded in popularity. It is currently estimated that the smart speaker adoption rate has exceeded 35% in the US adult population. Manufacturers have integrated speaker identification technology, which attempts to determine the identity of the person speaking, to provide personalized services to different members of the same family. Speaker identific…
▽ More
Voice assistants, such as smart speakers, have exploded in popularity. It is currently estimated that the smart speaker adoption rate has exceeded 35% in the US adult population. Manufacturers have integrated speaker identification technology, which attempts to determine the identity of the person speaking, to provide personalized services to different members of the same family. Speaker identification can also play an important role in controlling how the smart speaker is used. For example, it is not critical to correctly identify the user when playing music. However, when reading the user's email out loud, it is critical to correctly verify the speaker that making the request is the authorized user. Speaker verification systems, which authenticate the speaker identity, are therefore needed as a gatekeeper to protect against various spoofing attacks that aim to impersonate the enrolled user. This paper compares popular learnable front-ends which learn the representations of audio by joint training with downstream tasks (End-to-End). We categorize the front-ends by defining two generic architectures and then analyze the filtering stages of both types in terms of learning constraints. We propose replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks. The proposed FastAudio front-end is then tested with two popular back-ends to measure the performance on the LA track of the ASVspoof 2019 dataset. The FastAudio front-end achieves a relative improvement of 27% when compared with fixed front-ends, outperforming all other learnable front-ends on this task.
△ Less
Submitted 6 September, 2021;
originally announced September 2021.
-
Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model
Authors:
Zhongwei Teng,
Quchen Fu,
Jules White,
Maria Powell,
Douglas C. Schmidt
Abstract:
An emerging trend in audio processing is capturing low-level speech representations from raw waveforms. These representations have shown promising results on a variety of tasks, such as speech recognition and speech separation. Compared to handcrafted features, learning speech features via backpropagation provides the model greater flexibility in how it represents data for different tasks theoreti…
▽ More
An emerging trend in audio processing is capturing low-level speech representations from raw waveforms. These representations have shown promising results on a variety of tasks, such as speech recognition and speech separation. Compared to handcrafted features, learning speech features via backpropagation provides the model greater flexibility in how it represents data for different tasks theoretically. However, results from empirical study shows that, in some tasks, such as voice spoof detection, handcrafted features are more competitive than learned features. Instead of evaluating handcrafted features and raw waveforms independently, this paper proposes an Auxiliary Rawnet model to complement handcrafted features with features learned from raw waveforms. A key benefit of the approach is that it can improve accuracy at a relatively low computational cost. The proposed Auxiliary Rawnet model is tested using the ASVspoof 2019 dataset and the results from this dataset indicate that a light-weight waveform encoder can potentially boost the performance of handcrafted-features-based encoders in exchange for a small amount of additional computational work.
△ Less
Submitted 6 September, 2021;
originally announced September 2021.
-
A Miniature Biological Eagle-Eye Vision System for Small Target Detection
Authors:
Shutai Wang,
Qiang Fu,
Yinhao Hu,
Chunhua Zhang,
Wei He
Abstract:
Small target detection is known to be a challenging problem. Inspired by the structural characteristics and physiological mechanism of eagle-eye, a miniature vision system is designed for small target detection in this paper. First, a hardware platform is established, which consists of a pan-tilt, a short-focus camera and a long-focus camera. Then, based on the visual attention mechanism of eagle-…
▽ More
Small target detection is known to be a challenging problem. Inspired by the structural characteristics and physiological mechanism of eagle-eye, a miniature vision system is designed for small target detection in this paper. First, a hardware platform is established, which consists of a pan-tilt, a short-focus camera and a long-focus camera. Then, based on the visual attention mechanism of eagle-eye, the cameras with different focal lengths are controlled cooperatively to achieve small target detection. Experimental results show that the designed biological eagle-eye vision system can accurately detect small targets, which has a strong adaptive ability.
△ Less
Submitted 18 July, 2021;
originally announced July 2021.
-
Vision-Based Target Localization for a Flap**-Wing Aerial Vehicle
Authors:
Xinghao Dong,
Qiang Fu,
Chunhua Zhang,
Wei He
Abstract:
The flap**-wing aerial vehicle (FWAV) is a new type of flying robot that mimics the flight mode of birds and insects. However, FWAVs have their special characteristics of less load capacity and short endurance time, so that most existing systems of ground target localization are not suitable for them. In this paper, a vision-based target localization algorithm is proposed for FWAVs based on a ge…
▽ More
The flap**-wing aerial vehicle (FWAV) is a new type of flying robot that mimics the flight mode of birds and insects. However, FWAVs have their special characteristics of less load capacity and short endurance time, so that most existing systems of ground target localization are not suitable for them. In this paper, a vision-based target localization algorithm is proposed for FWAVs based on a generic camera model. Since sensors exist measurement error and the camera exists jitter and motion blur during flight, Gaussian noises are introduced in the simulation experiment, and then a first-order low-pass filter is used to stabilize the localization values. Moreover, in order to verify the feasibility and accuracy of the target localization algorithm, we design a set of simulation experiments where various noises are added. From the simulation results, it is found that the target localization algorithm has a good performance.
△ Less
Submitted 14 July, 2021;
originally announced July 2021.
-
A survey on computational spectral reconstruction methods from RGB to hyperspectral imaging
Authors:
**gang Zhang,
Runmu Su,
Wenqi Ren,
Qiang Fu,
Felix Heide,
Yunfeng Nie
Abstract:
Hyperspectral imaging enables versatile applications due to its competence in capturing abundant spatial and spectral information, which are crucial for identifying substances. However, the devices for acquiring hyperspectral images are expensive and complicated. Therefore, many alternative spectral imaging methods have been proposed by directly reconstructing the hyperspectral information from lo…
▽ More
Hyperspectral imaging enables versatile applications due to its competence in capturing abundant spatial and spectral information, which are crucial for identifying substances. However, the devices for acquiring hyperspectral images are expensive and complicated. Therefore, many alternative spectral imaging methods have been proposed by directly reconstructing the hyperspectral information from lower-cost, more available RGB images. We present a thorough investigation of these state-of-the-art spectral reconstruction methods from the widespread RGB images. A systematic study and comparison of more than 25 methods has revealed that most of the data-driven deep learning methods are superior to prior-based methods in terms of reconstruction accuracy and quality despite lower speeds. This comprehensive review can serve as a fruitful reference source for peer researchers, thus further inspiring future development directions in related domains.
△ Less
Submitted 13 July, 2022; v1 submitted 30 June, 2021;
originally announced June 2021.
-
Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation
Authors:
Yueyue Na,
Ziteng Wang,
Zhang Liu,
Biao Tian,
Qiang Fu
Abstract:
This paper presents a joint source separation algorithm that simultaneously reduces acoustic echo, reverberation and interfering sources. Target speeches are separated from the mixture by maximizing independence with respect to the other sources. It is shown that the separation process can be decomposed into cascading sub-processes that separately relate to acoustic echo cancellation, speech derev…
▽ More
This paper presents a joint source separation algorithm that simultaneously reduces acoustic echo, reverberation and interfering sources. Target speeches are separated from the mixture by maximizing independence with respect to the other sources. It is shown that the separation process can be decomposed into cascading sub-processes that separately relate to acoustic echo cancellation, speech dereverberation and source separation, all of which are solved using the auxiliary function based independent component/vector analysis techniques, and their solving orders are exchangeable. The cascaded solution not only leads to lower computational complexity but also better separation performance than the vanilla joint algorithm.
△ Less
Submitted 9 April, 2021;
originally announced April 2021.
-
Mask-ToF: Learning Microlens Masks for Flying Pixel Correction in Time-of-Flight Imaging
Authors:
Ilya Chugunov,
Seung-Hwan Baek,
Qiang Fu,
Wolfgang Heidrich,
Felix Heide
Abstract:
We introduce Mask-ToF, a method to reduce flying pixels (FP) in time-of-flight (ToF) depth captures. FPs are pervasive artifacts which occur around depth edges, where light paths from both an object and its background are integrated over the aperture. This light mixes at a sensor pixel to produce erroneous depth estimates, which can adversely affect downstream 3D vision tasks. Mask-ToF starts at t…
▽ More
We introduce Mask-ToF, a method to reduce flying pixels (FP) in time-of-flight (ToF) depth captures. FPs are pervasive artifacts which occur around depth edges, where light paths from both an object and its background are integrated over the aperture. This light mixes at a sensor pixel to produce erroneous depth estimates, which can adversely affect downstream 3D vision tasks. Mask-ToF starts at the source of these FPs, learning a microlens-level occlusion mask which effectively creates a custom-shaped sub-aperture for each sensor pixel. This modulates the selection of foreground and background light mixtures on a per-pixel basis and thereby encodes scene geometric information directly into the ToF measurements. We develop a differentiable ToF simulator to jointly train a convolutional neural network to decode this information and produce high-fidelity, low-FP depth reconstructions. We test the effectiveness of Mask-ToF on a simulated light field dataset and validate the method with an experimental prototype. To this end, we manufacture the learned amplitude mask and design an optical relay system to virtually place it on a high-resolution ToF sensor. We find that Mask-ToF generalizes well to real data without retraining, cutting FP counts in half.
△ Less
Submitted 30 March, 2021;
originally announced March 2021.
-
Continuous body 3-D reconstruction of limbless animals
Authors:
Qiyuan Fu,
Thomas W. Mitchel,
** Seob Kim,
Gregory S. Chirikjian,
Chen Li
Abstract:
Limbless animals such as snakes, limbless lizards, worms, eels, and lampreys move their slender, long bodies in three dimensions to traverse diverse environments. Accurately quantifying their continuous body's 3-D shape and motion is important for understanding body-environment interactions in complex terrain, but this is difficult to achieve (especially for local orientation and rotation). Here,…
▽ More
Limbless animals such as snakes, limbless lizards, worms, eels, and lampreys move their slender, long bodies in three dimensions to traverse diverse environments. Accurately quantifying their continuous body's 3-D shape and motion is important for understanding body-environment interactions in complex terrain, but this is difficult to achieve (especially for local orientation and rotation). Here, we describe an interpolation method to quantify continuous body 3-D position and orientation. We simplify the body as an elastic rod and apply a backbone optimization method to interpolate continuous body shape between end constraints imposed by tracked markers. Despite over-simplifying the biomechanics, our method achieves a higher interpolation accuracy (~50% error) in both 3-D position and orientation compared with the widely-used cubic B-spline interpolation method. Beyond snakes traversing large obstacles as demonstrated, our method applies to other long, slender, limbless animals and continuum robots. We provide codes and demo files for easy application of our method.
△ Less
Submitted 8 March, 2021;
originally announced March 2021.
-
Weighted Recursive Least Square Filter and Neural Network based Residual Echo Suppression for the AEC-Challenge
Authors:
Ziteng Wang,
Yueyue Na,
Zhang Liu,
Biao Tian,
Qiang Fu
Abstract:
This paper presents a real-time Acoustic Echo Cancellation (AEC) algorithm submitted to the AEC-Challenge. The algorithm consists of three modules: Generalized Cross-Correlation with PHAse Transform (GCC-PHAT) based time delay compensation, weighted Recursive Least Square (wRLS) based linear adaptive filtering and neural network based residual echo suppression. The wRLS filter is derived from a no…
▽ More
This paper presents a real-time Acoustic Echo Cancellation (AEC) algorithm submitted to the AEC-Challenge. The algorithm consists of three modules: Generalized Cross-Correlation with PHAse Transform (GCC-PHAT) based time delay compensation, weighted Recursive Least Square (wRLS) based linear adaptive filtering and neural network based residual echo suppression. The wRLS filter is derived from a novel semi-blind source separation perspective. The neural network model predicts a Phase-Sensitive Mask (PSM) based on the aligned reference and the linear filter output. The algorithm achieved a mean subjective score of 4.00 and ranked 2nd in the AEC-Challenge.
△ Less
Submitted 18 February, 2021; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Learning an Adaptive Model for Extreme Low-light Raw Image Processing
Authors:
Qingxu Fu,
Xiaoguang Di,
Yu Zhang
Abstract:
Low-light images suffer from severe noise and low illumination. Current deep learning models that are trained with real-world images have excellent noise reduction, but a ratio parameter must be chosen manually to complete the enhancement pipeline. In this work, we propose an adaptive low-light raw image enhancement network to avoid parameter-handcrafting and to improve image quality. The proposed…
▽ More
Low-light images suffer from severe noise and low illumination. Current deep learning models that are trained with real-world images have excellent noise reduction, but a ratio parameter must be chosen manually to complete the enhancement pipeline. In this work, we propose an adaptive low-light raw image enhancement network to avoid parameter-handcrafting and to improve image quality. The proposed method can be divided into two sub-models: Brightness Prediction (BP) and Exposure Shifting (ES). The former is designed to control the brightness of the resulting image by estimating a guideline exposure time $t_1$. The latter learns to approximate an exposure-shifting operator $ES$, converting a low-light image with real exposure time $t_0$ to a noise-free image with guideline exposure time $t_1$. Additionally, structural similarity (SSIM) loss and Image Enhancement Vector (IEV) are introduced to promote image quality, and a new Campus Image Dataset (CID) is proposed to overcome the limitations of the existing datasets and to supervise the training of the proposed model. Using the proposed model, we can achieve high-quality low-light image enhancement from a single raw image. In quantitative tests, it is shown that the proposed method has the lowest Noise Level Estimation (NLE) score compared with the state-of-the-art low-light algorithms, suggesting a superior denoising performance. Furthermore, those tests illustrate that the proposed method is able to adaptively control the global image brightness according to the content of the image scene. Lastly, the potential application in video processing is briefly discussed.
△ Less
Submitted 22 April, 2020;
originally announced April 2020.
-
Lateral oscillation and body compliance help snakes and snake robots stably traverse large, smooth obstacles
Authors:
Qiyuan Fu,
Sean W. Gart,
Thomas W. Mitchel,
** Seob Kim,
Gregory S. Chirikjian,
Chen Li
Abstract:
Snakes can move through almost any terrain. Similarly, snake robots hold the promise as a versatile platform to traverse complex environments like earthquake rubble. Unlike snake locomotion on flat surfaces which is inherently stable, when snakes traverse complex terrain by deforming their body out of plane, it becomes challenging to maintain stability. Here, we review our recent progress in under…
▽ More
Snakes can move through almost any terrain. Similarly, snake robots hold the promise as a versatile platform to traverse complex environments like earthquake rubble. Unlike snake locomotion on flat surfaces which is inherently stable, when snakes traverse complex terrain by deforming their body out of plane, it becomes challenging to maintain stability. Here, we review our recent progress in understanding how snakes and snake robots traverse large, smooth obstacles that lack anchor points for grip** or bracing. First, we discovered that the generalist variable kingsnake combines lateral oscillation and cantilevering. Regardless of step height and surface friction, the overall gait is preserved. Next, to quantify static stability of the snake, we developed a method to interpolate continuous body in three dimensions (both position and orientation) between discrete tracked markers. By analyzing the base of support using the interpolated continuous body 3-D kinematics, we discovered that the snake maintained perfect stability during traversal, even on the most challenging low friction, high step. Finally, we applied this gait to a snake robot and systematically tested its performance traversing large steps with variable heights to further understand stability principles. The robot rapidly and stably traversed steps nearly as high as a third of its body length. As step height increased, the robot rolled more frequently to the extent of flip** over, reducing traversal probability. The absence of such failure in the snake with a compliant body inspired us to add body compliance to the robot. With better surface contact, the compliant body robot suffered less roll instability and traversed high steps at higher probability, without sacrificing traversal speed. Our robot traversed large step-like obstacles more rapidly than most previous snake robots, approaching that of the animal.
△ Less
Submitted 30 March, 2020;
originally announced March 2020.
-
Robotic modeling of snake traversing large, smooth obstacles reveals stability benefits of body compliance
Authors:
Qiyuan Fu,
Chen Li
Abstract:
Snakes can move through almost any terrain. Although their locomotion on flat surfaces using planar gaits is inherently stable, when snakes deform their body out of plane to traverse complex terrain, maintaining stability becomes a challenge. On trees and desert dunes, snakes grip branches or brace against depressed sand for stability. However, how they stably surmount obstacles like boulders too…
▽ More
Snakes can move through almost any terrain. Although their locomotion on flat surfaces using planar gaits is inherently stable, when snakes deform their body out of plane to traverse complex terrain, maintaining stability becomes a challenge. On trees and desert dunes, snakes grip branches or brace against depressed sand for stability. However, how they stably surmount obstacles like boulders too large and smooth to gain such anchor points is less understood. Similarly, snake robots are challenged to stably traverse large, smooth obstacles for search and rescue and building inspection. Our recent study discovered that snakes combine body lateral undulation and cantilevering to stably traverse large steps. Here, we developed a snake robot with this gait and snake-like anisotropic friction and used it as a physical model to understand stability principles. The robot traversed steps as high as a third of its body length rapidly and stably. However, on higher steps, it was more likely to fail due to more frequent rolling and flip** over, which was absent in the snake with a compliant body. Adding body compliance reduced the robot roll instability by statistically improving surface contact, without reducing speed. Besides advancing understanding of snake locomotion, our robot achieved high traversal speed surpassing most previous snake robots and approaching snakes, while maintaining high traversal probability.
△ Less
Submitted 3 May, 2021; v1 submitted 22 February, 2020;
originally announced February 2020.