-
SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR
Authors:
Qiuming Zhao,
Guangzhi Sun,
Chao Zhang,
Mingxing Xu,
Thomas Fang Zheng
Abstract:
Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters in MoE. Specifically, SAML is applied to quantised and personalised end-to-end automatic speech recognition models and is combined with test-time speaker adaptation to improve the performance of heavily compressed models in speaker-specific scenarios. Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size, 29.1% and 31.1% relative word error rate reductions were achieved on the quantised Whisper model and Conformer-based attention-based encoder-decoder ASR model respectively, compared to the original full precision models.
Submitted 28 June, 2024;
originally announced June 2024.
-
Prioritized experience replay-based DDQN for Unmanned Vehicle Path Planning
Authors:
Liu Lipeng,
Letian Xu,
Jiabei Liu,
Haopeng Zhao,
Tongzhou Jiang,
Tianyao Zheng
Abstract:
The path planning module is a key component of autonomous vehicle navigation and directly affects operating efficiency and safety. In complex environments with many obstacles, traditional planning algorithms often cannot meet the needs of intelligence, which may lead to problems such as dead zones in unmanned vehicles. This paper proposes a path planning algorithm based on DDQN and combines it with the prioritized experience replay method to solve the problem that traditional path planning algorithms often fall into dead zones. A series of simulation experiments shows that the path planning algorithm based on DDQN is significantly better than other methods in terms of speed and accuracy, especially in its ability to break through dead zones in extreme environments. The research shows that the path planning algorithm based on DDQN performs well in terms of path quality and safety. These results provide an important reference for research on automatic navigation of autonomous vehicles.
Submitted 25 June, 2024;
originally announced June 2024.
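Illustration for the entry above: the abstract combines DDQN with prioritized experience replay but does not state which variants are used. The sketch below assumes the standard Double DQN target (the online network selects the next action, the target network evaluates it) and proportional prioritization, where transition i is sampled with probability p_i^alpha / sum_k p_k^alpha and p_i = |TD error| + eps. The function names and the values of `gamma`, `alpha`, and `eps` are hypothetical choices for illustration, not taken from the paper.

```python
import random


def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: online net picks the action, target net scores it."""
    if done:
        return reward
    # Index of the action with the highest online-network Q-value.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: P(i) = p_i**alpha / sum_k p_k**alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, rng=random):
    """Draw replay-buffer indices, favouring transitions with large TD error."""
    probs = sampling_probabilities(td_errors)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
```

With `alpha=0` the sampling reduces to uniform replay; larger `alpha` concentrates sampling on transitions with large temporal-difference error.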
-
Near-Field Wideband Beam Training Based on Distance-Dependent Beam Split
Authors:
Tianyue Zheng,
Mingyao Cui,
Zidong Wu,
Linglong Dai
Abstract:
Near-field beam training is essential for acquiring channel state information in 6G extremely large-scale multiple input multiple output (XL-MIMO) systems. To achieve low-overhead beam training, an existing method leverages the near-field beam split effect, deploying true-time-delay arrays to simultaneously search multiple angles of the entire angular range in a distance ring with a single pilot. However, the method still requires exhaustive search in the distance domain, which limits its efficiency. To address the problem, we propose a distance-dependent beam-split-based beam training method to further reduce the training overhead. Specifically, we first reveal the new phenomenon of distance-dependent beam split, where by manipulating the configurations of time-delay and phase-shift, beams at different frequencies can simultaneously scan the angular domain in multiple distance rings. Leveraging the phenomenon, we propose a near-field beam training method where different angles and distances can be searched simultaneously in one time slot. Thus, a few pilots are capable of covering the whole angle-distance space for wideband XL-MIMO. Theoretical analysis and numerical simulations are also presented to verify the superiority of the proposed method in beamforming gain and training overhead.
Submitted 12 June, 2024;
originally announced June 2024.
-
MuPT: A Generative Symbolic Music Pretrained Transformer
Authors:
Xingwei Qu,
Yuelin Bai,
Yinghao Ma,
Ziya Zhou,
Ka Man Lo,
Jiaheng Liu,
Ruibin Yuan,
Lejun Min,
Xueling Liu,
Tianyu Zhang,
Xinrun Du,
Shuyue Guo,
Yiming Liang,
Yizhi Li,
Shangda Wu,
Junting Zhou,
Tianyu Zheng,
Ziyang Ma,
Fengze Han,
Wei Xue,
Gus Xia,
Emmanouil Benetos,
Xiang Yue,
Chenghua Lin,
Xu Tan
, et al. (4 additional authors not shown)
Abstract:
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
Submitted 10 April, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Authors:
Ruibin Yuan,
Hanfeng Lin,
Yi Wang,
Zeyue Tian,
Shangda Wu,
Tianhao Shen,
Ge Zhang,
Yuhang Wu,
Cong Liu,
Ziya Zhou,
Ziyang Ma,
Liumeng Xue,
Ziyu Wang,
Qin Liu,
Tianyu Zheng,
Yizhi Li,
Yinghao Ma,
Yiming Liang,
Xiaowei Chi,
Ruibo Liu,
Zili Wang,
Pengfei Li,
Jingcheng Wu,
Chenghua Lin,
Qifeng Liu
, et al. (10 additional authors not shown)
Abstract:
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo on GitHub.
Submitted 25 February, 2024;
originally announced February 2024.
-
Contrastive Loss Based Frame-wise Feature disentanglement for Polyphonic Sound Event Detection
Authors:
Yadong Guan,
Jiqing Han,
Hongwei Song,
Wenjie Song,
Guibin Zheng,
Tieran Zheng,
Yongjun He
Abstract:
Overlapping sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them effectively. A critical reason is that these methods represent overlapping events using shared and entangled frame-wise features, which degrades the feature discrimination. To solve the problem, we propose a disentangled feature learning framework to learn a category-specific representation. Specifically, we employ different projectors to learn the frame-wise features for each category. To ensure that these features do not contain information of other categories, we maximize the common information between frame-wise features within the same category and propose a frame-wise contrastive loss. In addition, considering that the labeled data used by the proposed method is limited, we propose a semi-supervised frame-wise contrastive loss that can leverage large amounts of unlabeled data to achieve feature disentanglement. The experimental results demonstrate the effectiveness of our method.
Submitted 11 January, 2024;
originally announced January 2024.
-
Coded Beam Training
Authors:
Tianyue Zheng,
Jieao Zhu,
Qiumo Yu,
Yongli Yan,
Linglong Dai
Abstract:
In extremely large-scale multiple input multiple output (XL-MIMO) systems for future sixth-generation (6G) communications, codebook-based beam training stands out as a promising technology to acquire channel state information (CSI). Despite their effectiveness, when the pilot overhead is limited, existing beam training methods suffer from significant achievable rate degradation for remote users with low signal-to-noise ratio (SNR). To tackle this challenge, leveraging the error-correcting capability of channel codes, we introduce channel coding theory into hierarchical beam training to extend the coverage area. Specifically, we establish the duality between hierarchical beam training and channel coding, and the proposed coded beam training scheme serves as a general framework. Then, we present two specific implementations exemplified by coded beam training methods based on Hamming codes and convolutional codes, during which the beam encoding and decoding processes are refined respectively to better accommodate the beam training problem. Simulation results have demonstrated that the proposed coded beam training method can enable reliable beam training performance for remote users with low SNR while keeping training overhead low.
Submitted 6 March, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
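Illustration for the entry above: the abstract relies on the error-correcting capability of channel codes and names Hamming codes as one implementation. The sketch below shows only that capability, using a textbook Hamming(7,4) code that corrects any single flipped bit. How the paper maps codewords onto beams is not described in the abstract and is not attempted here; the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p4, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one bit error and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based error position.
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

For example, encoding `[1, 0, 1, 1]`, flipping any one of the seven bits, and decoding recovers `[1, 0, 1, 1]`.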
-
Closed-Loop Motion Planning for Differentially Flat Systems: A Time-Varying Optimization Framework
Authors:
Tianqi Zheng,
John W. Simpson-Porco,
Enrique Mallada
Abstract:
Motion planning and control are two core components of the robotic systems autonomy stack. The standard approach to combine these methodologies comprises an offline/open-loop stage, planning, that designs a feasible and safe trajectory to follow, and an online/closed-loop stage, tracking, that corrects for unmodeled dynamics and disturbances. Such an approach generally introduces conservativeness into the planning stage, which becomes difficult to overcome as the model complexity increases and real-time decisions need to be made in a changing environment. This work addresses these challenges for the class of differentially flat nonlinear systems by integrating planning and control into a cohesive closed-loop task. Precisely, we develop an optimization-based framework that aims to steer a differentially flat system to a trajectory implicitly defined via a constrained time-varying optimization problem. To that end, we generalize the notion of feedback linearization, which makes non-linear systems behave as linear systems, and develop controllers that effectively transform a differentially flat system into an optimization algorithm that seeks to find the optimal solution of a (possibly time-varying) optimization problem. Under sufficient regularity assumptions, we prove global asymptotic convergence for the optimization dynamics to the minimizer of the time-varying optimization problem. We illustrate the effectiveness of our method with two numerical examples: a multi-robot tracking problem and a robot obstacle avoidance problem.
Submitted 19 October, 2023;
originally announced October 2023.
-
HoloFed: Environment-Adaptive Positioning via Multi-band Reconfigurable Holographic Surfaces and Federated Learning
Authors:
Jingzhi Hu,
Zhe Chen,
Tianyue Zheng,
Robert Schober,
Jun Luo
Abstract:
Positioning is an essential service for various applications and is expected to be integrated with existing communication infrastructures in 5G and 6G. Though current Wi-Fi and cellular base stations (BSs) can be used to support this integration, the resulting precision is unsatisfactory due to the lack of precise control of the wireless signals. Recently, BSs adopting reconfigurable holographic surfaces (RHSs) have been advocated for positioning as RHSs' large number of antenna elements enable generation of arbitrary and highly-focused signal beam patterns. However, existing designs face two major challenges: i) RHSs only have limited operating bandwidth, and ii) the positioning methods cannot adapt to the diverse environments encountered in practice. To overcome these challenges, we present HoloFed, a system providing high-precision environment-adaptive user positioning services by exploiting multi-band(MB)-RHS and federated learning (FL). For improving the positioning performance, a lower bound on the error variance is obtained and utilized for guiding MB-RHS's digital and analog beamforming design. For better adaptability while preserving privacy, an FL framework is proposed for users to collaboratively train a position estimator, where we exploit the transfer learning technique to handle the lack of position labels of the users. Moreover, a scheduling algorithm for the BS to select which users train the position estimator is designed, jointly considering the convergence and efficiency of FL. Our simulation results confirm that HoloFed achieves a 57% lower positioning error variance compared to a beam-scanning baseline and can effectively adapt to diverse environments.
Submitted 10 October, 2023;
originally announced October 2023.
-
Enhancing Quantised End-to-End ASR Models via Personalisation
Authors:
Qiuming Zhao,
Guangzhi Sun,
Chao Zhang,
Mingxing Xu,
Thomas Fang Zheng
Abstract:
Recent end-to-end automatic speech recognition (ASR) models have become increasingly large, making them particularly challenging to deploy on resource-constrained devices. Model quantisation is an effective solution, but it sometimes causes the word error rate (WER) to increase. In this paper, a novel strategy of personalisation for a quantised model (PQM) is proposed, which combines speaker adaptive training (SAT) with model quantisation to improve the performance of heavily compressed models. Specifically, PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for SAT. Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size and 1% additional speaker-specific parameters, 15.1% and 23.3% relative WER reductions were achieved on quantised Whisper and Conformer-based attention-based encoder-decoder ASR models respectively, compared to the original full precision models.
Submitted 16 September, 2023;
originally announced September 2023.
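Illustration for the entry above: the abstract uses low-rank adaptation (LoRA) for speaker adaptation. A minimal sketch of the general LoRA idea follows, assuming the usual formulation in which a frozen weight matrix W is augmented by a trainable low-rank product, y = W x + scale * B (A x), with A of shape r x d_in and B of shape d_out x r. The helper names, the `scale` parameter, and the layer sizes in the usage note are hypothetical; the paper's actual layer choices and ranks are not given in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, scale=1.0):
    """Compute y = W x + scale * B (A x); only A and B would be trained."""
    base = matvec(w, x)                # frozen (possibly quantised) path
    update = matvec(b, matvec(a, x))   # low-rank trainable path
    return [y0 + scale * u for y0, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Number of parameters in the LoRA factors A and B."""
    return rank * (d_in + d_out)
```

For a hypothetical 512 x 512 layer with rank 8, `trainable_params(512, 512, 8)` gives 8192 adapter parameters against 262144 in the full matrix, which is why per-speaker adapters can stay small.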
-
A Motion Assessment Method for Reference Stack Selection in Fetal Brain MRI Reconstruction Based on Tensor Rank Approximation
Authors:
Haoan Xu,
Wen Shi,
Jiwei Sun,
Tianshu Zheng,
Cong Sun,
Sun Yi,
Guangbin Wang,
Dan Wu
Abstract:
Purpose: Slice-to-volume registration and super-resolution reconstruction (SVR-SRR) is commonly used to generate 3D volumes of the fetal brain from 2D stacks of slices acquired in multiple orientations. A critical initial step in this pipeline is to select one stack with the minimum motion as a reference for registration. An accurate and unbiased motion assessment (MA) is thus crucial for successful selection. Methods: We presented an MA method that determines the minimum motion stack based on 3D low-rank approximation using CANDECOMP/PARAFAC (CP) decomposition. Compared to the current 2D singular value decomposition (SVD) based method that requires flattening stacks into matrices to obtain ranks, in which the spatial information is lost, the CP-based method can factorize a 3D stack into low-rank and sparse components in a computationally efficient manner. The difference between the original stack and its low-rank approximation was proposed as the motion indicator. Results: Compared to SVD-based methods, our proposed CP-based MA demonstrated higher sensitivity in detecting small motion with a lower baseline bias. Experiments on randomly simulated motion illustrated that the proposed CP method achieved a higher success rate of 95.45% in identifying the minimum motion stack, compared to the SVD-based method with a success rate of 58.18%. We further demonstrated that combining CP-based MA with the existing SVR-SRR pipeline significantly improved 3D volume reconstruction. Conclusion: The proposed CP-based MA method showed superior performance compared to SVD-based methods with higher sensitivity to motion, success rate, and lower baseline bias, and can be used as a prior step to improve fetal brain reconstruction.
Submitted 30 June, 2023;
originally announced June 2023.
-
STAR-RIS Assisted Covert Communications in NOMA Systems
Authors:
Han Xiao,
Xiaoyan Hu,
Tong-Xing Zheng,
Kai-Kit Wong
Abstract:
Covert communications assisted by simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) in non-orthogonal multiple access (NOMA) systems have been explored in this paper. In particular, the access point (AP) transmitter adopts NOMA to serve a downlink covert user and a public user. The minimum detection error probability (DEP) at the warden is derived considering the uncertainty of its background noise, which is used as a covertness constraint. We aim at maximizing the covert rate of the system by jointly optimizing the AP's transmit power and the passive beamforming of the STAR-RIS, under the covertness and quality of service (QoS) constraints. An iterative algorithm is proposed to effectively solve the non-convex optimization problem. Simulation results show that the proposed scheme significantly outperforms the conventional RIS-based scheme in ensuring system covert performance.
Submitted 12 June, 2023;
originally announced June 2023.
-
STAR-RIS Aided Covert Communication
Authors:
Han Xiao,
Xiaoyan Hu,
Pengcheng Mu,
Wenjie Wang,
Tong-Xing Zheng,
Kai-Kit Wong,
Kun Yang
Abstract:
This paper investigates the multi-antenna covert communications assisted by a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS). In particular, to shelter the existence of communications between transmitter and receiver from a warden, a friendly full-duplex receiver with two antennas is leveraged to make contributions to confuse the warden. Considering the worst case, the closed-form expression of the minimum detection error probability (DEP) at the warden is derived and utilized as a covert constraint. Then, we formulate an optimization problem maximizing the covert rate of the system under the covertness constraint and quality of service (QoS) constraint with communication outage analysis. To jointly design the active and passive beamforming of the transmitter and STAR-RIS, an iterative algorithm based on globally convergent version of method of moving asymptotes (GCMMA) is proposed to effectively solve the non-convex optimization problem. Simulation results show that the proposed STAR-RIS-assisted scheme highly outperforms the case with conventional RIS.
Submitted 30 August, 2023; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Time-weighted Frequency Domain Audio Representation with GMM Estimator for Anomalous Sound Detection
Authors:
Jian Guan,
Youde Liu,
Qiaoxi Zhu,
Tieran Zheng,
Jiqing Han,
Wenwu Wang
Abstract:
Although deep learning is the mainstream method in unsupervised anomalous sound detection, a Gaussian Mixture Model (GMM) with a statistical audio frequency representation as input can achieve comparable results with much lower model complexity and fewer parameters. Existing statistical frequency representations, e.g., the log-Mel spectrogram's average or maximum over time, do not always work well for different machines. This paper presents Time-Weighted Frequency Domain Representation (TWFR) with the GMM method (TWFR-GMM) for anomalous sound detection. The TWFR is a generalized statistical frequency domain representation that can adapt to different machine types, using global weighted ranking pooling over the time domain. This allows the GMM estimator to recognize anomalies, even under domain-shift conditions, as visualized with a Mahalanobis distance-based metric. Experiments on the DCASE 2022 Challenge Task2 dataset show that our method has better detection performance than recent deep learning methods. TWFR-GMM is the core of our submission that achieved 3rd place in DCASE 2022 Challenge Task2.
Submitted 5 May, 2023;
originally announced May 2023.
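Illustration for the entry above: the abstract describes TWFR as a generalization of the time-average and time-maximum of a spectrogram, obtained with global weighted ranking pooling. The sketch below assumes the common definition of that pooling: values in each frequency bin are sorted in descending order over time and weighted by r^(i-1), then normalized. Under this assumption `r = 1` gives the mean and `r = 0` gives the maximum. The exact weighting used in the paper is not stated in the abstract, and the function names are hypothetical.

```python
def gwrp(values, r):
    """Global weighted ranking pooling of one frequency bin over time."""
    ordered = sorted(values, reverse=True)
    # Weight r**i for the i-th largest value (Python evaluates 0**0 as 1).
    weights = [r ** i for i in range(len(ordered))]
    return sum(w * v for w, v in zip(weights, ordered)) / sum(weights)


def twfr(spectrogram, r):
    """Pool each frequency bin (a list of values over time) to one number."""
    return [gwrp(bin_over_time, r) for bin_over_time in spectrogram]
```

The resulting vector, one value per frequency bin, is the kind of fixed-length input a GMM could then be fitted on; tuning `r` per machine type is one way such a representation could adapt between mean-like and max-like behaviour.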
-
Novel Quality Measure and Efficient Resolution of Convex Hull Pricing for Unit Commitment
Authors:
Mikhail A. Bragin,
Farhan Hyder,
Bing Yan,
Peter B. Luh,
Jinye Zhao,
Feng Zhao,
Dane A. Schiro,
Tongxin Zheng
Abstract:
Electricity prices determined by economic dispatch that do not consider fixed costs may lead to significant uplift payments. However, when fixed costs are included, prices become non-monotonic with respect to demand, which can adversely impact market transparency. To overcome this issue, convex hull (CH) pricing has been introduced for unit commitment with fixed costs. Several CH pricing methods have been presented, and a feasible cost has been used as a quality measure for the CH price. However, obtaining a feasible cost requires a computationally intensive optimization procedure, and the associated duality gap may not provide an accurate quality measure. This paper presents a new approach for quantifying the quality of the CH price by establishing an upper bound on the optimal dual value. The proposed approach uses Surrogate Lagrangian Relaxation (SLR) to efficiently obtain near-optimal CH prices, while the upper bound decreases rapidly due to the convergence of SLR. Testing results on the IEEE 118-bus system demonstrate that the novel quality measure is more accurate than the measure provided by a feasible cost, indicating the high quality of the upper bound and the efficiency of SLR.
Submitted 17 April, 2023;
originally announced April 2023.
-
Full State Estimation of Continuum Robots From Tip Velocities: A Cosserat-Theoretic Boundary Observer
Authors:
Tongjia Zheng,
Qing Han,
Hai Lin
Abstract:
State estimation of robotic systems is essential to implementing feedback controllers which usually provide better robustness to modeling uncertainties than open-loop controllers. However, state estimation of soft robots is very challenging because soft robots have theoretically infinite degrees of freedom while existing sensors only provide a limited number of discrete measurements. In this paper, we design an observer for soft continuum robotic arms based on the well-known Cosserat rod theory which models continuum robotic arms by nonlinear partial differential equations (PDEs). The observer is able to estimate all the continuum (infinite-dimensional) robot states (poses, strains, and velocities) by only sensing the tip velocity of the continuum robot (and hence it is called a ``boundary'' observer). More importantly, the estimation error dynamics is formally proven to be locally input-to-state stable. The key idea is to inject sequential tip velocity measurements into the observer in a way that dissipates the energy of the estimation errors through the boundary. Furthermore, this boundary observer can be implemented by simply changing a boundary condition in any numerical solvers of Cosserat rod models. Extensive numerical studies are included and suggest that the domain of attraction is large and the observer is robust to uncertainties of tip velocity measurements and model parameters.
Submitted 26 June, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Multi-Robot-Guided Crowd Evacuation: Two-Scale Modeling and Control
Authors:
Tongjia Zheng,
Zhenyuan Yuan,
Mollik Nayyar,
Alan R. Wagner,
Minghui Zhu,
Hai Lin
Abstract:
Emergency evacuation describes a complex situation involving time-critical decision-making by evacuees. Mobile robots are being actively explored as a potential solution to provide timely guidance. In this work, we study a robot-guided crowd evacuation problem where a small group of robots is used to guide a large human crowd to safe locations. The challenge lies in how to use micro-level human-robot interactions to indirectly influence a population that significantly outnumbers the robots to achieve the collective evacuation objective. To address the challenge, we follow a two-scale modeling strategy and explore hydrodynamic models, which consist of a family of microscopic social force models that describe how human movements are locally affected by other humans, the environment, and robots, and associated macroscopic equations for the temporal and spatial evolution of the crowd density and flow velocity. We design controllers for the robots such that they not only automatically explore the environment (with unknown dynamic obstacles) to cover it as much as possible, but also dynamically adjust the directions of their local navigation force fields based on the real-time macrostates of the crowd to guide the crowd to a safe location. We prove the stability of the proposed evacuation algorithm and conduct extensive simulations to investigate the performance of the algorithm with different combinations of human numbers, robot numbers, and obstacle settings.
Submitted 11 January, 2024; v1 submitted 28 February, 2023;
originally announced February 2023.
-
Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics
Authors:
Tianqi Zheng,
Pengcheng You,
Enrique Mallada
Abstract:
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward while satisfying minimum requirements in secondary cumulative reward constraints. Several algorithms rooted in sample-based primal-dual methods have been recently proposed to solve this problem in policy space. However, such methods are based on stochastic gradient descent-ascent algorithms whose trajectories are connected to the optimal policy only after a mixing output stage that depends on the algorithm's history. As a result, there is a mismatch between the behavioral policy and the optimal one. In this work, we propose a novel algorithm for constrained RL that does not suffer from these limitations. Leveraging recent results on regularized saddle-flow dynamics, we develop a novel stochastic gradient descent-ascent algorithm whose trajectories converge to the optimal policy almost surely.
Submitted 2 December, 2022;
originally announced December 2022.
-
A Linearly Convergent Algorithm for Rotationally Invariant $\ell_1$-Norm Principal Component Analysis
Authors:
Taoli Zheng,
Peng Wang,
Anthony Man-Cho So
Abstract:
To do dimensionality reduction on the datasets with outliers, the $\ell_1$-norm principal component analysis (L1-PCA) as a typical robust alternative of the conventional PCA has enjoyed great popularity over the past years. In this work, we consider a rotationally invariant L1-PCA, which is hardly studied in the literature. To tackle it, we propose a proximal alternating linearized minimization method with a nonlinear extrapolation for solving its two-block reformulation. Moreover, we show that the proposed method converges at least linearly to a limiting critical point of the reformulated problem. Such a point is proved to be a critical point of the original problem under a condition imposed on the step size. Finally, we conduct numerical experiments on both synthetic and real datasets to support our theoretical developments and demonstrate the efficacy of our approach.
Submitted 26 October, 2022; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Task Space Tracking of Soft Manipulators: Inner-Outer Loop Control Based on Cosserat-Rod Models
Authors:
Tongjia Zheng,
Qing Han,
Hai Lin
Abstract:
Soft robots are robotic systems made of deformable materials and exhibit unique flexibility that can be exploited for complex environments and tasks. However, their control problem has been considered a challenging subject because they are of infinite degrees of freedom and highly under-actuated. Existing studies have mainly relied on simplified and approximated finite-dimensional models. In this work, we exploit infinite-dimensional nonlinear control for soft robots. We adopt the Cosserat-rod theory and employ nonlinear partial differential equations (PDEs) to model the kinematics and dynamics of soft manipulators, including their translational motions (for shear and elongation) and rotational motions (for bending and torsion). The objective is to achieve position tracking of the whole manipulator in a planar task space by controlling the moments (generated by actuators). The control design is inspired by the energy decay property of damped wave equations and has an inner-outer loop structure. In the outer loop, we design desired rotational motions that rotate the translational component into a direction that asymptotically dissipates the energy associated with position tracking errors. In the inner loop, we design inputs for the rotational components to track their desired motions, again by dissipating the rotational energy. We prove that the closed-loop system is exponentially stable and evaluate its performance through simulations.
Submitted 3 October, 2022;
originally announced October 2022.
-
Multi-Robot-Assisted Human Crowd Evacuation using Navigation Velocity Fields
Authors:
Tongjia Zheng,
Zhenyuan Yuan,
Mollik Nayyar,
Alan R. Wagner,
Minghui Zhu,
Hai Lin
Abstract:
This work studies a robot-assisted crowd evacuation problem where we control a small group of robots to guide a large human crowd to safe locations. The challenge lies in how to model human-robot interactions and design robot controls to indirectly control a human population that significantly outnumbers the robots. To address the challenge, we treat the crowd as a continuum and formulate the evacuation objective as driving the crowd density to target locations. We propose a novel mean-field model which consists of a family of microscopic equations that explicitly model how human motions are locally guided by the robots and an associated macroscopic equation that describes how the crowd density is controlled by the navigation velocity fields generated by all robots. Then, we design density feedback controllers for the robots to dynamically adjust their states such that the generated navigation velocity fields drive the crowd density to a target density. Stability guarantees of the proposed controllers are proven. Agent-based simulations are included to evaluate the proposed evacuation algorithms.
Submitted 20 September, 2022;
originally announced September 2022.
-
Safe Human-Robot Collaborative Transportation via Trust-Driven Role Adaptation
Authors:
Tony Zheng,
Monimoy Bujarbaruah,
Yvonne R. Stürz,
Francesco Borrelli
Abstract:
We study a human-robot collaborative transportation task in the presence of obstacles. The task for each agent is to carry a rigid object to a common target position, while safely avoiding obstacles and satisfying the compliance and actuation constraints of the other agent. Human and robot do not share the local view of the environment. The human policy either assists the robot when they deem the robot actions safe based on their perception of the environment, or actively leads the task. Using estimated human inputs, the robot plans a trajectory for the transported object by solving a constrained finite time optimal control problem. Sensors on the robot measure the inputs applied by the human. The robot then appropriately applies a weighted combination of the human's applied and its own planned inputs, where the weights are chosen based on the robot's trust value on its estimates of the human's inputs. This allows for a dynamic leader-follower role adaptation of the robot throughout the task. Furthermore, under a low value of trust, if the robot approaches any obstacle potentially unknown to the human, it triggers a safe stopping policy, maintaining safety of the system and signaling a required change in the human's intent. With experimental results, we demonstrate the efficacy of the proposed approach.
Submitted 12 July, 2022;
originally announced July 2022.
-
Global Contrast Masked Autoencoders Are Powerful Pathological Representation Learners
Authors:
Hao Quan,
Xingyu Li,
Weixing Chen,
Qun Bai,
Mingchen Zou,
Ruijie Yang,
Tingting Zheng,
Ruiqun Qi,
Xinghua Gao,
Xiaoyu Cui
Abstract:
Based on digital pathology slice scanning technology, artificial intelligence algorithms represented by deep learning have achieved remarkable results in the field of computational pathology. Compared to other medical images, pathology images are more difficult to annotate, and thus, there is an extreme lack of available datasets for conducting supervised learning to train robust deep learning models. In this paper, we propose a self-supervised learning (SSL) model, the global contrast-masked autoencoder (GCMAE), which trains the encoder to represent local-global features of pathological images and significantly improves the performance of transfer learning across datasets. In this study, the ability of the GCMAE to learn transferable representations was demonstrated through extensive experiments using a total of three different disease-specific hematoxylin and eosin (HE)-stained pathology datasets: Camelyon16, NCTCRC and BreakHis. In addition, this study designed an effective automated pathology diagnosis process based on the GCMAE for clinical applications. The source code of this paper is publicly available at https://github.com/StarUniversus/gcmae.
Submitted 15 November, 2023; v1 submitted 18 May, 2022;
originally announced May 2022.
-
A microstructure estimation Transformer inspired by sparse representation for diffusion MRI
Authors:
Tianshu Zheng,
Cong Sun,
Weihao Zheng,
Wen Shi,
Haotian Li,
Yi Sun,
Yi Zhang,
Guangbin Wang,
Chuyang Ye,
Dan Wu
Abstract:
Diffusion magnetic resonance imaging (dMRI) is an important tool in characterizing tissue microstructure based on biophysical models, which are complex and highly non-linear. Resolving microstructures with optimization techniques is prone to estimation errors and requires dense sampling in the q-space. Deep learning based approaches have been proposed to overcome these limitations. Motivated by the superior performance of the Transformer, in this work, we present a learning-based framework based on Transformer, namely, a Microstructure Estimation Transformer with Sparse Coding (METSC) for dMRI-based microstructure estimation with downsampled q-space data. To take advantage of the Transformer while addressing its limitation in large training data requirements, we explicitly introduce an inductive bias - model bias into the Transformer using a sparse coding technique to facilitate the training process. Thus, the METSC is composed of three stages: an embedding stage, a sparse representation stage, and a mapping stage. The embedding stage is a Transformer-based structure that encodes the signal to ensure the voxel is represented effectively. In the sparse representation stage, a dictionary is constructed by solving a sparse reconstruction problem that unfolds the Iterative Hard Thresholding (IHT) process. The mapping stage is essentially a decoder that computes the microstructural parameters from the output of the second stage, based on the weighted sum of normalized dictionary coefficients where the weights are also learned. We tested our framework on two dMRI models with downsampled q-space data, including the intravoxel incoherent motion (IVIM) model and the neurite orientation dispersion and density imaging (NODDI) model. The proposed method achieved up to 11.25-fold acceleration in scan time and outperformed the other state-of-the-art learning-based methods.
Submitted 13 May, 2022;
originally announced May 2022.
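For readers unfamiliar with the Iterative Hard Thresholding (IHT) process mentioned in the abstract above, the following is a minimal sketch of classical, non-learned IHT for sparse recovery. It is not the METSC network, which unfolds IHT into learned layers; the names `hard_threshold` and `iht`, the zero initialisation, and the step size of one over the squared spectral norm of the dictionary are illustrative assumptions.

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    if s > 0:
        keep = np.argsort(np.abs(x))[-s:]
        out[keep] = x[keep]
    return out

def iht(A, y, s, n_iter=100):
    """Classical IHT for min ||y - A x||^2 subject to x having at most s nonzeros.

    A: (m, n) dictionary, y: (m,) signal, s: sparsity level.
    The step size 1 / ||A||_2^2 is an assumption chosen so that the
    residual does not increase from one iteration to the next.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step on the least-squares term, then projection
        # onto the set of s-sparse vectors.
        x = hard_threshold(x + step * A.T @ (y - A @ x), s)
    return x

if __name__ == "__main__":
    # With an identity dictionary, an s-sparse signal is recovered exactly.
    y = np.array([0.0, 3.0, 0.0, -2.0, 0.0])
    print(iht(np.eye(5), y, s=2))
```

An unfolded variant would replace the fixed `step` and `A` with parameters learned per iteration, which is the general idea the abstract refers to.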
-
AFFIRM: Affinity Fusion-based Framework for Iteratively Random Motion correction of multi-slice fetal brain MRI
Authors:
Wen Shi,
Haoan Xu,
Cong Sun,
Jiwei Sun,
Yamin Li,
Xinyi Xu,
Tianshu Zheng,
Yi Zhang,
Guangbin Wang,
Dan Wu
Abstract:
Multi-slice magnetic resonance images of the fetal brain are usually contaminated by severe and arbitrary fetal and maternal motion. Hence, stable and robust motion correction is necessary to reconstruct high-resolution 3D fetal brain volume for clinical diagnosis and quantitative analysis. However, the conventional registration-based correction has a limited capture range and is insufficient for detecting relatively large motions. Here, we present a novel Affinity Fusion-based Framework for Iteratively Random Motion (AFFIRM) correction of the multi-slice fetal brain MRI. It learns the sequential motion from multiple stacks of slices and integrates the features between 2D slices and reconstructed 3D volume using affinity fusion, which resembles the iterations between slice-to-volume registration and volumetric reconstruction in the regular pipeline. The method accurately estimates the motion regardless of brain orientations and outperforms other state-of-the-art learning-based methods on the simulated motion-corrupted data, with a 48.4% reduction of mean absolute error for rotation and 61.3% for displacement. We then incorporated AFFIRM into the multi-resolution slice-to-volume registration and tested it on the real-world fetal MRI scans at different gestation stages. The results indicated that adding AFFIRM to the conventional pipeline improved the success rate of fetal brain super-resolution reconstruction from 77.2% to 91.9%.
Submitted 11 May, 2022;
originally announced May 2022.
-
A Deep Reinforcement Learning Framework for Rapid Diagnosis of Whole Slide Pathological Images
Authors:
Tingting Zheng,
Weixing Chen,
Shuqin Li,
Hao Quan,
Qun Bai,
Tianhang Nan,
Song Zheng,
Xinghua Gao,
Yue Zhao,
Xiaoyu Cui
Abstract:
The deep neural network is a research hotspot for histopathological image analysis, which can improve the efficiency and accuracy of diagnosis for pathologists or be used for disease screening. The whole slide pathological image can reach one gigapixel and contains abundant tissue feature information, which needs to be divided into a lot of patches in the training and inference stages. This will lead to a long convergence time and large memory consumption. Furthermore, well-annotated data sets are also in short supply in the field of digital pathology. Inspired by the pathologist's clinical diagnosis process, we propose a weakly supervised deep reinforcement learning framework, which can greatly reduce the time required for network inference. We use neural network to construct the search model and decision model of reinforcement learning agent respectively. The search model predicts the next action through the image features of different magnifications in the current field of view, and the decision model is used to return the predicted probability of the current field of view image. In addition, an expert-guided model is constructed by multi-instance learning, which not only provides rewards for search model, but also guides decision model learning by the knowledge distillation method. Experimental results show that our proposed method can achieve fast inference and accurate prediction of whole slide images without any pixel-level annotations.
Submitted 5 May, 2022;
originally announced May 2022.
-
Physical layer security in large-scale random multiple access wireless sensor networks: a stochastic geometry approach
Authors:
Tong-Xing Zheng,
Xin Chen,
Chao Wang,
Kai-Kit Wong,
Jinhong Yuan
Abstract:
This paper investigates physical layer security for a large-scale WSN with random multiple access, where each fusion center in the network randomly schedules a number of sensors to upload their sensed data subject to the overhearing of randomly distributed eavesdroppers. We propose an uncoordinated random jamming scheme in which those unscheduled sensors send jamming signals with a certain probability to defeat the eavesdroppers. With the aid of stochastic geometry theory and order statistics, we derive analytical expressions for the connection outage probability and secrecy outage probability to characterize transmission reliability and secrecy, respectively. Based on the obtained analytical results, we formulate an optimization problem for maximizing the sum secrecy throughput subject to both reliability and secrecy constraints, considering a joint design of the wiretap code rates for each scheduled sensor and the jamming probability for the unscheduled sensors. We provide both optimal and low-complexity sub-optimal algorithms to tackle the above problem, and further reveal various properties on the optimal parameters which are useful to guide practical designs. In particular, we demonstrate that the proposed random jamming scheme is beneficial for improving the sum secrecy throughput, and the optimal jamming probability is the result of trade-off between secrecy and throughput. We also show that the throughput performance of the sub-optimal scheme approaches that of the optimal one when facing a stringent reliability constraint or a loose secrecy constraint.
Submitted 13 April, 2022;
originally announced April 2022.
-
PDE-based Dynamic Control and Estimation of Soft Robotic Arms
Authors:
Tongjia Zheng,
Hai Lin
Abstract:
Compared with traditional rigid-body robots, soft robots not only exhibit unprecedented adaptation and flexibility but also present novel challenges in their modeling and control because of their infinite degrees of freedom. Most of the existing approaches have mainly relied on approximated models so that the well-developed finite-dimensional control theory can be exploited. However, this may bring in modeling uncertainty and performance degradation. Hence, we propose to exploit infinite-dimensional analysis for soft robotic systems. Our control design is based on the increasingly adopted Cosserat rod model, which describes the kinematics and dynamics of soft robotic arms using nonlinear partial differential equations (PDE). We design infinite-dimensional state feedback control laws for the Cosserat PDE model to achieve trajectory tracking (consisting of position, rotation, linear and angular velocities) and prove their uniform tracking convergence. We also design an infinite-dimensional extended Kalman filter on Lie groups for the PDE system to estimate all the state variables (including position, rotation, strains, curvature, linear and angular velocities) using only position measurements. The proposed algorithms are evaluated using simulations.
Submitted 20 September, 2022; v1 submitted 25 March, 2022;
originally announced March 2022.
-
COVID-19 Infection Segmentation from Chest CT Images Based on Scale Uncertainty
Authors:
Masahiro Oda,
Tong Zheng,
Yuichiro Hayashi,
Yoshito Otake,
Masahiro Hashimoto,
Toshiaki Akashi,
Shigeki Aoki,
Kensaku Mori
Abstract:
This paper proposes a segmentation method of infection regions in the lung from CT volumes of COVID-19 patients. COVID-19 spread worldwide, causing many infected patients and deaths. CT image-based diagnosis of COVID-19 can provide quick and accurate diagnosis results. An automated segmentation method of infection regions in the lung provides a quantitative criterion for diagnosis. Previous methods employ whole 2D image or 3D volume-based processes. Infection regions have a considerable variation in their sizes. Such processes easily miss small infection regions. Patch-based process is effective for segmenting small targets. However, selecting the appropriate patch size is difficult in infection region segmentation. We utilize the scale uncertainty among various receptive field sizes of a segmentation FCN to obtain infection regions. The receptive field sizes can be defined as the patch size and the resolution of volumes where patches are clipped from. This paper proposes an infection segmentation network (ISNet) that performs patch-based segmentation and a scale uncertainty-aware prediction aggregation method that refines the segmentation result. We design ISNet to segment infection regions that have various intensity values. ISNet has multiple encoding paths to process patch volumes normalized by multiple intensity ranges. We collect prediction results generated by ISNets having various receptive field sizes. Scale uncertainty among the prediction results is extracted by the prediction aggregation method. We use an aggregation FCN to generate a refined segmentation result considering scale uncertainty among the predictions. In our experiments using 199 chest CT volumes of COVID-19 cases, the prediction aggregation method improved the dice similarity score from 47.6% to 62.1%.
Submitted 9 January, 2022;
originally announced January 2022.
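The Dice similarity score reported in the abstract above is a standard overlap measure between a predicted and a reference mask, defined as twice the intersection divided by the sum of the two mask sizes. The sketch below shows that definition for binary masks; the function name `dice_score` and the convention for two empty masks are assumptions, and the paper's own evaluation code may differ.

```python
import numpy as np

def dice_score(pred, target):
    """Dice similarity 2|P ∩ T| / (|P| + |T|) for binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom

if __name__ == "__main__":
    # One overlapping voxel, two voxels in each mask: 2 * 1 / (2 + 2) = 0.5
    print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```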
-
Subject-Independent Drowsiness Recognition from Single-Channel EEG with an Interpretable CNN-LSTM model
Authors:
Jian Cui,
Zirui Lan,
Tianhu Zheng,
Yisi Liu,
Olga Sourina,
Lipo Wang,
Wolfgang Müller-Wittig
Abstract:
For EEG-based drowsiness recognition, it is desirable to use subject-independent recognition since conducting calibration on each subject is time-consuming. In this paper, we propose a novel Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) model for subject-independent drowsiness recognition from single-channel EEG signals. Different from existing deep learning models that are mostly treated as black-box classifiers, the proposed model can explain its decisions for each input sample by revealing which parts of the sample contain important features identified by the model for classification. This is achieved by a visualization technique by taking advantage of the hidden states output by the LSTM layer. Results show that the model achieves an average accuracy of 72.97% on 11 subjects for leave-one-out subject-independent drowsiness recognition on a public dataset, which is higher than the conventional baseline methods of 55.42%-69.27%, and state-of-the-art deep learning methods. Visualization results show that the model has discovered meaningful patterns of EEG signals related to different mental states across different subjects.
Submitted 21 November, 2021;
originally announced December 2021.
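The leave-one-out subject-independent evaluation described in the abstract above holds out all samples of one subject per fold and trains on the remaining subjects. A minimal sketch of such a split follows; the function name `leave_one_subject_out` and the integer subject labels are hypothetical, and this is not the authors' evaluation code.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids: one subject label per sample. Every sample of the held-out
    subject goes to the test set, so no subject appears in both sets.
    """
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

if __name__ == "__main__":
    # Six samples from three hypothetical subjects give three folds.
    for subject, train_idx, test_idx in leave_one_subject_out([0, 0, 1, 2, 2, 2]):
        print(subject, train_idx, test_idx)
```

The reported accuracy would then be the average of the per-fold test accuracies across all held-out subjects.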
-
How Speech is Recognized to Be Emotional - A Study Based on Information Decomposition
Authors:
Haoran Sun,
Lantian Li,
Thomas Fang Zheng,
Dong Wang
Abstract:
The way that humans encode their emotion into speech signals is complex. For instance, an angry man may increase his pitch and speaking rate, and use impolite words. In this paper, we present a preliminary study on various emotional factors and investigate how each of them impacts modern emotion recognition systems. The key tool of our study is the SpeechFlow model presented recently, by which we are able to decompose speech signals into separate information factors (content, pitch, rhythm). Based on this decomposition, we carefully studied the performance of each information component and their combinations. We conducted the study on three different speech emotion corpora and chose an attention-based convolutional RNN as the emotion classifier. Our results show that rhythm is the most important component for emotional expression. Moreover, the cross-corpus results are very poor (even worse than random guessing), demonstrating that the present speech emotion recognition model is rather weak. Interestingly, by removing one or several unimportant components, the cross-corpus results can be improved. This demonstrates the potential of the decomposition approach towards generalizable emotion recognition.
Submitted 24 November, 2021;
originally announced November 2021.
-
MoRe-Fi: Motion-robust and Fine-grained Respiration Monitoring via Deep-Learning UWB Radar
Authors:
Tianyue Zheng,
Zhe Chen,
Shujie Zhang,
Chao Cai,
Jun Luo
Abstract:
Crucial for healthcare and biomedical applications, respiration monitoring often employs wearable sensors in practice, causing inconvenience due to their direct contact with human bodies. Therefore, researchers have been constantly searching for contact-free alternatives. Nonetheless, existing contact-free designs mostly require human subjects to remain static, largely confining their adoptions in everyday environments where body movements are inevitable. Fortunately, radio-frequency (RF) enabled contact-free sensing, though suffering motion interference inseparable by conventional filtering, may offer a potential to distill respiratory waveform with the help of deep learning. To realize this potential, we introduce MoRe-Fi to conduct fine-grained respiration monitoring under body movements. MoRe-Fi leverages an IR-UWB radar to achieve contact-free sensing, and it fully exploits the complex radar signal for data augmentation. The core of MoRe-Fi is a novel variational encoder-decoder network; it aims to single out the respiratory waveforms that are modulated by body movements in a non-linear manner. Our experiments with 12 subjects and 66-hour data demonstrate that MoRe-Fi accurately recovers respiratory waveform despite the interference caused by body movements. We also discuss potential applications of MoRe-Fi for pulmonary disease diagnoses.
Submitted 15 November, 2021;
originally announced November 2021.
-
RF-Net: a Unified Meta-learning Framework for RF-enabled One-shot Human Activity Recognition
Authors:
Shuya Ding,
Zhe Chen,
Tianyue Zheng,
Jun Luo
Abstract:
Radio-Frequency (RF) based device-free Human Activity Recognition (HAR) rises as a promising solution for many applications. However, device-free (or contactless) sensing is often more sensitive to environment changes than device-based (or wearable) sensing. Also, RF datasets strictly require on-line labeling during collection, starkly different from image and text data collections where human interpretations can be leveraged to perform off-line labeling. Therefore, existing solutions to RF-HAR entail a laborious data collection process for adapting to new environments. To this end, we propose RF-Net as a meta-learning based approach to one-shot RF-HAR; it reduces the labeling efforts for environment adaptation to the minimum level. In particular, we first examine three representative RF sensing techniques and two major meta-learning approaches. The results motivate us to innovate in two designs: i) a dual-path base HAR network, where both time and frequency domains are dedicated to learning powerful RF features including spatial and attention-based temporal ones, and ii) a metric-based meta-learning framework to enhance the fast adaption capability of the base network, including an RF-specific metric module along with a residual classification module. We conduct extensive experiments based on all three RF sensing techniques in multiple real-world indoor environments; all results strongly demonstrate the efficacy of RF-Net compared with state-of-the-art baselines.
Submitted 28 October, 2021;
originally announced November 2021.
-
Enhancing RF Sensing with Deep Learning: A Layered Approach
Authors:
Tianyue Zheng,
Zhe Chen,
Shuya Ding,
Jun Luo
Abstract:
In recent years, radio frequency (RF) sensing has gained increasing popularity due to its pervasiveness, low cost, non-intrusiveness, and privacy preservation. However, realizing the promises of RF sensing is highly nontrivial, given typical challenges such as multipath and interference. One potential solution leverages deep learning to build direct mappings from the RF domain to target domains, hence avoiding complex RF physical modeling. While earlier solutions exploit only simple feature extraction and classification modules, an emerging trend adds functional layers on top of elementary modules for more powerful generalizability and flexible applicability. To better understand this potential, this article takes a layered approach to summarize RF sensing enabled by deep learning. Essentially, we present a four-layer framework: physical, backbone, generalization, and application. While this layered framework provides readers a systematic methodology for designing deep interpreted RF sensing, it also facilitates making improvement proposals and hints at future research opportunities.
Submitted 27 October, 2021;
originally announced October 2021.
-
V2iFi: in-Vehicle Vital Sign Monitoring via Compact RF Sensing
Authors:
Tianyue Zheng,
Zhe Chen,
Chao Cai,
Jun Luo,
Xu Zhang
Abstract:
Given the significant amount of time people spend in vehicles, health issues under driving condition have become a major concern. Such issues may vary from fatigue, asthma, stroke, to even heart attack, yet they can be adequately indicated by vital signs and abnormal activities. Therefore, in-vehicle vital sign monitoring can help us predict and hence prevent these issues. Whereas existing sensor-based (including camera) methods could be used to detect these indicators, privacy concern and system complexity both call for a convenient yet effective and robust alternative. This paper aims to develop V2iFi, an intelligent system performing monitoring tasks using a COTS impulse radio mounted on the windshield. V2iFi is capable of reliably detecting driver's vital signs under driving condition and with the presence of passengers, thus allowing for potentially inferring corresponding health issues. Compared with prior work based on Wi-Fi CSI, V2iFi is able to distinguish reflected signals from multiple users, and hence provide finer-grained measurements under more realistic settings. We evaluate V2iFi both in lab environments and during real-life road tests; the results demonstrate that respiratory rate, heart rate, and heart rate variability can all be estimated accurately. Based on these estimation results, we further discuss how machine learning models can be applied on top of V2iFi so as to improve both physiological and psychological wellbeing in driving environments.
Submitted 27 October, 2021;
originally announced October 2021.
-
RF-Based Human Activity Recognition Using Signal Adapted Convolutional Neural Network
Authors:
Zhe Chen,
Chao Cai,
Tianyue Zheng,
Jun Luo,
Jie Xiong,
Xin Wang
Abstract:
Human Activity Recognition (HAR) plays a critical role in a wide range of real-world applications, and it is traditionally achieved via wearable sensing. Recently, to avoid the burden and discomfort caused by wearable devices, device-free approaches exploiting RF signals arise as a promising alternative for HAR. Most of the latest device-free approaches require training a large deep neural network model in either time or frequency domain, entailing extensive storage to contain the model and intensive computations to infer activities. Consequently, even with some major advances on device-free HAR, current device-free approaches are still far from practical in real-world scenarios where the computation and storage resources possessed by, for example, edge devices, are limited. Therefore, we introduce HAR-SAnet which is a novel RF-based HAR framework. It adopts an original signal adapted convolutional neural network architecture: instead of feeding the handcraft features of RF signals into a classifier, HAR-SAnet fuses them adaptively from both time and frequency domains to design an end-to-end neural network model. We apply point-wise grouped convolution and depth-wise separable convolutions to confine the model scale and to speed up the inference execution time. The experiment results show that the recognition accuracy of HAR-SAnet outperforms state-of-the-art algorithms and systems.
Submitted 27 October, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
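The abstract above credits depth-wise separable convolutions with confining the model scale. As a rough illustration of why, the sketch below compares weight counts for a standard convolution and a depth-wise separable one. The layer sizes are hypothetical and are not taken from HAR-SAnet, and bias terms are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise conv followed by a 1x1 point-wise conv (bias ignored)."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
full = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(full, separable)
```

For this hypothetical layer the separable variant needs roughly one eighth of the weights, which is the kind of saving that matters on edge devices.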
-
SiWa: See into Walls via Deep UWB Radar
Authors:
Tianyue Zheng,
Zhe Chen,
Jun Luo,
Lin Ke,
Chaoyang Zhao,
Yaowen Yang
Abstract:
Being able to see into walls is crucial for diagnostics of building health; it enables inspections of wall structure without undermining the structural integrity. However, existing sensing devices do not seem to offer a full capability in mapping the in-wall structure while identifying their status (e.g., seepage and corrosion). In this paper, we design and implement SiWa as a low-cost and portable system for wall inspections. Built upon a customized IR-UWB radar, SiWa scans a wall as a user swipes its probe along the wall surface; it then analyzes the reflected signals to synthesize an image and also to identify the material status. Although conventional schemes exist to handle these problems individually, they require troublesome calibrations that largely prevent them from practical adoptions. To this end, we equip SiWa with a deep learning pipeline to parse the rich sensory data. With an ingenious construction and innovative training, the deep learning modules perform structural imaging and the subsequent analysis on material status, without the need for parameter tuning and calibrations. We build SiWa as a prototype and evaluate its performance via extensive experiments and field studies; results confirm that SiWa accurately maps in-wall structures, identifies their materials, and detects possible failures, suggesting a promising solution for diagnosing building health with lower effort and cost.
Submitted 27 October, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
-
A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing
Authors:
Wei Liu,
Meng Sun,
Xiongwei Zhang,
Hugo Van hamme,
Thomas Fang Zheng
Abstract:
The choice of an optimal time-frequency resolution is usually a difficult but important step in tasks involving speech signal classification, e.g., speech anti-spoofing. The variations of the performance with different choices of time-frequency resolutions can be as large as those with different model architectures, which makes it difficult to judge what the improvement actually comes from when a new network architecture is invented and introduced as the classifier. In this paper, we propose a multi-resolution front-end for feature extraction in an end-to-end classification framework. Optimal weighted combinations of multiple time-frequency resolutions will be learned automatically given the objective of a classification task. Features extracted with different time-frequency resolutions are weighted and concatenated as inputs to the successive networks, where the weights are predicted by a learnable neural network inspired by the weighting block in squeeze-and-excitation networks (SENet). Furthermore, the refinement of the chosen time-frequency resolutions is investigated by pruning the ones with relatively low importance, which reduces the complexity and size of the model. The proposed method is evaluated on the tasks of speech anti-spoofing in ASVSpoof 2019 and its superiority has been justified by comparing with similar baselines.
Submitted 11 October, 2021;
originally announced October 2021.
-
Backstepping Mean-Field Density Control for Large-Scale Heterogeneous Nonlinear Stochastic Systems
Authors:
Tongjia Zheng,
Qing Han,
Hai Lin
Abstract:
This work studies the problem of controlling the mean-field density of large-scale stochastic systems, which has applications in various fields such as swarm robotics. Recently, there is a growing amount of literature that employs mean-field partial differential equations (PDEs) to model the density evolution and uses density feedback to design control laws which, by acting on individual systems, stabilize their density towards a target profile. In spite of its stability property and computational efficiency, the success of density feedback relies on assuming the systems to be homogeneous first-order integrators (plus white noise) and ignores higher-order dynamics, making it less applicable in practice. In this work, we present a backstepping design algorithm that extends density control to heterogeneous and higher-order stochastic systems in strict-feedback forms. We show that the strict-feedback form in the individual level corresponds to, in the collective level, a PDE (of densities) distributedly driven by a collection of heterogeneous stochastic systems. The presented backstepping design then starts with a density feedback design for the PDE, followed by a sequence of stabilizing design for the remaining stochastic systems. We present a candidate control law with stability proof and apply it to nonholonomic mobile robots. A simulation is included to verify the effectiveness of the algorithm.
Submitted 24 March, 2022; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Physical Layer Security for NOMA-Enabled Multi-Access Edge Computing Wireless Networks
Authors:
Yating Wen,
Tong-Xing Zheng,
Yongxia Tong,
Hao-Wen Liu,
Xin Chen,
Pengcheng Mu,
Hui-Ming Wang
Abstract:
Multi-access edge computing (MEC) has been regarded as a promising technique for enhancing computation capabilities for wireless networks. In this paper, we study physical layer security in an MEC system where multiple users offload part of their computation tasks to a base station simultaneously based on non-orthogonal multiple access (NOMA), in the presence of a malicious eavesdropper. Secrecy outage probability is adopted to measure the security performance of the computation offloading against eavesdropping attacks. We aim to minimize the sum energy consumption of all the users, subject to constraints in terms of the secrecy offloading rate, the secrecy outage probability, and the decoding order of NOMA. Although the original optimization problem is non-convex and challenging to solve, we put forward an efficient algorithm based on sequential convex approximation and penalty dual decomposition. Numerical results are eventually provided to validate the convergence of the proposed algorithm and its superior energy efficiency with secrecy requirements.
Submitted 2 July, 2021;
originally announced July 2021.
-
Distributed Mean-Field Density Estimation for Large-Scale Systems
Authors:
Tongjia Zheng,
Qing Han,
Hai Lin
Abstract:
This work studies how to estimate the mean-field density of large-scale systems in a distributed manner. Such problems are motivated by the recent swarm control technique that uses mean-field approximations to represent the collective effect of the swarm, wherein the mean-field density (especially its gradient) is usually used in feedback control design. In the first part, we formulate the density estimation problem as a filtering problem of the associated mean-field partial differential equation (PDE), for which we employ kernel density estimation (KDE) to construct noisy observations and use filtering theory of PDE systems to design an optimal (centralized) density filter. It turns out that the covariance operator of observation noise depends on the unknown density. Hence, we use approximations for the covariance operator to obtain a suboptimal density filter, and prove that both the density estimates and their gradient are convergent and remain close to the optimal one using the notion of input-to-state stability (ISS). In the second part, we continue to study how to decentralize the density filter such that each agent can estimate the mean-field density based on only its own position and local information exchange with neighbors. We prove that the local density filter is also convergent and remains close to the centralized one in the sense of ISS. Simulation results suggest that the centralized suboptimal density filter is able to generate convergent density estimates, and the local density filter is able to converge and remain close to the centralized filter.
Submitted 10 October, 2021; v1 submitted 9 June, 2021;
originally announced June 2021.
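The abstract above uses kernel density estimation (KDE) to turn agent positions into noisy density observations. The sketch below shows only the generic textbook KDE step in one dimension with a Gaussian kernel. It is not the authors' filter, and the sample positions and bandwidth are made-up values chosen for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump of width `bandwidth` centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical agent positions on a line.
positions = [-1.0, 0.0, 1.0]
density = gaussian_kde(positions, bandwidth=0.5)
print(density(0.0))
```

The bandwidth controls the smoothness of the estimate: a smaller value follows individual samples more closely, while a larger value blurs them together.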
-
Feedback Interconnected Mean-Field Density Estimation and Control
Authors:
Tongjia Zheng,
Qing Han,
Hai Lin
Abstract:
Swarm robotic systems have foreseeable applications in the near future. Recently, there has been an increasing amount of literature that employs mean-field partial differential equations (PDEs) to model the time-evolution of the probability density of swarm robotic systems and uses density feedback to design stabilizing control laws that act on individuals such that their density converges to a target profile. However, it remains largely unexplored considering problems of how to estimate the mean-field density, how the density estimation algorithms affect the control performance, and whether the estimation performance in turn depends on the control algorithms. In this work, we focus on studying the interplay of these algorithms. Specifically, we propose new density control laws which use the mean-field density and its gradient as feedback, and prove that they are globally input-to-state stable (ISS) with respect to estimation errors. Then, we design filtering algorithms to estimate the density and its gradient separately, and prove that these estimates are convergent assuming the control laws are known. Finally, we show that the feedback interconnection of these estimation and control algorithms is still globally ISS, which is attributed to the bilinearity of the PDE system. An agent-based simulation is included to verify the stability of these algorithms and their feedback interconnection.
Submitted 23 March, 2022; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Field Estimation using Robotic Swarms through Bayesian Regression and Mean-Field Feedback
Authors:
Tongjia Zheng,
Hai Lin
Abstract:
Recent years have seen an increased interest in using mean-field density based modelling and control strategy for deploying robotic swarms. In this paper, we study how to dynamically deploy the robots subject to their physical constraints to efficiently measure and reconstruct certain unknown spatial field (e.g. the air pollution index over a city). Specifically, the evolution of the robots' density is modelled by mean-field partial differential equations (PDEs) which are uniquely determined by the robots' individual dynamics. Bayesian regression models are used to obtain predictions and return a variance function that represents the confidence of the prediction. We formulate a PDE constrained optimization problem based on this variance function to dynamically generate a reference density signal which guides the robots to uncertain areas to collect new data, and design mean-field feedback-based control laws such that the robots' density converges to this reference signal. We also show that the proposed feedback law is robust to density estimation errors in the sense of input-to-state stability. Simulations are included to verify the effectiveness of the algorithms.
Submitted 1 June, 2021;
originally announced June 2021.
-
Inner Approximations of the Positive-Semidefinite Cone via Grassmannian Packings
Authors:
Tianqi Zheng,
James Guthrie,
Enrique Mallada
Abstract:
We investigate the problem of finding inner approximations of positive semidefinite (PSD) cones. We develop a novel decomposition framework of the PSD cone by means of conical combinations of smaller dimensional sub-cones. We show that many inner approximation techniques could be summarized within this framework, including the set of (scaled) diagonally dominant matrices, Factor-width k matrices, and Chordal Sparse matrices. Furthermore, we provide a more flexible family of inner approximations of the PSD cone, where we aim to arrange the sub-cones so that they are maximally separated from each other. In doing so, these approximations tend to occupy large fractions of the volume of the PSD cone. The proposed approach is connected to a classical packing problem in Riemannian Geometry. Precisely, we show that the problem of finding maximally distant sub-cones in an ambient PSD cone is equivalent to the problem of packing sub-spaces in a Grassmannian Manifold. We further leverage existing computational methods for constructing packings in Grassmannian manifolds to build tighter approximations of the PSD cone. Numerical experiments show how the proposed framework can balance between accuracy and computational complexity, to efficiently solve positive-semidefinite programs.
Submitted 30 September, 2021; v1 submitted 25 May, 2021;
originally announced May 2021.
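Among the inner approximations the abstract above lists, diagonally dominant matrices are the simplest to check: a symmetric matrix whose diagonal entries are at least the sum of the absolute off-diagonal entries in each row is PSD, by Gershgorin's circle theorem. The sketch below tests that sufficient condition on a small example. The matrices are illustrative and the helper names are hypothetical. It does not implement the paper's Grassmannian-packing construction.

```python
import random


def is_diagonally_dominant(a):
    """True if `a` is symmetric and a[i][i] >= sum_{j != i} |a[i][j]| in every row."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            if a[i][j] != a[j][i]:
                return False
        off_diagonal = sum(abs(a[i][j]) for j in range(n) if j != i)
        if a[i][i] < off_diagonal:
            return False
    return True


def quadratic_form(a, x):
    """Evaluate x^T a x."""
    n = len(a)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


# Illustrative matrix: symmetric and diagonally dominant, hence PSD.
a = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
print(is_diagonally_dominant(a))

# Spot-check x^T a x >= 0 on random vectors.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(3)]
print(quadratic_form(a, x) >= 0.0)
```

The check is only sufficient, not necessary: many PSD matrices are not diagonally dominant, which is why the set is an inner approximation of the cone.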
-
Attack on practical speaker verification system using universal adversarial perturbations
Authors:
Weiyi Zhang,
Shuning Zhao,
Le Liu,
Jianmin Li,
Xingliang Cheng,
Thomas Fang Zheng,
Xiaolin Hu
Abstract:
In authentication scenarios, applications of practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate source when the adversary is speaking, the practical speaker verification system will misjudge the adversary as a target speaker. A two-step algorithm is proposed to optimize the universal adversarial perturbation to be text-independent and has little effect on the authentication text recognition. We also estimated room impulse response (RIR) in the algorithm which allowed the perturbation to be effective after being played over the air. In the physical experiment, we achieved targeted attacks with success rate of 100%, while the word error rate (WER) on speech recognition was only increased by 3.55%. And recorded audios could pass replay detection for the live person speaking.
Submitted 19 May, 2021;
originally announced May 2021.
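The abstract above reports its side effect on speech recognition as a change in word error rate (WER). For readers unfamiliar with the metric, the sketch below computes WER in the usual way, as the word-level edit distance divided by the number of reference words. The example sentences are made up and the function name is hypothetical. It is not the evaluation code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one of four reference words is dropped.
print(word_error_rate("open the door now", "open door now"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.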
-
CN-Celeb: multi-genre speaker recognition
Authors:
Lantian Li,
Ruiqi Liu,
Jiawen Kang,
Yue Fan,
Hao Cui,
Yunqi Cai,
Ravichander Vipperla,
Thomas Fang Zheng,
Dong Wang
Abstract:
Research on speaker recognition is extending to address the vulnerability in the wild conditions, among which genre mismatch is perhaps the most challenging, for instance, enrollment with reading speech while testing with conversational or singing audio. This mismatch leads to complex and composite inter-session variations, both intrinsic (i.e., speaking style, physiological status) and extrinsic (i.e., recording device, background noise). Unfortunately, the few existing multi-genre corpora are not only limited in size but are also recorded under controlled conditions, which cannot support conclusive research on the multi-genre problem. In this work, we firstly publish CN-Celeb, a large-scale multi-genre corpus that includes in-the-wild speech utterances of 3,000 speakers in 11 different genres. Secondly, using this dataset, we conduct a comprehensive study on the multi-genre phenomenon, in particular the impact of the multi-genre challenge on speaker recognition and the performance gain when the new dataset is used to conduct multi-genre training.
Submitted 24 November, 2021; v1 submitted 22 December, 2020;
originally announced December 2020.
-
Squeezing value of cross-domain labels: a decoupled scoring approach for speaker verification
Authors:
Lantian Li,
Yang Zhang,
Jiawen Kang,
Thomas Fang Zheng,
Dong Wang
Abstract:
Domain mismatch often occurs in real applications and causes serious performance reduction on speaker verification systems. The common wisdom is to collect cross-domain data and train a multi-domain PLDA model, with the hope to learn a domain-independent speaker subspace. In this paper, we firstly present an empirical study to show that simply adding cross-domain data does not help performance in conditions with enrollment-test mismatch. Careful analysis shows that this striking result is caused by the incoherent statistics between the enrollment and test conditions. Based on this analysis, we present a decoupled scoring approach that can maximally squeeze the value of cross-domain labels and obtain optimal verification scores when the enrollment and test are mismatched. When the statistics are coherent, the new formulation falls back to the conventional PLDA. Experimental results on cross-channel test show that the proposed approach is highly effective and is a principle solution to domain mismatch.
Submitted 27 October, 2020;
originally announced October 2020.
-
Deep generative factorization for speech signal
Authors:
Haoran Sun,
Lantian Li,
Yunqi Cai,
Yang Zhang,
Thomas Fang Zheng,
Dong Wang
Abstract:
Various information factors are blended in speech signals, which forms the primary difficulty for most speech information processing tasks. An intuitive idea is to factorize speech signal into individual information factors (e.g., phonetic content and speaker trait), though it turns out to be highly challenging. This paper presents a speech factorization approach based on a novel factorial discriminative normalization flow model (factorial DNF). Experiments conducted on a two-factor case that involves phonetic content and speaker trait demonstrates that the proposed factorial DNF has powerful capability to factorize speech signals and outperforms several comparative models in terms of information representation and manipulation.
Submitted 27 October, 2020;
originally announced October 2020.
-
Micro CT Image-Assisted Cross Modality Super-Resolution of Clinical CT Images Utilizing Synthesized Training Dataset
Authors:
Tong Zheng,
Hirohisa Oda,
Masahiro Oda,
Shota Nakamura,
Masaki Mori,
Hirotsugu Takabatake,
Hiroshi Natori,
Kensaku Mori
Abstract:
This paper proposes a novel, unsupervised super-resolution (SR) approach for performing the SR of a clinical CT into the resolution level of a micro CT ($μ$CT). The precise non-invasive diagnosis of lung cancer typically utilizes clinical CT data. Due to the resolution limitations of clinical CT (about $0.5 \times 0.5 \times 0.5$ mm$^3$), it is difficult to obtain enough pathological information such as the invasion area at alveoli level. On the other hand, $μ$CT scanning allows the acquisition of volumes of lung specimens with much higher resolution ($50 \times 50 \times 50 μ{\rm m}^3$ or higher). Thus, super-resolution of clinical CT volume may be helpful for diagnosis of lung cancer. Typical SR methods require aligned pairs of low-resolution (LR) and high-resolution (HR) images for training. Unfortunately, obtaining paired clinical CT and $μ$CT volumes of human lung tissues is infeasible. Unsupervised SR methods are required that do not need paired LR and HR images. In this paper, we create corresponding clinical CT-$μ$CT pairs by simulating clinical CT images from $μ$CT images by modified CycleGAN. After this, we use simulated clinical CT-$μ$CT image pairs to train an SR network based on SRGAN. Finally, we use the trained SR network to perform SR of the clinical CT images. We compare our proposed method with another unsupervised SR method for clinical CT images named SR-CycleGAN. Experimental results demonstrate that the proposed method can successfully perform SR of clinical CT images of lung cancer patients with $μ$CT level resolution, and that it quantitatively and qualitatively outperforms the conventional method (SR-CycleGAN), improving the SSIM (structure similarity) from 0.40 to 0.51.
Submitted 20 October, 2020;
originally announced October 2020.
-
When Automatic Voice Disguise Meets Automatic Speaker Verification
Authors:
Linlin Zheng,
Jiakang Li,
Meng Sun,
Xiongwei Zhang,
Thomas Fang Zheng
Abstract:
The technique of transforming voices in order to hide the real identity of a speaker is called voice disguise, among which automatic voice disguise (AVD) by modifying the spectral and temporal characteristics of voices with miscellaneous algorithms is easily conducted with software accessible to the public. AVD has posed great threat to both human listening and automatic speaker verification (ASV). In this paper, we have found that ASV is not only a victim of AVD but could be a tool to beat some simple types of AVD. Firstly, three types of AVD, pitch scaling, vocal tract length normalization (VTLN) and voice conversion (VC), are introduced as representative methods. State-of-the-art ASV methods are subsequently utilized to objectively evaluate the impact of AVD on ASV by equal error rates (EER). Moreover, an approach to restore disguised voice to its original version is proposed by minimizing a function of ASV scores w.r.t. restoration parameters. Experiments are then conducted on disguised voices from Voxceleb, a dataset recorded in real-world noisy scenario. The results have shown that, for the voice disguise by pitch scaling, the proposed approach obtains an EER around 7% compared to the 30% EER of a recently proposed baseline using the ratio of fundamental frequencies. The proposed approach generalizes well to restore the disguise with nonlinear frequency warping in VTLN by reducing its EER from 34.3% to 18.5%. However, it is difficult to restore the source speakers in VC by our approach, where more complex forms of restoration functions or other paralinguistic cues might be necessary to restore the nonlinear transform in VC. Finally, contrastive visualization on ASV features with and without restoration illustrates the role of the proposed approach in an intuitive way.
Submitted 15 September, 2020;
originally announced September 2020.
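The abstract above measures the impact of voice disguise by the equal error rate (EER), the operating point where the false acceptance rate and the false rejection rate coincide. The sketch below approximates it by sweeping every observed score as a threshold. The scores are invented for illustration and the function name is hypothetical. It is not the scoring tool used in the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold), accepting a trial when score >= threshold.

    The EER is approximated at the threshold where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Invented scores: one genuine trial scores below one impostor trial.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))
```

With finite score lists FAR and FRR rarely meet exactly, so the sketch reports their average at the closest threshold. Production toolkits typically interpolate instead.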