-
Efficient Economic Model Predictive Control of Water Treatment Process with Learning-based Koopman Operator
Authors:
Minghao Han,
Jingshi Yao,
Adrian Wing-Keung Law,
Xunyuan Yin
Abstract:
Used water treatment plays a pivotal role in advancing environmental sustainability. Economic model predictive control holds the promise of enhancing the overall operational performance of the water treatment facilities. In this study, we propose a data-driven economic predictive control approach within the Koopman modeling framework. First, we propose a deep learning-enabled input-output Koopman modeling approach, which predicts the overall economic operational cost of the wastewater treatment process based on input data and available output measurements that are directly linked to the operational costs. Subsequently, by leveraging this learned input-output Koopman model, a convex economic predictive control scheme is developed. The resulting predictive control problem can be efficiently solved by leveraging quadratic programming solvers, and complex non-convex optimization problems are bypassed. The proposed method is applied to a benchmark wastewater treatment process. The proposed method significantly improves the overall economic operational performance of the water treatment process. Additionally, the computational efficiency of the proposed method is significantly enhanced as compared to benchmark control solutions.
Submitted 20 May, 2024;
originally announced May 2024.
-
HILCodec: High Fidelity and Lightweight Neural Audio Codec
Authors:
Sunghwan Ahn,
Beom Jun Woo,
Min Hyun Han,
Chanyeong Moon,
Nam Soo Kim
Abstract:
The recent advancement of end-to-end neural audio codecs enables compressing audio at very low bitrates while reconstructing the output audio with high fidelity. Nonetheless, such improvements often come at the cost of increased model complexity. In this paper, we identify and address the problems of existing neural audio codecs. We show that the performance of Wave-U-Net does not increase consistently as the network depth increases. We analyze the root cause of such a phenomenon and suggest a variance-constrained design. Also, we reveal various distortions in previous waveform domain discriminators and propose a novel distortion-free discriminator. The resulting model, HILCodec, is a real-time streaming audio codec that demonstrates state-of-the-art quality across various bitrates and audio types.
Submitted 7 May, 2024;
originally announced May 2024.
-
Reduced-order Koopman modeling and predictive control of nonlinear processes
Authors:
Xuewen Zhang,
Minghao Han,
Xunyuan Yin
Abstract:
In this paper, we propose an efficient data-driven predictive control approach for general nonlinear processes based on a reduced-order Koopman operator. A Kalman-based sparse identification of nonlinear dynamics method is employed to select lifting functions for Koopman identification. The selected lifting functions are used to project the original nonlinear state-space into a higher-dimensional linear function space, in which Koopman-based linear models can be constructed for the underlying nonlinear process. To curb the significant increase in the dimensionality of the resulting full-order Koopman models caused by the use of lifting functions, we propose a reduced-order Koopman modeling approach based on proper orthogonal decomposition. A computationally efficient linear robust predictive control scheme is established based on the reduced-order Koopman model. A case study on a benchmark chemical process is conducted to illustrate the effectiveness of the proposed method. Comprehensive comparisons are conducted to demonstrate the advantage of the proposed method.
Submitted 30 March, 2024;
originally announced April 2024.
-
Towards 3D Vision with Low-Cost Single-Photon Cameras
Authors:
Fangzhou Mu,
Carter Sifferman,
Sacha Jungerman,
Yiquan Li,
Mark Han,
Michael Gleicher,
Mohit Gupta,
Yin Li
Abstract:
We present a method for reconstructing the 3D shape of arbitrary Lambertian objects based on measurements by miniature, energy-efficient, low-cost single-photon cameras. These cameras, operating as time-resolved image sensors, illuminate the scene with a very fast pulse of diffuse light and record the shape of that pulse as it returns from the scene at a high temporal resolution. We propose to model this image formation process, account for its non-idealities, and adapt neural rendering to reconstruct 3D geometry from a set of spatially distributed sensors with known poses. We show that our approach can successfully recover complex 3D shapes from simulated data. We further demonstrate 3D object reconstruction from real-world captures, utilizing measurements from a commodity proximity sensor. Our work draws a connection between image-based modeling and active range scanning and is a step towards 3D vision with single-photon cameras.
Submitted 29 March, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Bidirectional Autoregressive Diffusion Model for Dance Generation
Authors:
Canyu Zhang,
Youbao Tang,
Ning Zhang,
Ruei-Sung Lin,
Mei Han,
Jing Xiao,
Song Wang
Abstract:
Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create entire motion sequences directly and unidirectionally, lacking focus on the motion with local and bidirectional enhancement. When choreographing high-quality dance movements, people need to take into account not only the musical context but also the nearby music-aligned dance motions. To authentically capture human behavior, we propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation, where a bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions. To make the generated dance motion smoother, a local information decoder is built for local motion enhancement. The proposed framework is able to generate new motions based on the input conditions and nearby motions, which foresees individual motion slices iteratively and consolidates all predictions. To further refine the synchronicity between the generated dance and the beat, the beat information is incorporated as an input to generate better music-aligned dance movements. Experimental results demonstrate that the proposed model achieves state-of-the-art performance compared to existing unidirectional approaches on the prominent benchmark for music-to-dance generation.
Submitted 22 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings
Authors:
Sung Hwan Mun,
Min Hyun Han,
Chanyeong Moon,
Nam Soo Kim
Abstract:
In recent years, there have been studies to further improve end-to-end neural speaker diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel framework utilizing demultiplexed speaker embeddings. In this work, we focus on disentangling speaker-relevant information in the latent space and then transforming each separated latent variable into its corresponding speech activity. EEND-DEMUX can directly obtain separated speaker embeddings through the demultiplexing operation in the inference phase without an external speaker diarization system, an embedding extractor, or a heuristic decoding technique. Furthermore, we employ a multi-head cross-attention mechanism to capture the correlation between mixture and separated speaker embeddings effectively. We formulate three loss functions based on matching, orthogonality, and sparsity constraints to learn robust demultiplexed speaker embeddings. The experimental results on the LibriMix dataset show consistently improved performance in scenarios with both fixed and flexible numbers of speakers.
Submitted 10 December, 2023;
originally announced December 2023.
-
Deep Reinforcement Learning-driven Cross-Community Energy Interaction Optimal Scheduling
Authors:
Yang Li,
Wenjie Ma,
Fan** Bu,
Zhen Yang,
Bin Wang,
Meng Han
Abstract:
In order to coordinate energy interactions among various communities and energy conversions among multi-energy subsystems within the multi-community integrated energy system under uncertain conditions, and achieve overall optimization and scheduling of the comprehensive energy system, this paper proposes a comprehensive scheduling model that utilizes a multi-agent deep reinforcement learning algorithm to learn load characteristics of different communities and make decisions based on this knowledge. In this model, the scheduling problem of the integrated energy system is transformed into a Markov decision process and solved using a data-driven deep reinforcement learning algorithm, which avoids the need for modeling complex energy coupling relationships between multi-communities and multi-energy subsystems. The simulation results show that the proposed method effectively captures the load characteristics of different communities and utilizes their complementary features to coordinate reasonable energy interactions among them. This leads to a reduction in wind curtailment rate from 16.3% to 0% and lowers the overall operating cost by 5445.6 Yuan, demonstrating significant economic and environmental benefits.
Submitted 2 September, 2023; v1 submitted 24 August, 2023;
originally announced August 2023.
-
On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer
Authors:
Md Asif Jalal,
Pablo Peso Parada,
Jisi Zhang,
Karthikeyan Saravanan,
Mete Ozay,
Myoungji Han,
Jung In Lee,
Seokyeong Jung
Abstract:
Smart devices serviced by large-scale AI models necessitate user data transfer to the cloud for inference. For speech applications, this means transferring private user information, e.g., speaker identity. Our paper proposes a privacy-enhancing framework that targets speaker identity anonymization while preserving speech recognition accuracy for our downstream task, Automatic Speech Recognition (ASR). The proposed framework attaches flexible gradient reversal based speaker adversarial layers to target layers within an ASR model, where speaker adversarial training anonymizes acoustic embeddings generated by the targeted layers to remove speaker identity. We propose on-device deployment by executing the initial layers of the ASR model and transmitting anonymized embeddings to the cloud, where the rest of the model is executed while preserving privacy. Experimental results show that our method efficiently reduces speaker recognition relative accuracy by 33%, and improves ASR performance by achieving 6.2% relative Word Error Rate (WER) reduction.
Submitted 25 July, 2023;
originally announced July 2023.
-
ViLaS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition
Authors:
Ziyi Ni,
Minglun Han,
Feilong Chen,
Linghui Meng,
Jing Shi,
Pin Lv,
Bo Xu
Abstract:
Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit ASR in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately, to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, empirical results are reported on the public Flickr8K and self-constructed VSDial datasets. We explore various cross-modal fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition.
Submitted 18 December, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Towards single integrated spoofing-aware speaker verification embeddings
Authors:
Sung Hwan Mun,
Hye-jin Shim,
Hemlata Tak,
Xin Wang,
Xuechen Liu,
Md Sahidullah,
Myeonghun Jeong,
Min Hyun Han,
Massimiliano Todisco,
Kong Aik Lee,
Junichi Yamagishi,
Nicholas Evans,
Tomi Kinnunen,
Nam Soo Kim,
Jee-weon Jung
Abstract:
This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. Our analysis indicates that the inferior performance of single SASV embeddings comes from an insufficient amount of training data and the distinct nature of the ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
Submitted 1 June, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
Realistic Noise Synthesis with Diffusion Models
Authors:
Qi Wu,
Mingyan Han,
Ting Jiang,
Haoqiang Fan,
Bing Zeng,
Shuaicheng Liu
Abstract:
Deep image denoising models often rely on a large amount of training data for high-quality performance. However, it is challenging to obtain a sufficient amount of data under real-world scenarios for supervised training. As such, synthesizing realistic noise becomes an important solution. However, existing techniques have limitations in modeling complex noise distributions, resulting in residual noise and edge artifacts in denoising methods relying on synthetic data. To overcome these challenges, we propose a novel method that synthesizes realistic noise using diffusion models, namely Realistic Noise Synthesize Diffusor (RNSD). In particular, the proposed time-aware controlling module can simulate various environmental conditions under given camera settings. RNSD can incorporate guided multiscale content, such that more realistic noise with spatial correlations can be generated at multiple frequencies. In addition, we construct an inversion mechanism to predict the unknown camera setting, which enables the extension of RNSD to datasets without setting information. Extensive experiments demonstrate that our RNSD method significantly outperforms the existing methods not only in the synthesized noise under multiple realism metrics, but also in single image denoising performance.
Submitted 3 November, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Deep Learning-based Data-aided Activity Detection with Extraction Network in Grant-free Sparse Code Multiple Access Systems
Authors:
Minsig Han,
Ameha T. Abebe,
Chung G. Kang
Abstract:
This letter proposes a deep learning-based data-aided active user detection network (D-AUDN) for grant-free sparse code multiple access (SCMA) systems that leverages both SCMA codebook and Zadoff-Chu preamble for activity detection. Due to disparate data and preamble distribution as well as codebook collision, existing D-AUDNs experience performance degradation when multiple preambles are associated with each codebook. To address this, a user activity extraction network (UAEN) is integrated within the D-AUDN to extract a-priori activity information from the codebook, improving activity detection of the associated preambles. Additionally, efficient SCMA codebook design and Zadoff-Chu preamble association are considered to further enhance performance.
Submitted 19 May, 2023; v1 submitted 13 May, 2023;
originally announced May 2023.
-
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Authors:
Feilong Chen,
Minglun Han,
Haozhi Zhao,
Qingyang Zhang,
Jing Shi,
Shuang Xu,
Bo Xu
Abstract:
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and inputs them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes multi-modalities such as image, speech, and videos, and "L" denotes languages. X-LLM's training consists of three stages: (1) Converting Multimodal Information: The first stage trains each X2L interface to align with its respective single-modal encoder separately to convert multimodal information into languages. (2) Aligning X2L representations with the LLM: single-modal encoders are aligned with the LLM through X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 84.5% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using the LLM for ASR and multimodal ASR, hoping to promote the era of LLM-based speech recognition.
Submitted 21 May, 2023; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Experimental Validation of Coherent Joint Transmission in a Distributed-MIMO System with Analog Fronthaul for 6G
Authors:
Rafael Puerta,
Mahdieh Joharifar,
Mengyao Han,
Anders Djupsjöbacka,
Vjaceslavs Bobrovs,
Sergei Popov,
Oskars Ozolins,
Xiaodan Pang
Abstract:
The sixth-generation (6G) mobile networks must increase coverage and improve spectral efficiency, especially for cell-edge users. Distributed multiple-input multiple-output (D-MIMO) networks can fulfill these requirements provided that transmission/reception points (TRxPs) of the network can be synchronized with sub-nanosecond precision; however, synchronization with current backhaul and fronthaul digital interfaces is challenging. For 6G new services and scenarios, analog radio-over-fiber (ARoF) is a prospective alternative for future mobile fronthaul where current solutions fall short of fulfilling future demands on bandwidth, synchronization, and/or power consumption. This paper presents an experimental validation of coherent joint transmissions (CJTs) in a two-TRxP D-MIMO network where ARoF fronthaul links make it possible to meet the required level of synchronization. Results show that by means of CJT a combined diversity and power gain of +5 dB is realized in comparison with a single TRxP transmission.
Submitted 4 May, 2023;
originally announced May 2023.
-
NR Conformance Testing of Analog Radio-over-LWIR FSO Fronthaul link for 6G Distributed MIMO Networks
Authors:
Rafael Puerta,
Mengyao Han,
Mahdieh Joharifar,
Richard Schatz,
Yan-Ting Sun,
Yuchuan Fan,
Anders Djupsjöbacka,
Grégory Maisons,
Johan Abautret,
Roland Teissier,
Lu Zhang,
Sandis Spolitis,
Muguang Wang,
Vjaceslavs Bobrovs,
Sebastian Lourdudoss,
Xianbin Yu,
Sergei Popov,
Oskars Ozolins,
Xiaodan Pang
Abstract:
We experimentally test the compliance with 5G/NR 3GPP technical specifications of an analog radio-over-FSO link at 9 μm. The ACLR and EVM transmitter requirements are fulfilled validating the suitability of LWIR FSO for 6G fronthaul.
Submitted 9 February, 2023;
originally announced February 2023.
-
Knowledge Transfer from Pre-trained Language Models to CIF-based Speech Recognizers via Hierarchical Distillation
Authors:
Minglun Han,
Feilong Chen,
Jing Shi,
Shuang Xu,
Bo Xu
Abstract:
Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks. Leveraging the capabilities of PLMs to enhance automatic speech recognition (ASR) systems has also emerged as a promising research direction. However, previous works may be limited by the inflexible structures of PLMs and the insufficient utilization of PLMs. To alleviate these problems, we propose the hierarchical knowledge distillation (HKD) on the continuous integrate-and-fire (CIF) based ASR models. To transfer knowledge from PLMs to the ASR models, HKD employs cross-modal knowledge distillation with contrastive loss at the acoustic level and knowledge distillation with regression loss at the linguistic level. Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reduction on the AISHELL-1 and LibriSpeech datasets, respectively.
Submitted 28 May, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Data-Driven Distributionally Robust Scheduling of Community Integrated Energy Systems with Uncertain Renewable Generations Considering Integrated Demand Response
Authors:
Yang Li,
Meng Han,
Mohammad Shahidehpour,
Jiazheng Li,
Chao Long
Abstract:
A community integrated energy system (CIES) is an important carrier of the energy internet and smart city in geographical and functional terms. Its emergence provides a new solution to the problems of energy utilization and environmental pollution. To coordinate the integrated demand response and uncertainty of renewable energy generation (RGs), a data-driven two-stage distributionally robust optimization (DRO) model is constructed. A comprehensive norm consisting of the 1-norm and infinity-norm is used as the uncertainty probability distribution information set, thereby avoiding complex probability density information. To address multiple uncertainties of RGs, a generative adversarial network based on the Wasserstein distance with gradient penalty is proposed to generate RG scenarios, which has wide applicability. To further tap the potential of the demand response, we take into account the ambiguity of human thermal comfort and the thermal inertia of buildings. Thus, an integrated demand response mechanism is developed that effectively promotes the consumption of renewable energy. The proposed method is simulated in an actual CIES in North China. In comparison with traditional stochastic programming and robust optimization, it is verified that the proposed DRO model properly balances the relationship between economical operation and robustness while exhibiting stronger adaptability. Furthermore, our approach outperforms other commonly used DRO methods with better operational economy, lower renewable power curtailment rate, and higher computational efficiency.
Submitted 27 January, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
Physics-informed Deep Diffusion MRI Reconstruction with Synthetic Data: Break Training Data Bottleneck in Artificial Intelligence
Authors:
Chen Qian,
Yuncheng Gao,
Mingyang Han,
Zi Wang,
Dan Ruan,
Yu Shen,
Ya** Wu,
Yirong Zhou,
Chengyan Wang,
Boyu Jiang,
Ran Tao,
Zhigang Wu,
Jiazheng Wang,
Liuhong Zhu,
Yi Guo,
Taishan Kang,
Jianzhong Lin,
Tao Gong,
Chen Yang,
Guoqiang Fei,
Mei** Lin,
Di Guo,
Jianjun Zhou,
Meiyun Wang,
Xiaobo Qu
Abstract:
Diffusion magnetic resonance imaging (MRI) is the only imaging modality for non-invasive movement detection of in vivo water molecules, with significant clinical and research applications. Diffusion MRI (DWI) acquired by multi-shot techniques can achieve higher resolution, better signal-to-noise ratio, and lower geometric distortion than single-shot, but suffers from inter-shot motion-induced artifacts. These artifacts cannot be removed prospectively, leading to the absence of artifact-free training labels. Thus, the potential of deep learning in multi-shot DWI reconstruction remains largely untapped. To break the training data bottleneck, here, we propose a Physics-Informed Deep DWI reconstruction method (PIDD) to synthesize high-quality paired training data by leveraging the physical diffusion model (magnitude synthesis) and inter-shot motion-induced phase model (motion phase synthesis). The network is trained only once with 100,000 synthetic samples, achieving encouraging results on multiple realistic in vivo data reconstructions. Advantages over conventional methods include: (a) Better motion artifact suppression and reconstruction stability; (b) Outstanding generalization to multi-scenario reconstructions, including multi-resolution, multi-b-value, multi-undersampling, multi-vendor, and multi-center; (c) Excellent clinical adaptability to patients with verifications by seven experienced doctors (p<0.001). In conclusion, PIDD presents a novel deep learning framework by exploiting the power of MRI physics, providing a cost-effective and explainable way to break the data bottleneck in deep learning medical imaging.
Submitted 5 February, 2024; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Fully Unsupervised Training of Few-shot Keyword Spotting
Authors:
Dongjune Lee,
Minchan Kim,
Sung Hwan Mun,
Min Hyun Han,
Nam Soo Kim
Abstract:
For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing massive target keywords has been known to be essential to generalize to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive data collection with labeling, in this paper, we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric learning enabling target keywords to be detected using distance metrics. Exploiting the speech synthesis model that generates speech with pseudo phonemes instead of texts, we easily obtain a large collection of multi-view samples with the same semantics. These samples are sufficient for training, considering metric learning does not intrinsically necessitate labeled data. None of the components in our framework requires any supervision, making our method unsupervised. Experimental results on real datasets show our proposed method is competitive even without any labeled and real datasets.
Submitted 6 October, 2022; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Accurate and Robust Lesion RECIST Diameter Prediction and Segmentation with Transformers
Authors:
Youbao Tang,
Ning Zhang,
Yirui Wang,
Shenghua He,
Mei Han,
**g Xiao,
Ruei-Sung Lin
Abstract:
Automatically measuring lesion/tumor size with RECIST (Response Evaluation Criteria In Solid Tumors) diameters and segmentation is important for computer-aided diagnosis. Although it has been studied in recent years, there is still space to improve its accuracy and robustness, such as (1) enhancing features by incorporating rich contextual information while keeping a high spatial resolution and (2) involving new tasks and losses for joint optimization. To reach this goal, this paper proposes a transformer-based network (MeaFormer, Measurement transFormer) for lesion RECIST diameter prediction and segmentation (LRDPS). It is formulated as three correlative and complementary tasks: lesion segmentation, heatmap prediction, and keypoint regression. To the best of our knowledge, it is the first time to use keypoint regression for RECIST diameter prediction. MeaFormer can enhance high-resolution features by employing transformers to capture their long-range dependencies. Two consistency losses are introduced to explicitly build relationships among these tasks for better optimization. Experiments show that MeaFormer achieves the state-of-the-art performance of LRDPS on the large-scale DeepLesion dataset and produces promising results of two downstream clinic-relevant tasks, i.e., 3D lesion segmentation and RECIST assessment in longitudinal studies.
Submitted 27 August, 2022;
originally announced August 2022.
-
On the Performance of Deep Learning-based Data-aided Active User Detection for GF-SCMA System
Authors:
Minsig Han,
Ameha Tsegaye Abebe,
Chung G. Kang
Abstract:
Recent work on a deep learning (DL)-based joint design of the preamble set for the transmitters and data-aided active user detection (AUD) in the receiver has demonstrated a significant performance improvement for grant-free sparse code multiple access (GF-SCMA) systems. The autoencoder for the joint design can be trained only in a given environment, but in an actual situation where the operating environment is constantly changing, it is difficult to optimize the preamble set for every possible environment. Therefore, a conventional, yet general approach may implement the data-aided AUD while relying on a preamble set that is designed independently rather than jointly. In this paper, the activity detection error rate (ADER) performance of the data-aided AUD subject to the two preamble designs, i.e., independently designed preamble and jointly designed preamble, was directly compared. Fortunately, it was found that the performance loss in the data-aided AUD induced by the independent preamble design is limited to only 1 dB. Furthermore, the performance characteristics of the jointly designed preamble set are interpreted through the average cross-correlation among the preambles associated with the same codebook (CB) (average intra-CB cross-correlation) and the average cross-correlation among preambles associated with different CBs (average inter-CB cross-correlation).
Submitted 5 September, 2022; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Disentangled Speaker Representation Learning via Mutual Information Minimization
Authors:
Sung Hwan Mun,
Min Hyun Han,
Minchan Kim,
Dongjune Lee,
Nam Soo Kim
Abstract:
The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework to unravel speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To achieve our goal of minimizing MI between speaker-related and speaker-unrelated features, we adopt a contrastive log-ratio upper bound (CLUB), which exploits the upper bound of MI. Our framework is constructed in a 3-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate that our proposed framework is effective for disentanglement. Also, to utilize domain-unknown datasets containing numerous speakers, we pre-trained the front-end encoder with VoxCeleb datasets. We then fine-tuned the speaker embedding model in the disentanglement framework with the FFSVC2022 dataset. The experimental results show that fine-tuning with a disentanglement framework on an existing pre-trained model is valid and can further improve performance.
Submitted 12 October, 2022; v1 submitted 16 August, 2022;
originally announced August 2022.
-
Digitally-assisted photonic analog domain self-interference cancellation for in-band full-duplex MIMO systems via LS algorithm with adaptive order
Authors:
Moxuan Han,
Yang Chen
Abstract:
A digitally-assisted photonic analog domain self-interference cancellation (SIC) and frequency downconversion method is proposed for in-band full-duplex multiple-input multiple-output (MIMO) systems using the least square (LS) algorithm with adaptive order. The SIC and frequency downconversion are achieved in the optical domain via a dual-parallel Mach-Zehnder modulator (DP-MZM), while the downconverted signal is processed by the LS algorithm with adaptive order that is used to track the response of the multipath self-interference (SI) channel and reconstruct the reference signal for SIC. The proposed method can overcome the reconstruction difficulty of the multipath analog reference signal for SIC with high complexity in the MIMO scenario and can also solve the problem that the order of the reference reconstruction algorithm is not optimized when the wireless environment changes. An experiment is carried out to verify the concept. 30.2, 26.9, 23.5, 19.5, and 15.8 dB SIC depths are achieved when the SI signal has a carrier frequency of 10 GHz and baud rates of 0.1, 0.25, 0.5, 1, and 2 Gbaud, respectively. The convergence of the LS algorithm with adaptive order is also verified for different MIMO multipath SI signals.
Submitted 3 August, 2022;
originally announced August 2022.
-
NTIRE 2022 Challenge on High Dynamic Range Imaging: Methods and Results
Authors:
Eduardo Pérez-Pellitero,
Sibi Catley-Chandar,
Richard Shaw,
Aleš Leonardis,
Radu Timofte,
Zexin Zhang,
Cen Liu,
Yunbo Peng,
Yue Lin,
Gaocheng Yu,
** Zhang,
Zhe Ma,
Hongbin Wang,
Xiangyu Chen,
Xintao Wang,
Haiwei Wu,
Lin Liu,
Chao Dong,
Jiantao Zhou,
Qingsen Yan,
Song Zhang,
Weiye Chen,
Yuhang Liu,
Zhen Zhang,
Yanning Zhang
, et al. (68 additional authors not shown)
Abstract:
This paper reviews the challenge on constrained high dynamic range (HDR) imaging that was part of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2022. This manuscript focuses on the competition set-up, datasets, the proposed methods and their results. The challenge aims at estimating an HDR image from multiple respective low dynamic range (LDR) observations, which might suffer from under- or over-exposed regions and different sources of noise. The challenge is composed of two tracks with an emphasis on fidelity and complexity constraints: In Track 1, participants are asked to optimize objective fidelity scores while imposing a low-complexity constraint (i.e. solutions can not exceed a given number of operations). In Track 2, participants are asked to minimize the complexity of their solutions while imposing a constraint on fidelity scores (i.e. solutions are required to obtain a higher fidelity score than the prescribed baseline). Both tracks use the same data and metrics: Fidelity is measured by means of PSNR with respect to a ground-truth HDR image (computed both directly and with a canonical tonemapping operation), while complexity metrics include the number of Multiply-Accumulate (MAC) operations and runtime (in seconds).
Submitted 25 May, 2022;
originally announced May 2022.
-
Data-aided Active User Detection with a User Activity Extraction Network for Grant-free SCMA Systems
Authors:
Minsig Han,
Ameha T. Abebe,
Chung G. Kang
Abstract:
In grant-free sparse code multiple access (GF-SCMA) systems, active user detection (AUD) is a major performance bottleneck as it involves a complex combinatorial problem, which makes joint design of contention resources for users and AUD at the receiver a crucial but challenging problem. To this end, we propose autoencoder (AE)-based joint optimization of both preamble generation networks (PGNs) in the encoder side and data-aided AUD in the decoder side. The core architecture of the proposed AE is a novel user activity extraction network (UAEN) in the decoder that extracts a priori user activity information from the SCMA codeword data for the data-aided AUD. An end-to-end training of the proposed AE enables joint optimization of the contention resources, i.e., preamble sequences, each associated with one of the codebooks, and extraction of user activity information from both preamble and SCMA-based data transmission. Furthermore, we propose a self-supervised pre-training scheme for the UAEN prior to the end-to-end training, to ensure the convergence of the UAEN which lies deep inside the AE network. Simulation results demonstrated that the proposed AUD scheme achieved 3 to 5 dB gain at a target activity detection error rate of $10^{-3}$ compared to the state-of-the-art DL-based AUD schemes.
Submitted 8 August, 2022; v1 submitted 22 May, 2022;
originally announced May 2022.
-
Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification
Authors:
Sung Hwan Mun,
Jee-weon Jung,
Min Hyun Han,
Nam Soo Kim
Abstract:
The majority of recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention mechanisms. Convolutional layers of these models typically have a fixed kernel size, e.g., 3 or 5. In this study, we further contribute to this line of research utilising a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select the kernel size in a data-driven fashion. It is based on an attention mechanism which exploits both the frequency and channel domains. We first apply an existing SKA module to our baseline. Then we propose two SKA variants where the first variant is applied in front of the ECAPA-TDNN model and the other is combined with the Res2net backbone block. Through extensive experiments, we demonstrate that our two proposed SKA variants consistently improve the performance and are complementary when tested on three different evaluation protocols.
Submitted 12 October, 2022; v1 submitted 3 April, 2022;
originally announced April 2022.
-
Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection
Authors:
Minglun Han,
Linhao Dong,
Zhenlin Liang,
Meng Cai,
Shiyu Zhou,
Zejun Ma,
Bo Xu
Abstract:
Nowadays, most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Since all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may encounter confusion between similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates. Moreover, we re-normalize the attention weights of most relevant phrases in inference to obtain more focused phrase-level contextual representations, and inject position information to better discriminate phrases or tokens. On LibriSpeech and an in-house 160,000-hour dataset, we explore the proposed methods based on a controllable all-neural biasing method, collaborative decoding (ColDec). The proposed methods provide at most 6.1% relative word error rate reduction on LibriSpeech and 16.4% relative character error rate reduction on the in-house dataset over ColDec.
Submitted 2 March, 2022; v1 submitted 30 January, 2022;
originally announced January 2022.
-
Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification
Authors:
Sung Hwan Mun,
Min Hyun Han,
Dongjune Lee,
Jihwan Kim,
Nam Soo Kim
Abstract:
In this paper, we propose self-supervised speaker representation learning strategies, which comprise bootstrap equilibrium speaker representation learning in the front-end and uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provide not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy can effectively help learn the speaker representations and outperforms the conventional methods based on contrastive learning. Also, we demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.
Submitted 24 December, 2021; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Photonics-based de-chirping and leakage cancellation for frequency-modulated continuous-wave radar system
Authors:
Taixia Shi,
Dingding Liang,
Moxuan Han,
Yang Chen
Abstract:
A photonics-based leakage cancellation and echo signal de-chirping approach for frequency-modulated continuous-wave radar systems is proposed based on a dual-drive Mach-Zehnder modulator (DD-MZM), with its performance evaluated by the radar measurement and imaging. The de-chirp reference signal and the leakage cancellation reference signal are combined and applied to the upper arm of the DD-MZM, while the received signal including the leakage signal and echo signals is applied to the lower arm of the DD-MZM. When the amplitudes and delays of the leakage cancellation reference signal and the leakage signal are precisely matched and the DD-MZM is biased at the minimum transmission point, the leakage signal is canceled in the optical domain. The de-chirped signals are obtained after the leakage-free optical signal is detected in a photodetector. An experiment is performed. The cancellation depth of the de-chirped leakage signal is around 23 dB when the center frequency and bandwidth of the linearly frequency-modulated signal are 11.5 and 2 GHz. The leakage cancellation scheme is used in a radar system. When the leakage cancellation is not employed, the leakage signal will seriously affect the imaging results and distance measurement accuracy of the radar system. When the leakage cancellation is applied, the imaging results of multiple targets can be clearly distinguished, and the error of the distance measurement results is significantly reduced to 10 cm.
Submitted 12 November, 2021;
originally announced November 2021.
-
Digital-assisted photonic analog wideband multipath self-interference cancellation
Authors:
Moxuan Han,
Taixia Shi,
Yang Chen
Abstract:
A digital-assisted photonic analog wideband radio-frequency multipath self-interference cancellation (SIC) and frequency downconversion method based on a dual-drive Mach-Zehnder modulator and the recursive least square (RLS) algorithm is proposed and demonstrated for in-band full-duplex systems. Besides the reference for the direct-path self-interference (SI) signal, the RLS algorithm is used to construct another reference for the residual SI signal from the direct path and the SI signals from the reflection paths. The proposed method can solve the performance limitation in the previously reported SIC methods of constructing the multipath SI signal using a single reference caused by the limited dynamic range of the digital-to-analog converter when the direct-path SI signal is much stronger than the sub-weak reflection-path SI signals. An experiment is performed. When the carrier frequency of the multipath SI signal is 10 GHz and the direct-path SI signal is much stronger than the sub-weak multipath SI signal, the cancellation depths of about 26.7 and 26.1 dB are realized with SI baud rates of 0.5 and 1 Gbaud. When the direct-path SI signal and sub-weak multipath SI signal own closer power, the corresponding cancellation depths are 24.7 and 20.8 dB, respectively.
Submitted 12 November, 2021;
originally announced November 2021.
-
Modeling and Control of an Omnidirectional Micro Aerial Vehicle Equipped with a Soft Robotic Arm
Authors:
Róbert Szász,
Mike Allenspach,
Minghao Han,
Marco Tognon,
Robert K. Katzschmann
Abstract:
Flying manipulators are aerial drones with attached rigid-bodied robotic arms and belong to the latest and most actively developed research areas in robotics. The rigid nature of these arms often results in a lack of compliance, flexibility, and smoothness in movement. This work proposes to use a soft-bodied robotic arm attached to an omnidirectional micro aerial vehicle (OMAV) to leverage the compliant and flexible behavior of the arm, while remaining maneuverable and dynamic thanks to the omnidirectional drone as the floating base. The unification of the arm with the drone poses challenges in the modeling and control of such a combined platform; these challenges are addressed in this work. We propose a unified model for the flying manipulator based on three modeling principles: the Piecewise Constant Curvature (PCC) and Augmented Rigid Body Model (ARBM) hypotheses for modeling soft continuum robots and a floating-base approach borrowed from the traditional rigid-body robotics literature. To demonstrate the validity and usefulness of this parametrisation, a hierarchical model-based feedback controller is implemented. The controller is verified and evaluated in simulation on various dynamical tasks, where the nullspace motions, disturbance recovery, and trajectory tracking capabilities of the platform are examined and validated. The soft flying manipulator platform could open new application fields in aerial construction, goods delivery, human assistance, maintenance, and warehouse automation.
Submitted 4 November, 2021;
originally announced November 2021.
-
Photonics-assisted wideband RF self-interference cancellation with digital domain amplitude and delay pre-matching
Authors:
Taixia Shi,
Moxuan Han,
Yang Chen
Abstract:
A photonics-based digital and analog self-interference cancellation approach for in-band full-duplex communication systems and frequency-modulated continuous-wave radar systems is reported. One dual-drive Mach-Zehnder modulator is used to implement the analog self-interference cancellation by pre-adjusting the delay and amplitude of the reference signal applied to the dual-drive Mach-Zehnder modulator in the digital domain. The amplitude is determined via the received signal power, while the delay is searched by the cross-correlation and bisection methods. Furthermore, recursive least squared or normalized least mean square algorithms are used to suppress the residual self-interference in the digital domain. Quadrature phase-shift keying modulated signals and linearly frequency-modulated signals are used to experimentally verify the proposed method. The analog cancellation depth is around 20 dB, and the total cancellation depth is more than 36 dB for the 2-Gbaud quadrature phase-shift keying modulated signals. For the linearly frequency-modulated signals, the analog and total cancellation depths are around 19 dB and 34 dB, respectively.
Submitted 7 September, 2021;
originally announced September 2021.
-
Coordinating Flexible Demand Response and Renewable Uncertainties for Scheduling of Community Integrated Energy Systems with an Electric Vehicle Charging Station: A Bi-level Approach
Authors:
Yang Li,
Meng Han,
Zhen Yang,
Guoqing Li
Abstract:
A community integrated energy system (CIES) with an electric vehicle charging station (EVCS) provides a new way of tackling growing concerns about energy efficiency and environmental pollution; a critical task in such a system is to coordinate flexible demand response and multiple renewable uncertainties. To this end, a novel bi-level optimal dispatching model for the CIES with an EVCS in multi-stakeholder scenarios is established in this paper. In this model, an integrated demand response program is designed to promote a balance between energy supply and demand while maintaining comprehensive user satisfaction within an acceptable range. To further tap the potential of demand response through flexibly guiding users' energy consumption and electric vehicles' behaviors (charging, discharging and providing spinning reserves), a dynamic pricing mechanism combining time-of-use and real-time pricing is put forward. In the solution phase, by using sequence operation theory (SOT), the original chance-constrained programming (CCP) model is converted into a readily solvable mixed-integer linear programming (MILP) formulation and finally solved by the CPLEX solver. The simulation results on a practical CIES located in North China demonstrate that the presented method manages to balance the interests between CIES and EVCS via the coordination of flexible demand response and uncertain renewables.
△ Less
Submitted 16 July, 2021;
originally announced July 2021.
-
Multimodal Fusion of EMG and Vision for Human Grasp Intent Inference in Prosthetic Hand Control
Authors:
Mehrshad Zandigohar,
Mo Han,
Mohammadreza Sharif,
Sezen Yagmur Gunay,
Mariusz P. Furmanek,
Mathew Yarossi,
Paolo Bonato,
Cagdas Onal,
Taskin Padir,
Deniz Erdogmus,
Gunar Schirner
Abstract:
Objective: For transradial amputees, robotic prosthetic hands promise to regain the capability to perform daily living activities. Current control methods based on physiological signals such as electromyography (EMG) are prone to yielding poor inference outcomes due to motion artifacts, muscle fatigue, and many more. Vision sensors are a major source of information about the environment state and can play a vital role in inferring feasible and intended gestures. However, visual evidence is also susceptible to its own artifacts, most often due to object occlusion, lighting changes, etc. Multimodal evidence fusion using physiological and vision sensor measurements is a natural approach due to the complementary strengths of these modalities. Methods: In this paper, we present a Bayesian evidence fusion framework for grasp intent inference using eye-view video, eye-gaze, and EMG from the forearm processed by neural network models. We analyze individual and fused performance as a function of time as the hand approaches the object to grasp it. For this purpose, we have also developed novel data processing and augmentation techniques to train neural network components. Results: Our results indicate that, on average, fusion improves the instantaneous upcoming grasp type classification accuracy while in the reaching phase by 13.66% and 14.8%, relative to EMG (81.64% non-fused) and visual evidence (80.5% non-fused) individually, resulting in an overall fusion accuracy of 95.3%. Conclusion: Our experimental data analyses demonstrate that EMG and visual evidence show complementary strengths, and as a consequence, fusion of multimodal evidence can outperform each individual evidence modality at any given time.
Submitted 27 February, 2024; v1 submitted 8 April, 2021;
originally announced April 2021.
-
Deep Learning-based Codebook Design for Code-domain Non-Orthogonal Multiple Access Approaching Single-User Bit Error Rate Performance
Authors:
Minsig Han,
Hanchang Seo,
Ameha Tsegaye Abebe,
Chung G. Kang
Abstract:
A general form of codebook design for code-domain non-orthogonal multiple access (CD-NOMA) can be considered equivalent to an autoencoder (AE)-based constellation design for multi-user multidimensional modulation (MU-MDM). Due to a constrained design space for optimal constellation, e.g., fixed resource mapping and equal power allocation to all codebooks, however, existing AE architectures produce constellations with suboptimal bit-error-rate (BER) performance. Accordingly, we propose a new architecture for MU-MDM AE and underlying training methodology for joint optimization of resource mapping and a constellation design with bit-to-symbol mapping, aiming at approaching the BER performance of a single-user MDM (SU-MDM) AE model with the same spectral efficiency. The core design of the proposed AE architecture is dense resource mapping combined with the novel power allocation layer that normalizes the sum of user codebook power across the entire resources. This globalizes the domain of the constellation design by enabling flexible resource mapping and power allocation. Furthermore, it allows the AE-based training to approach globally optimal MU-MDM constellations for CD-NOMA. Extensive BER simulation results demonstrate that the proposed design outperforms the existing CD-NOMA designs while approaching the single-user BER performance achieved by the equivalent SU-MDM AE within 0.3 dB over the additive white Gaussian noise channel.
Submitted 10 October, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition
Authors:
Minglun Han,
Linhao Dong,
Shiyu Zhou,
Bo Xu
Abstract:
End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks, and shown the potential to become the mainstream. However, the unified structure and the E2E training hamper injecting contextual information into them for contextual biasing. Though contextual LAS (CLAS) gives an excellent all-neural solution, the degree of biasing to given context information is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion. Specifically, an extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution, thus forming a collaborative decoding with the decoder of the CIF-based model. Evaluated on the named-entity-rich evaluation sets of HKUST/AISHELL-2, our method brings relative character error rate (CER) reduction of 8.83%/21.13% and relative named entity character error rate (NE-CER) reduction of 40.14%/51.50% when compared with a strong baseline. Besides, it keeps the performance on the original evaluation set without degradation.
Submitted 18 February, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
Reinforcement Learning Control of Constrained Dynamic Systems with Uniformly Ultimate Boundedness Stability Guarantee
Authors:
Minghao Han,
Yuan Tian,
Lixian Zhang,
Jun Wang,
Wei Pan
Abstract:
Reinforcement learning (RL) is promising for complicated stochastic nonlinear control problems. Without using a mathematical model, an optimal controller can be learned from data evaluated by certain performance criteria through trial-and-error. However, the data-based learning approach is notorious for not guaranteeing stability, which is the most fundamental property for any control system. In this paper, the classic Lyapunov's method is explored to analyze the uniformly ultimate boundedness stability (UUB) solely based on data without using a mathematical model. It is further shown how RL with UUB guarantee can be applied to control dynamic systems with safety constraints. Based on the theoretical results, both off-policy and on-policy learning algorithms are proposed respectively. As a result, optimal controllers can be learned to guarantee UUB of the closed-loop system both at convergence and during learning. The proposed algorithms are evaluated on a series of robotic continuous control tasks with safety constraints. In comparison with the existing RL algorithms, the proposed method can achieve superior performance in terms of maintaining safety. As a qualitative evaluation of stability, our method shows impressive resilience even in the presence of external disturbances.
Submitted 13 November, 2020;
originally announced November 2020.
-
Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning
Authors:
Sung Hwan Mun,
Woo Hyun Kang,
Min Hyun Han,
Nam Soo Kim
Abstract:
In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Also, to preserve speaker discriminability, a contrastive similarity loss function is used alongside it. Experimental results showed that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems and the best performing model achieved 8.01% and 4.01% EER on VoxCeleb1 and VOiCES evaluation sets, respectively. On top of that, the supervised speaker embedding networks trained with initial parameters pre-trained via CEL performed better than those trained with randomly initialized parameters.
Submitted 22 October, 2020;
originally announced October 2020.
-
Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020
Authors:
Sung Hwan Mun,
Woo Hyun Kang,
Min Hyun Han,
Nam Soo Kim
Abstract:
This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods (e.g., statistical, self-attentive, ghostVLAD pooling). Although the conventional pooling methods provide embeddings with a sufficient amount of speaker-dependent information, our experiments show that these embeddings often lack phrase-dependent information. To mitigate this problem, we propose new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model for taking the lexical content into account. Both methods showed improvement over the conventional techniques, and the best performance was achieved by fusing all the experimented systems, which showed 0.0785% MinDCF and 2.23% EER on the challenge's evaluation subset.
Submitted 21 October, 2020;
originally announced October 2020.
-
Universal Physiological Representation Learning with Soft-Disentangled Rateless Autoencoders
Authors:
Mo Han,
Ozan Ozdenizci,
Toshiaki Koike-Akino,
Ye Wang,
Deniz Erdogmus
Abstract:
Human computer interaction (HCI) involves a multidisciplinary fusion of technologies, through which the control of external devices could be achieved by monitoring physiological status of users. However, physiological biosignals often vary across users and recording sessions due to unstable physical/mental conditions and task-irrelevant activities. To deal with this challenge, we propose a method of adversarial feature encoding with the concept of a Rateless Autoencoder (RAE), in order to exploit disentangled, nuisance-robust, and universal representations. We achieve a good trade-off between user-specific and task-relevant features by making use of the stochastic disentanglement of the latent representations by adopting additional adversarial networks. The proposed model is applicable to a wider range of unknown users and tasks as well as different classifiers. Results on cross-subject transfer evaluations show the advantages of the proposed framework, with up to an 11.6% improvement in the average subject-transfer classification accuracy.
Submitted 28 September, 2020;
originally announced September 2020.
-
Disentangled Adversarial Autoencoder for Subject-Invariant Physiological Feature Extraction
Authors:
Mo Han,
Ozan Ozdenizci,
Ye Wang,
Toshiaki Koike-Akino,
Deniz Erdogmus
Abstract:
Recent developments in biosignal processing have enabled users to exploit their physiological status for manipulating devices in a reliable and safe manner. One major challenge of physiological sensing lies in the variability of biosignals across different users and tasks. To address this issue, we propose an adversarial feature extractor for transfer learning to exploit disentangled universal representations. We consider the trade-off between task-relevant features and user-discriminative information by introducing additional adversary and nuisance networks in order to manipulate the latent representations such that the learned feature extractor is applicable to unknown users and various tasks. Results on cross-subject transfer evaluations exhibit the benefits of the proposed framework, with up to 8.8% improvement in average accuracy of classification, and demonstrate adaptability to a broader range of subjects.
Submitted 26 August, 2020;
originally announced August 2020.
-
Disentangled speaker and nuisance attribute embedding for robust speaker verification
Authors:
Woo Hyun Kang,
Sung Hwan Mun,
Min Hyun Han,
Nam Soo Kim
Abstract:
In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by the nuisance attributes. The proposed framework was compared with the conventional deep learning-based embedding methods using the RSR2015 and VoxCeleb1 datasets. Experimental results show that the proposed approach can extract speaker embeddings robust to channel and emotional variability.
Submitted 7 August, 2020;
originally announced August 2020.
-
Hypergraph Learning for Identification of COVID-19 with CT Imaging
Authors:
Donglin Di,
Feng Shi,
Fuhua Yan,
Liming Xia,
Zhanhao Mo,
Zhongxiang Ding,
Fei Shan,
Shengrui Li,
Ying Wei,
Ying Shao,
Miaofei Han,
Yaozong Gao,
He Sui,
Yue Gao,
Dinggang Shen
Abstract:
The coronavirus disease, named COVID-19, has become the largest global public health crisis since it started in early 2020. CT imaging has been used as a complementary tool to assist early screening, especially for the rapid identification of COVID-19 cases from community acquired pneumonia (CAP) cases. The main challenge in early screening is how to model the confusing cases in the COVID-19 and CAP groups, with very similar clinical manifestations and imaging features. To tackle this challenge, we propose an Uncertainty Vertex-weighted Hypergraph Learning (UVHL) method to identify COVID-19 from CAP using CT images. In particular, multiple types of features (including regional features and radiomics features) are first extracted from CT image for each case. Then, the relationship among different cases is formulated by a hypergraph structure, with each case represented as a vertex in the hypergraph. The uncertainty of each vertex is further computed with an uncertainty score measurement and used as a weight in the hypergraph. Finally, a learning process of the vertex-weighted hypergraph is used to predict whether a new testing case belongs to COVID-19 or not. Experiments on a large multi-center pneumonia dataset, consisting of 2,148 COVID-19 cases and 1,182 CAP cases from five hospitals, are conducted to evaluate the performance of the proposed method. Results demonstrate the effectiveness and robustness of our proposed method on the identification of COVID-19 in comparison to state-of-the-art methods.
Submitted 7 May, 2020;
originally announced May 2020.
-
Actor-Critic Reinforcement Learning for Control with Stability Guarantee
Authors:
Minghao Han,
Lixian Zhang,
Jun Wang,
Wei Pan
Abstract:
Reinforcement Learning (RL) and its integration with deep learning have achieved impressive performance in various robotic control tasks, ranging from motion planning and navigation to end-to-end visual manipulation. However, stability is not guaranteed in model-free RL by solely using data. From a control-theoretic perspective, stability is the most important property for any control system, since it is closely related to safety, robustness, and reliability of robotic systems. In this paper, we propose an actor-critic RL framework for control which can guarantee closed-loop stability by employing the classic Lyapunov's method in control theory. First of all, a data-based stability theorem is proposed for stochastic nonlinear systems modeled by a Markov decision process. Then we show that the stability condition could be exploited as the critic in the actor-critic RL to learn a controller/policy. Finally, the effectiveness of our approach is evaluated on several well-known 3-dimensional robot control tasks and a synthetic biology gene network tracking task in three different popular physics simulation platforms. As an empirical evaluation of the advantage of stability, we show that the learned policies can enable the systems to recover, to a certain extent, to the equilibrium or way-points when disturbed by uncertainties such as system parametric variations and external disturbances.
Submitted 15 July, 2020; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Disentangled Adversarial Transfer Learning for Physiological Biosignals
Authors:
Mo Han,
Ozan Ozdenizci,
Ye Wang,
Toshiaki Koike-Akino,
Deniz Erdogmus
Abstract:
Recent developments in wearable sensors demonstrate promising results for monitoring physiological status in effective and comfortable ways. One major challenge of physiological status assessment is the problem of transfer learning caused by the domain inconsistency of biosignals across users or different recording sessions from the same user. We propose an adversarial inference approach for transfer learning to extract disentangled nuisance-robust representations from physiological biosignal data in stress status level assessment. We exploit the trade-off between task-related features and person-discriminative information by using both an adversary network and a nuisance network to jointly manipulate and disentangle the learned latent representations by the encoder, which are then input to a discriminative classifier. Results on cross-subjects transfer evaluations demonstrate the benefits of the proposed adversarial framework, and thus show its capabilities to adapt to a broader range of subjects. Finally we highlight that our proposed adversarial transfer learning approach is also applicable to other deep feature learning frameworks.
Submitted 14 April, 2020;
originally announced April 2020.
-
Lung Infection Quantification of COVID-19 in CT Images with Deep Learning
Authors:
Fei Shan,
Yaozong Gao,
Jun Wang,
Weiya Shi,
Nannan Shi,
Miaofei Han,
Zhong Xue,
Dinggang Shen,
Yuxin Shi
Abstract:
CT imaging is crucial for diagnosis, assessment, and staging of COVID-19 infection. Follow-up scans every 3-5 days are often recommended for disease progression. It has been reported that bilateral and peripheral ground glass opacification (GGO) with or without consolidation are predominant CT findings in COVID-19 patients. However, due to lack of computerized quantification tools, only qualitative impression and rough description of infected areas are currently used in radiological reports. In this paper, a deep learning (DL)-based segmentation system is developed to automatically quantify infection regions of interest (ROIs) and their volumetric ratios w.r.t. the lung. The performance of the system was evaluated by comparing the automatically segmented infection regions with the manually-delineated ones on 300 chest CT scans of 300 COVID-19 patients. For fast manual delineation of training samples and possible manual intervention of automatic results, a human-in-the-loop (HITL) strategy has been adopted to assist radiologists for infection region segmentation, which dramatically reduced the total segmentation time to 4 minutes after 3 iterations of model updating. The average Dice similarity coefficient showed 91.6% agreement between automatic and manual infection segmentations, and the mean estimation error of percentage of infection (POI) was 0.3% for the whole lung. Finally, possible applications, including but not limited to analysis of follow-up CT scans and infection distributions in the lobes and segments correlated with clinical findings, were discussed.
Submitted 30 March, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
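Illustrative sketch (not taken from the paper): the two evaluation quantities named in the abstract can be computed from binary masks as shown below. The Dice similarity coefficient is 2|A ∩ B| / (|A| + |B|), and the percentage of infection is taken here as the share of lung voxels labeled as infected. The flat 0/1 mask representation, the function names, and the toy masks are assumptions made for illustration only.

```python
def dice_coefficient(pred, truth):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


def percentage_of_infection(infection_mask, lung_mask):
    """Percentage of lung voxels that are also labeled as infected."""
    lung_voxels = sum(1 for v in lung_mask if v)
    if lung_voxels == 0:
        raise ValueError("lung mask is empty")
    infected = sum(1 for i, l in zip(infection_mask, lung_mask) if i and l)
    return 100.0 * infected / lung_voxels


# Toy 8-voxel example (hypothetical values).
automatic = [0, 1, 1, 1, 0, 0, 1, 0]
manual = [0, 1, 1, 0, 0, 0, 1, 1]
lung = [1, 1, 1, 1, 1, 1, 1, 1]

dice = dice_coefficient(automatic, manual)
poi_error = abs(
    percentage_of_infection(automatic, lung)
    - percentage_of_infection(manual, lung)
)
```

Here the masks share three infected voxels out of four each, so the Dice value is 0.75, while both masks mark half of the lung as infected, so the POI error is zero. This shows that the two metrics measure different things: overlap versus total infected volume.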
-
Attention based on-device streaming speech recognition with large speech corpus
Authors:
Kwangyoun Kim,
Kyungmin Lee,
Dhananjaya Gowda,
Junmo Park,
Sungsoo Kim,
Sichen **,
Young-Yoon Lee,
**su Yeo,
Daehyun Kim,
Seokyeong Jung,
Jungin Lee,
Myoungji Han,
Chanwoo Kim
Abstract:
In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain mainly by using joint training of connectionist temporal classifier (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training and data augmentation methods. In addition, we compressed our models by more than 3.4 times using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization to bring down the final model size to lower than 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, and we could achieve a relative improvement of 36% on average in word error rate (WER) for target domains including the general domain.
Submitted 1 January, 2020;
originally announced January 2020.
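Illustrative sketch (not taken from the paper): the abstract mentions 8-bit quantization without specifying the scheme. The snippet below shows one simple variant, symmetric per-tensor linear quantization to the integer range [-127, 127], which stores each weight in one byte plus a single shared scale. The function names and weight values are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weight values.
weights = [0.5, -1.27, 0.0, 0.02]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is the accuracy cost paid for storing 8 bits per weight instead of 32.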
-
The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 Challenge
Authors:
Nicholas Heller,
Fabian Isensee,
Klaus H. Maier-Hein,
Xiaoshuai Hou,
Chunmei Xie,
Fengyi Li,
Yang Nan,
Guangrui Mu,
Zhiyong Lin,
Miofei Han,
Guang Yao,
Yaozong Gao,
Yao Zhang,
Yixin Wang,
Feng Hou,
Jiawei Yang,
Guangwei Xiong,
Jiang Tian,
Cheng Zhong,
Jun Ma,
Jack Rickman,
Joshua Dean,
Bethany Stai,
Resha Tejpaul,
Makinna Oestreich
, et al. (16 additional authors not shown)
Abstract:
There is a large body of literature linking anatomic and geometric characteristics of kidney tumors to perioperative and oncologic outcomes. Semantic segmentation of these tumors and their host kidneys is a promising tool for quantitatively characterizing these lesions, but its adoption is limited due to the manual effort required to produce high-quality 3D segmentations of these structures. Recently, methods based on deep learning have shown excellent results in automatic 3D segmentation, but they require large datasets for training, and there remains little consensus on which methods perform best. The 2019 Kidney and Kidney Tumor Segmentation challenge (KiTS19) was a competition held in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) which sought to address these issues and stimulate progress on this automatic segmentation problem. A training set of 210 cross-sectional CT images with kidney tumors was publicly released with corresponding semantic segmentation masks. 106 teams from five continents used this data to develop automated systems to predict the true segmentation masks on a test set of 90 CT images for which the corresponding ground truth segmentations were kept private. These predictions were scored and ranked according to their average Sørensen-Dice coefficient between the kidney and tumor across all 90 cases. The winning team achieved a Dice of 0.974 for kidney and 0.851 for tumor, approaching the inter-annotator performance on kidney (0.983) but falling short on tumor (0.923). This challenge has now entered an "open leaderboard" phase where it serves as a challenging benchmark in 3D semantic segmentation.
Submitted 7 August, 2020; v1 submitted 2 December, 2019;
originally announced December 2019.
-
$H_\infty$ Model-free Reinforcement Learning with Robust Stability Guarantee
Authors:
Minghao Han,
Yuan Tian,
Lixian Zhang,
Jun Wang,
Wei Pan
Abstract:
Reinforcement learning is showing great potential in robotics applications, including autonomous driving, robot manipulation and locomotion. However, with complex uncertainties in the real-world environment, it is difficult to guarantee the successful generalization and sim-to-real transfer of learned policies theoretically. In this paper, we introduce and extend the idea of robust stability and $H_\infty$ control to design policies with both stability and robustness guarantees. Specifically, a sample-based approach for analyzing the Lyapunov stability and performance robustness of a learning-based control system is proposed. Based on the theoretical results, a maximum entropy algorithm is developed for searching for a Lyapunov function and designing a policy with a provable robust stability guarantee. Without any specific domain knowledge, our method can find a policy that is robust to various uncertainties and generalizes well to different test environments. In our experiments, we show that our method achieves better robustness to both large impulsive disturbances and parametric variations in the environment than the state-of-the-art results in both robust and generic RL, as well as classic control. Anonymous code is available to reproduce the experimental results at https://github.com/RobustStabilityGuaranteeRL/RobustStabilityGuaranteeRL.
Submitted 25 July, 2020; v1 submitted 7 November, 2019;
originally announced November 2019.