Search | arXiv e-print repository

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

Authors: Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K. Jha, Yuchen Liu

Abstract: Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable… ▽ More Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. Project webpage: https://atedm.github.io. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

arXiv:2403.19105 [pdf, ps, other]

Pilot Signal and Channel Estimator Co-Design for Hybrid-Field XL-MIMO

Authors: Yoonseong Kang, Hyowoon Seo, Wan Choi

Abstract: This paper addresses the intricate task of hybrid-field channel estimation in extremely large-scale MIMO (XL-MIMO) systems, critical for the progression of 6G communications. Within these systems, comprising a line-of-sight (LoS) channel component alongside far-field and near-field scattering channel components, our objective is to tackle the channel estimation challenge. We encounter two central… ▽ More This paper addresses the intricate task of hybrid-field channel estimation in extremely large-scale MIMO (XL-MIMO) systems, critical for the progression of 6G communications. Within these systems, comprising a line-of-sight (LoS) channel component alongside far-field and near-field scattering channel components, our objective is to tackle the channel estimation challenge. We encounter two central hurdles for ensuring dependable sparse channel recovery: the design of pilot signals and channel estimators tailored for hybrid-field communications. To overcome the first challenge, we propose a method to derive optimal pilot signals, aimed at minimizing the mutual coherence of the sensing matrix within the context of compressive sensing (CS) problems. These optimal signals are derived using the alternating direction method of multipliers (ADMM), ensuring robust performance in sparse channel recovery. Additionally, leveraging the acquired optimal pilot signal, we introduce a two-stage channel estimation approach that sequentially estimates the LoS channel component and the hybrid-field scattering channel components. Simulation results attest to the superiority of our co-designed approach for pilot signal and channel estimation over conventional CS-based methods, providing more reliable sparse channel recovery in practical scenarios. △ Less

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.07355 [pdf, ps, other]

Vector Quantization for Deep-Learning-Based CSI Feedback in Massive MIMO Systems

Authors: Junyong Shin, Yu** Kang, Yo-Seb Jeon

Abstract: This paper presents a finite-rate deep-learning (DL)-based channel state information (CSI) feedback method for massive multiple-input multiple-output (MIMO) systems. The presented method provides a finite-bit representation of the latent vector based on a vector-quantized variational autoencoder (VQ-VAE) framework while reducing its computational complexity based on shape-gain vector quantization.… ▽ More This paper presents a finite-rate deep-learning (DL)-based channel state information (CSI) feedback method for massive multiple-input multiple-output (MIMO) systems. The presented method provides a finite-bit representation of the latent vector based on a vector-quantized variational autoencoder (VQ-VAE) framework while reducing its computational complexity based on shape-gain vector quantization. In this method, the magnitude of the latent vector is quantized using a non-uniform scalar codebook with a proper transformation function, while the direction of the latent vector is quantized using a trainable Grassmannian codebook. A multi-rate codebook design strategy is also developed by introducing a codeword selection rule for a nested codebook along with the design of a loss function. Simulation results demonstrate that the proposed method reduces the computational complexity associated with VQ-VAE while improving CSI reconstruction performance under a given feedback overhead. △ Less

Submitted 12 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2402.11492 [pdf, other]

Exponential Cluster Synchronization in Fast Switching Network Topologies: A Pinning Control Approach with Necessary and Sufficient Conditions

Authors: Ku Du, Yu Kang

Abstract: This research investigates the intricate domain of synchronization problem among multiple agents operating within a dynamic fast switching network topology. We concentrate on cluster synchronization within coupled linear system under pinning control, providing both necessary and sufficient conditions. As a pivotal aspect, this paper aim to president the weakest possible conditions to make the coup… ▽ More This research investigates the intricate domain of synchronization problem among multiple agents operating within a dynamic fast switching network topology. We concentrate on cluster synchronization within coupled linear system under pinning control, providing both necessary and sufficient conditions. As a pivotal aspect, this paper aim to president the weakest possible conditions to make the coupled linear system realize cluster synchronization exponentially. Within the context of fast switching framework, we initially examine the necessary conditions, commencing with the transformation of the consensus problem into a stability problem, introducing a new variable to make the coupled system achieve cluster synchronization if the system is controllable; communication topology switching fast enough and the coupling strength should be sufficiently robust. Then, by using the Lyapunov theorem, we also present that the state matrix controllable is necessary for cluster synchronization. Furthermore, this paper culminating in the incorporation of contraction theory and an invariant manifold, demonstrating that the switching topology has an average is imperative for achieving cluster synchronization. Finally, we introduce three simulations to validate the efficacy of the proposed approach. △ Less

Submitted 18 February, 2024; originally announced February 2024.

arXiv:2312.05279 [pdf]

Quantitative perfusion maps using a novelty spatiotemporal convolutional neural network

Authors: Anbo Cao, Pin-Yu Le, Zhonghui Qie, Haseeb Hassan, Yingwei Guo, Asim Zaman, Jiaxi Lu, Xueqiang Zeng, Huihui Yang, Xiaoqiang Miao, Taiyu Han, Guangtao Huang, Yan Kang, Yu Luo, Jia Guo

Abstract: Dynamic susceptibility contrast magnetic resonance imaging (DSC-MRI) is widely used to evaluate acute ischemic stroke to distinguish salvageable tissue and infarct core. For this purpose, traditional methods employ deconvolution techniques, like singular value decomposition, which are known to be vulnerable to noise, potentially distorting the derived perfusion parameters. However, deep learning t… ▽ More Dynamic susceptibility contrast magnetic resonance imaging (DSC-MRI) is widely used to evaluate acute ischemic stroke to distinguish salvageable tissue and infarct core. For this purpose, traditional methods employ deconvolution techniques, like singular value decomposition, which are known to be vulnerable to noise, potentially distorting the derived perfusion parameters. However, deep learning technology could leverage it, which can accurately estimate clinical perfusion parameters compared to traditional clinical approaches. Therefore, this study presents a perfusion parameters estimation network that considers spatial and temporal information, the Spatiotemporal Network (ST-Net), for the first time. The proposed network comprises a designed physical loss function to enhance model performance further. The results indicate that the network can accurately estimate perfusion parameters, including cerebral blood volume (CBV), cerebral blood flow (CBF), and time to maximum of the residual function (Tmax). The structural similarity index (SSIM) mean values for CBV, CBF, and Tmax parameters were 0.952, 0.943, and 0.863, respectively. The DICE score for the hypo-perfused region reached 0.859, demonstrating high consistency. The proposed model also maintains time efficiency, closely approaching the performance of commercial gold-standard software. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2310.10992 [pdf, other]

A High Fidelity and Low Complexity Neural Audio Coding

Authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide… ▽ More Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide-band components and adopts traditional signal processing to compress high-band components according to psychological hearing knowledge. Inspired by auditory perception theory, a perception-based loss function is designed to improve harmonic modeling. Besides, generative adversarial network (GAN) compression is proposed for the first time for neural audio codecs. Our method is superior to prior advanced neural codecs across subjective and objective metrics and allows real-time inference on desktop and mobile. △ Less

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2308.11639 [pdf, other]

An Empirical Study on Fault Detection and Root Cause Analysis of Indium Tin Oxide Electrodes by Processing S-parameter Patterns

Authors: Tae Yeob Kang, Haebom Lee, Sungho Suh

Abstract: In the field of optoelectronics, indium tin oxide (ITO) electrodes play a crucial role in various applications, such as displays, sensors, and solar cells. Effective fault diagnosis and root cause analysis of the ITO electrodes are essential to ensure the performance and reliability of the devices. However, traditional visual inspection is challenging with transparent ITO electrodes, and existing… ▽ More In the field of optoelectronics, indium tin oxide (ITO) electrodes play a crucial role in various applications, such as displays, sensors, and solar cells. Effective fault diagnosis and root cause analysis of the ITO electrodes are essential to ensure the performance and reliability of the devices. However, traditional visual inspection is challenging with transparent ITO electrodes, and existing fault diagnosis methods have limitations in determining the root causes of the defects, often requiring destructive evaluations and secondary material characterization techniques. In this study, a fault diagnosis method with root cause analysis is proposed using scattering parameter (S-parameter) patterns, offering early detection, high diagnostic accuracy, and noise robustness. A comprehensive S-parameter pattern database is obtained according to various defect states of the ITO electrodes. Deep learning (DL) approaches, including multilayer perceptron (MLP), convolutional neural network (CNN), and transformer, are then used to simultaneously analyze the cause and severity of defects. Notably, it is demonstrated that the diagnostic performance under additive noise levels can be significantly enhanced by combining different channels of the S-parameters as input to the learning algorithms, as confirmed through the t-distributed stochastic neighbor embedding (t-SNE) dimension reduction visualization of the S-parameter patterns. △ Less

Submitted 10 June, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

Comments: Accepted in IEEE Transactions on Device and Materials Reliability

arXiv:2304.14225 [pdf, ps, other]

Once and for All: Scheduling Multiple Users Using Statistical CSI under Fixed Wireless Access

Authors: Xin Guan, Zhixing Chen, Yibin Kang, Qingjiang Shi

Abstract: Conventional multi-user scheduling schemes are designed based on instantaneous channel state information (CSI), indicating that decisions must be made every transmission time interval (TTI) which lasts at most several milliseconds. Only quite simple approaches can be exploited under this stringent time constraint, resulting in less than satisfactory scheduling performance. In this paper, we invest… ▽ More Conventional multi-user scheduling schemes are designed based on instantaneous channel state information (CSI), indicating that decisions must be made every transmission time interval (TTI) which lasts at most several milliseconds. Only quite simple approaches can be exploited under this stringent time constraint, resulting in less than satisfactory scheduling performance. In this paper, we investigate the scheduling problem of a fixed wireless access (FWA) network using only statistical CSI. Thanks to their fixed positions, user terminals in FWA can easily provide reliable large-scale CSI lasting tens or even hundreds of TTIs. Inspired by this appealing fact, we propose an \emph{`once-and-for-all'} scheduling approach, i.e. given multiple TTIs sharing identical statistical CSI, only a single high-quality scheduling decision lasting across all TTIs shall be taken rather than repeatedly making low-quality decisions every TTI. The proposed scheduling design is essentially a mixed-integer non-smooth non-convex stochastic problem with the objective of maximizing the weighted sum rate as well as the number of active users. We firstly replace the indicator functions in the considered problem by well-chosen sigmoid functions to tackle the non-smoothness. Via leveraging deterministic equivalent technique, we then convert the original stochastic problem into an approximated deterministic one, followed by linear relaxation of the integer constraints. However, the converted problem is still highly non-convex due to implicit equation constraints introduced by deterministic equivalent. To address this issue, we employ implicit optimization technique so that the gradient can be derived explicitly, with which we propose an algorithm design based on accelerated Frank-Wolfe method. Numerical results verify the effectiveness of our proposed scheduling scheme over state-of-the-art. △ Less

Submitted 27 April, 2023; originally announced April 2023.

Comments: 12 pages,6 figures

arXiv:2303.09278 [pdf, other]

DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model

Authors: Yanzhe Fu, Yueteng Kang, Songjun Cao, Long Ma

Abstract: Wav2vec 2.0 (W2V2) has shown impressive performance in automatic speech recognition (ASR). However, the large model size and the non-streaming architecture make it hard to be used under low-resource or streaming scenarios. In this work, we propose a two-stage knowledge distillation method to solve these two problems: the first step is to make the big and non-streaming teacher model smaller, and th… ▽ More Wav2vec 2.0 (W2V2) has shown impressive performance in automatic speech recognition (ASR). However, the large model size and the non-streaming architecture make it hard to be used under low-resource or streaming scenarios. In this work, we propose a two-stage knowledge distillation method to solve these two problems: the first step is to make the big and non-streaming teacher model smaller, and the second step is to make it streaming. Specially, we adopt the MSE loss for the distillation of hidden layers and the modified LF-MMI loss for the distillation of the prediction layer. Experiments are conducted on Gigaspeech, Librispeech, and an in-house dataset. The results show that the distilled student model (DistillW2V2) we finally get is 8x faster and 12x smaller than the original teacher model. For the 480ms latency setup, the DistillW2V2's relative word error rate (WER) degradation varies from 9% to 23.4% on test sets, which reveals a promising way to extend the W2V2's application scope. △ Less

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2212.05397 [pdf, other]

Task modules Partitioning, Scheduling and Floorplanning for Partially Dynamically Reconfigurable Systems Based on Modern Heterogeneous FPGAs

Authors: Bo Ding, **glei Huang, Junpeng Wang, Qi Xu, Song Chen, Yi Kang

Abstract: Modern field programmable gate array(FPGA) can be partially dynamically reconfigurable with heterogeneous resources distributed on the chip. And FPGA-based partially dynamically reconfigurable system(FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, the traditional design of FPGA-PDRS is based on manual design. Implementing the automation of FPGA-PDRS n… ▽ More Modern field programmable gate array(FPGA) can be partially dynamically reconfigurable with heterogeneous resources distributed on the chip. And FPGA-based partially dynamically reconfigurable system(FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, the traditional design of FPGA-PDRS is based on manual design. Implementing the automation of FPGA-PDRS needs to solve the problems of task modules partitioning, scheduling, and floorplanning on heterogeneous resources. Existing works only partly solve problems for the automation process of FPGA-PDRS or model homogeneous resource for FPGA-PDRS. To better solve the problems in the automation process of FPGA-PDRS and narrow the gap between algorithm and application, in this paper, we propose a complete workflow including three parts, pre-processing to generate the list of task modules candidate shapes according to the resources requirements, exploration process to search the solution of task modules partitioning, scheduling, and floorplanning, and post-optimization to improve the success rate of floorplan. Experimental results show that, compared with state-of-the-art work, the proposed complete workflow can improve performance by 18.7\%, reduce communication cost by 8.6\%, on average, with improving the resources reuse rate of the heterogeneous resources on the chip. And based on the solution generated by the exploration process, the post-optimization can improve the success rate of the floorplan by 14\%. △ Less

Submitted 10 December, 2022; originally announced December 2022.

arXiv:2207.08998 [pdf]

Discovering novel systemic biomarkers in photos of the external eye

Authors: Boris Babenko, Ilana Traynis, Christina Chen, Preeti Singh, Akib Uddin, Jorge Cuadros, Lauren P. Daskivich, April Y. Maa, Ramasamy Kim, Eugene Yu-Chuan Kang, Yossi Matias, Greg S. Corrado, Lily Peng, Dale R. Webster, Christopher Semturs, Jonathan Krause, Avinash V. Varadarajan, Naama Hammel, Yun Liu

Abstract: External eye photos were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate if external eye photos contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photos as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidn… ▽ More External eye photos were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate if external eye photos contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photos as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidney (eGFR estimated using the race-free 2021 CKD-EPI creatinine equation, the urine ACR); bone & mineral (calcium); thyroid (TSH); and blood count (Hgb, WBC, platelets). Development leveraged 151,237 images from 49,015 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA. Evaluation focused on 9 pre-specified systemic parameters and leveraged 3 validation sets (A, B, C) spanning 28,869 patients with and without diabetes undergoing eye screening in 3 independent sites in Los Angeles County, CA, and the greater Atlanta area, GA. We compared against baseline models incorporating available clinicodemographic variables (e.g. age, sex, race/ethnicity, years with diabetes). Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST>36, calcium<8.6, eGFR<60, Hgb<11, platelets<150, ACR>=300, and WBC<4 on validation set A (a patient population similar to the development sets), where the AUC of DLS exceeded that of the baseline by 5.2-19.4%. On validation sets B and C, with substantial patient population differences compared to the development sets, the DLS outperformed the baseline for ACR>=300 and Hgb<11 by 7.3-13.2%. Our findings provide further evidence that external eye photos contain important biomarkers of systemic health spanning multiple organ systems. Further work is needed to investigate whether and how these biomarkers can be translated into clinical impact. △ Less

Submitted 18 July, 2022; originally announced July 2022.

arXiv:2206.13042 [pdf, other]

A Strategy Optimized Pix2pix Approach for SAR-to-Optical Image Translation Task

Authors: Fujian Cheng, Yashu Kang, Chunlei Chen, Kezhao Jiang

Abstract: This technical report summarizes the analysis and approach on the image-to-image translation task in the Multimodal Learning for Earth and Environment Challenge (MultiEarth 2022). In terms of strategy optimization, cloud classification is utilized to filter optical images with dense cloud coverage to aid the supervised learning alike approach. The commonly used pix2pix framework with a few optimiz… ▽ More This technical report summarizes the analysis and approach on the image-to-image translation task in the Multimodal Learning for Earth and Environment Challenge (MultiEarth 2022). In terms of strategy optimization, cloud classification is utilized to filter optical images with dense cloud coverage to aid the supervised learning alike approach. The commonly used pix2pix framework with a few optimizations is applied to build the model. A weighted combination of mean squared error and mean absolute error is incorporated in the loss function. As for evaluation, peak to signal ratio and structural similarity were both considered in our preliminary analysis. Lastly, our method achieved the second place with a final error score of 0.0412. The results indicate great potential towards SAR-to-optical translation in remote sensing tasks, specifically for the support of long-term environmental monitoring and protection. △ Less

Submitted 4 July, 2022; v1 submitted 27 June, 2022; originally announced June 2022.

arXiv:2206.09756 [pdf, other]

Time Gated Convolutional Neural Networks for Crop Classification

Authors: Longlong Weng, Yashu Kang, Kezhao Jiang, Chunlei Chen

Abstract: This paper presented a state-of-the-art framework, Time Gated Convolutional Neural Network (TGCNN) that takes advantage of temporal information and gating mechanisms for the crop classification problem. Besides, several vegetation indices were constructed to expand dimensions of input data to take advantage of spectral information. Both spatial (channel-wise) and temporal (step-wise) correlation a… ▽ More This paper presented a state-of-the-art framework, Time Gated Convolutional Neural Network (TGCNN) that takes advantage of temporal information and gating mechanisms for the crop classification problem. Besides, several vegetation indices were constructed to expand dimensions of input data to take advantage of spectral information. Both spatial (channel-wise) and temporal (step-wise) correlation are considered in TGCNN. Specifically, our preliminary analysis indicates that step-wise information is of greater importance in this data set. Lastly, the gating mechanism helps capture high-order relationship. Our TGCNN solution achieves $0.973$ F1 score, $0.977$ AUC ROC and $0.948$ IoU, respectively. In addition, it outperforms three other benchmarks in different local tasks (Kenya, Brazil and Togo). Overall, our experiments demonstrate that TGCNN is advantageous in this earth observation time series classification task. △ Less

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2205.02845 [pdf, other]

Invariant Content Synergistic Learning for Domain Generalization of Medical Image Segmentation

Authors: Yuxin Kang, Hansheng Li, Xuan Zhao, Dongqing Hu, Feihong Liu, Lei Cui, Jun Feng, Lin Yang

Abstract: While achieving remarkable success for medical image segmentation, deep convolution neural networks (DCNNs) often fail to maintain their robustness when confronting test data with the novel distribution. To address such a drawback, the inductive bias of DCNNs is recently well-recognized. Specifically, DCNNs exhibit an inductive bias towards image style (e.g., superficial texture) rather than invar… ▽ More While achieving remarkable success for medical image segmentation, deep convolution neural networks (DCNNs) often fail to maintain their robustness when confronting test data with the novel distribution. To address such a drawback, the inductive bias of DCNNs is recently well-recognized. Specifically, DCNNs exhibit an inductive bias towards image style (e.g., superficial texture) rather than invariant content (e.g., object shapes). In this paper, we propose a method, named Invariant Content Synergistic Learning (ICSL), to improve the generalization ability of DCNNs on unseen datasets by controlling the inductive bias. First, ICSL mixes the style of training instances to perturb the training distribution. That is to say, more diverse domains or styles would be made available for training DCNNs. Based on the perturbed distribution, we carefully design a dual-branches invariant content synergistic learning strategy to prevent style-biased predictions and focus more on the invariant content. Extensive experimental results on two typical medical image segmentation tasks show that our approach performs better than state-of-the-art domain generalization methods. △ Less

Submitted 5 May, 2022; originally announced May 2022.

Comments: 10 pages, 5 figures

arXiv:2204.04645 [pdf, other]

Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Authors: Yu Kang, Tianqiao Liu, Hang Li, Yang Hao, Wenbiao Ding

Abstract: Multimodal pre-training for audio-and-text has recently been proved to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-training audio-text models work well only when provided with large amount of parallel audio-and-text data, which brings challenges on many languages that are rich in unimodal corpora but… ▽ More Multimodal pre-training for audio-and-text has recently been proved to be effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art pre-training audio-text models work well only when provided with large amount of parallel audio-and-text data, which brings challenges on many languages that are rich in unimodal corpora but scarce of parallel cross-modal corpus. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which is able to reconstruct input text (audio) representations from a noisy version of itself. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio), given both a noisy version of the input text (audio) and the corresponding translated noisy audio features (text embeddings). (3) Iterative Denoising Process (IDP), which iteratively translates raw audio (text) and the corresponding text embeddings (audio features) translated from previous iteration into the new less-noisy text embeddings (audio features). We adapt a dual cross-modal Transformer as our backbone model which consists of two unimodal encoders for IDAE and two cross-modal encoders for CDAE and IDP. Our method achieves comparable performance on multiple downstream speech understanding tasks compared with the model pre-trained on fully parallel data, demonstrating the great potential of the proposed method. Our code is available at: \url{https://github.com/KarlYuKang/Low-Resource-Multimodal-Pre-training}. △ Less

Submitted 10 April, 2022; originally announced April 2022.

Comments: AAAI 2022

arXiv:2203.14476 [pdf]

A Novel Remote Sensing Approach to Recognize and Monitor Red Palm Weevil in Date Palm Trees

Authors: Yashu Kang, Chunlei Chen, Fujian Cheng, Jianyong Zhang

Abstract: The spread of the Red Pal Weevil (RPW) has become an existential threat for palm trees around the world. In the Middle East, RPW is causing wide-spread damage to date palm Phoenix dactylifera L., having both agricultural impacts on the palm production and environmental impacts. Early detection of RPW is very challenging, especially at large scale. This research proposes a novel remote sensing appr… ▽ More The spread of the Red Pal Weevil (RPW) has become an existential threat for palm trees around the world. In the Middle East, RPW is causing wide-spread damage to date palm Phoenix dactylifera L., having both agricultural impacts on the palm production and environmental impacts. Early detection of RPW is very challenging, especially at large scale. This research proposes a novel remote sensing approach to recognize and monitor red palm weevil in date palm trees, using a combination of vegetation indices, object detection and semantic segmentation techniques. The study area consists of date palm trees with three classes, including healthy palms, smallish palms and severely infected palms. This proposed method achieved a promising 0.947 F1 score on test data set. This work paves the way for deploying artificial intelligence approaches to monitor RPW in large-scale as well as provide guidance for practitioners. △ Less

Submitted 27 March, 2022; originally announced March 2022.

arXiv:2109.07327 [pdf, ps, other]

Improving Streaming Transformer Based ASR Under a Framework of Self-supervised Learning

Authors: Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma

Abstract: Recently self-supervised learning has emerged as an effective approach to improve the performance of automatic speech recognition (ASR). Under such a framework, the neural network is usually pre-trained with massive unlabeled data and then fine-tuned with limited labeled data. However, the non-streaming architecture like bidirectional transformer is usually adopted by the neural network to achieve… ▽ More Recently self-supervised learning has emerged as an effective approach to improve the performance of automatic speech recognition (ASR). Under such a framework, the neural network is usually pre-trained with massive unlabeled data and then fine-tuned with limited labeled data. However, the non-streaming architecture like bidirectional transformer is usually adopted by the neural network to achieve competitive results, which can not be used in streaming scenarios. In this paper, we mainly focus on improving the performance of streaming transformer under the self-supervised learning framework. Specifically, we propose a novel two-stage training method during fine-tuning, which combines knowledge distilling and self-training. The proposed training method achieves 16.3% relative word error rate (WER) reduction on Librispeech noisy test set. Finally, by only using the 100h clean subset of Librispeech as the labeled data and the rest (860h) as the unlabeled data, our streaming transformer based model obtains competitive WERs 3.5/8.7 on Librispeech clean/noisy test sets. △ Less

Submitted 15 September, 2021; originally announced September 2021.

Comments: INTERSPEECH2021

arXiv:2107.07956 [pdf, other]

A Multimodal Machine Learning Framework for Teacher Vocal Delivery Evaluation

Authors: Hang Li, Yu Kang, Yang Hao, Wenbiao Ding, Zhongqin Wu, Zitao Liu

Abstract: The quality of vocal delivery is one of the key indicators for evaluating teacher enthusiasm, which has been widely accepted to be connected to the overall course qualities. However, existing evaluation for vocal delivery is mainly conducted with manual ratings, which faces two core challenges: subjectivity and time-consuming. In this paper, we present a novel machine learning approach that utiliz… ▽ More The quality of vocal delivery is one of the key indicators for evaluating teacher enthusiasm, which has been widely accepted to be connected to the overall course qualities. However, existing evaluation for vocal delivery is mainly conducted with manual ratings, which faces two core challenges: subjectivity and time-consuming. In this paper, we present a novel machine learning approach that utilizes pairwise comparisons and a multimodal orthogonal fusing algorithm to generate large-scale objective evaluation results of the teacher vocal delivery in terms of fluency and passion. We collect two datasets from real-world education scenarios and the experiment results demonstrate the effectiveness of our algorithm. To encourage reproducible results, we make our code public available at \url{https://github.com/tal-ai/ML4VocalDelivery.git}. △ Less

Submitted 15 July, 2021; originally announced July 2021.

Comments: AIED'21: The 22nd International Conference on Artificial Intelligence in Education, 2021

arXiv:2104.01471 [pdf, other]

Adversarial Joint Training with Self-Attention Mechanism for Robust End-to-End Speech Recognition

Authors: Lujun Li, Yikai Kang, Yuchen Shi, Ludwig Kürzinger, Tobias Watzel, Gerhard Rigoll

Abstract: Lately, the self-attention mechanism has marked a new milestone in the field of automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions as the system predicts the next output symbol depending on the full input sequence and the previous predictions. Inspired by the extensive applications of the generative adversarial networks (GANs) in speech enh… ▽ More Lately, the self-attention mechanism has marked a new milestone in the field of automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions as the system predicts the next output symbol depending on the full input sequence and the previous predictions. Inspired by the extensive applications of the generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. Generally, it consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. There are two highlights which are worth noting in this proposed framework. One is that it benefits from the advancement of both self-attention mechanism and GANs; while the other is that the discriminator of GAN plays the role of the global discriminant network in the stage of the adversarial joint training, which guides the enhancement front-end to capture more compatible structures for the subsequent ASR module and thereby offsets the limitation of the separate training and handcrafted loss functions. With the adversarial joint optimization, the proposed framework is expected to learn more robust representations suitable for the ASR task. We execute systematic experiments on the corpus AISHELL-1, and the experimental results show that on the artificial noisy test set, the proposed framework achieves the relative improvements of 66% compared to the ASR model trained by clean data solely, 35.1% compared to the speech enhancement & ASR scheme without joint training, and 5.3% compared to multi-condition training. △ Less

Submitted 3 April, 2021; originally announced April 2021.

arXiv:2103.09407 [pdf, ps, other]

Model-Free Design of Stochastic LQR Controller from Reinforcement Learning and Primal-Dual Optimization Perspective

Authors: Man Li, Jiahu Qin, Wei Xing Zheng, Yaonan Wang, Yu Kang

Abstract: To further understand the underlying mechanism of various reinforcement learning (RL) algorithms and also to better use the optimization theory to make further progress in RL, many researchers begin to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple and yet captures the characteristics of RL. Inspired by this, this work is concerned with the model-free design of stoch… ▽ More To further understand the underlying mechanism of various reinforcement learning (RL) algorithms and also to better use the optimization theory to make further progress in RL, many researchers begin to revisit the linear-quadratic regulator (LQR) problem, whose setting is simple and yet captures the characteristics of RL. Inspired by this, this work is concerned with the model-free design of stochastic LQR controller for linear systems subject to Gaussian noises, from the perspective of both RL and primal-dual optimization. From the RL perspective, we first develop a new model-free off-policy policy iteration (MF-OPPI) algorithm, in which the sampled data is repeatedly used for updating the policy to alleviate the data-hungry problem to some extent. We then provide a rigorous analysis for algorithm convergence by showing that the involved iterations are equivalent to the iterations in the classical policy iteration (PI) algorithm. From the perspective of optimization, we first reformulate the stochastic LQR problem at hand as a constrained non-convex optimization problem, which is shown to have strong duality. Then, to solve this non-convex optimization problem, we propose a model-based primal-dual (MB-PD) algorithm based on the properties of the resulting Karush-Kuhn-Tucker (KKT) conditions. We also give a model-free implementation for the MB-PD algorithm by solving a transformed dual feasibility condition. More importantly, we show that the dual and primal update steps in the MB-PD algorithm can be interpreted as the policy evaluation and policy improvement steps in the PI algorithm, respectively. Finally, we provide one simulation example to show the performance of the proposed algorithms. △ Less

Submitted 16 March, 2021; originally announced March 2021.

arXiv:1910.13799 [pdf, other]

Multimodal Learning For Classroom Activity Detection

Authors: Hang Li, Yu Kang, Wenbiao Ding, Song Yang, Songfan Yang, Gale Yan Huang, Zitao Liu

Abstract: Classroom activity detection (CAD) focuses on accurately classifying whether the teacher or student is speaking and recording both the length of individual utterances during a class. A CAD solution helps teachers get instant feedback on their pedagogical instructions. This greatly improves educators' teaching skills and hence leads to students' achievement. However, CAD is very challenging because… ▽ More Classroom activity detection (CAD) focuses on accurately classifying whether the teacher or student is speaking and recording both the length of individual utterances during a class. A CAD solution helps teachers get instant feedback on their pedagogical instructions. This greatly improves educators' teaching skills and hence leads to students' achievement. However, CAD is very challenging because (1) the CAD model needs to be generalized well enough for different teachers and students; (2) data from both vocal and language modalities has to be wisely fused so that they can be complementary; and (3) the solution shouldn't heavily rely on additional recording device. In this paper, we address the above challenges by using a novel attention based neural framework. Our framework not only extracts both speech and language information, but utilizes attention mechanism to capture long-term semantic dependence. Our framework is device-free and is able to take any classroom recording as input. The proposed CAD learning framework is evaluated in two real-world education applications. The experimental results demonstrate the benefits of our approach on learning attention based neural network from classroom data with different modalities, and show our approach is able to outperform state-of-the-art baselines in terms of various evaluation metrics. △ Less

Submitted 10 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: The 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

arXiv:1905.02200 [pdf, other]

doi 10.1080/23729333.2019.1615729

Transferring Multiscale Map Styles Using Generative Adversarial Networks

Authors: Yuhao Kang, Song Gao, Robert E. Roth

Abstract: The advancement of the Artificial Intelligence (AI) technologies makes it possible to learn stylistic design criteria from existing maps or other visual art and transfer these styles to make new digital maps. In this paper, we propose a novel framework using AI for map style transfer applicable across multiple map scales. Specifically, we identify and transfer the stylistic elements from a target… ▽ More The advancement of the Artificial Intelligence (AI) technologies makes it possible to learn stylistic design criteria from existing maps or other visual art and transfer these styles to make new digital maps. In this paper, we propose a novel framework using AI for map style transfer applicable across multiple map scales. Specifically, we identify and transfer the stylistic elements from a target group of visual examples, including Google Maps, OpenStreetMap, and artistic paintings, to unstylized GIS vector data through two generative adversarial network (GAN) models. We then train a binary classifier based on a deep convolutional neural network to evaluate whether the transfer styled map images preserve the original map design characteristics. Our experiment results show that GANs have a great potential for multiscale map style transferring, but many challenges remain requiring future research. △ Less

Submitted 18 May, 2019; v1 submitted 6 May, 2019; originally announced May 2019.

Comments: 12 pages, 17 figure

ACM Class: I.2.1; I.4.9

Journal ref: International Journal of Cartography, 2019

arXiv:1901.05294 [pdf, other]

doi 10.1016/j.jfranklin.2020.01.008

Design of generalized fractional order gradient descent method

Authors: Yiheng Wei, Yu Kang, Weidi Yin, Yong Wang

Abstract: This paper focuses on the convergence problem of the emerging fractional order gradient descent method, and proposes three solutions to overcome the problem. In fact, the general fractional gradient method cannot converge to the real extreme point of the target function, which critically hampers the application of this method. Because of the long memory characteristics of fractional derivative, fi… ▽ More This paper focuses on the convergence problem of the emerging fractional order gradient descent method, and proposes three solutions to overcome the problem. In fact, the general fractional gradient method cannot converge to the real extreme point of the target function, which critically hampers the application of this method. Because of the long memory characteristics of fractional derivative, fixed memory principle is a prior choice. Apart from the truncation of memory length, two new methods are developed to reach the convergence. The one is the truncation of the infinite series, and the other is the modification of the constant fractional order. Finally, six illustrative examples are performed to illustrate the effectiveness and practicability of proposed methods. △ Less

Submitted 16 February, 2020; v1 submitted 24 December, 2018; originally announced January 2019.

Comments: 8 pages, 16 figures

MSC Class: 26A33; 90C25 ACM Class: F.2.2

arXiv:1810.02558 [pdf, other]

Optimal Denial-of-Service Attack Energy Management over an SINR-Based Network

Authors: Jiahu Qin, Menglin Li, Ling Shi, Yu Kang

Abstract: We consider a scenario in which a DoS attacker with the limited power resource jams a wireless network through which the packet from a sensor is sent to a remote estimator to estimate the system state. To degrade the estimation quality with power constraint, the attacker aims to solve how much power to obstruct the channel each time, which is the recently proposed optimal attack energy management… ▽ More We consider a scenario in which a DoS attacker with the limited power resource jams a wireless network through which the packet from a sensor is sent to a remote estimator to estimate the system state. To degrade the estimation quality with power constraint, the attacker aims to solve how much power to obstruct the channel each time, which is the recently proposed optimal attack energy management problem. The existing works are built on an ideal link model in which the packet dropout never occurs without attack. To encompass wireless transmission losses, we introduce the SINR-based link. First, we focus on the case when the attacker employs the constant power level. To maximize the terminal error at the remote estimator, we provide some sufficient conditions for the existence of an explicit solution to the optimal static attack energy management problem and the solution is constructed. Compared with the existing result in which corresponding sufficient conditions work only when the system matrix is normal, the obtained conditions in this paper are viable for a general system and shown to be more relaxed. For the other system index, the average error, the associated sufficient conditions are also derived based on different analysis with the existing work. And a feasible method is presented for both indexes when the system cannot meet the sufficient conditions. Then when the real-time ACK information can be acquired, an MDP based algorithm is designed to solve the optimal dynamic attack energy management problem. We further study the optimal tradeoff between attack power and system degradation. By moving power constraint into the objective function to maximize system index and minimize energy consumption, the other MDP based algorithm is proposed to find the optimal attack policy which is further shown to have a monotone structure. The theoretical results are illustrated by simulations. △ Less

Submitted 5 October, 2018; originally announced October 2018.

arXiv:1810.02531 [pdf, other]

Randomized Consensus based Distributed Kalman Filtering over Wireless Sensor Networks

Authors: Jiahu Qin, Jie Wang, Ling Shi, Yu Kang

Abstract: This paper is concerned with develo** a novel distributed Kalman filtering algorithm over wireless sensor networks based on randomized consensus strategy. Compared with the centralized algorithm, distributed filtering techniques require less computation per sensor and lead to more robust estimation since they simply use the information from the neighboring nodes in the network. However, poor loc… ▽ More This paper is concerned with develo** a novel distributed Kalman filtering algorithm over wireless sensor networks based on randomized consensus strategy. Compared with the centralized algorithm, distributed filtering techniques require less computation per sensor and lead to more robust estimation since they simply use the information from the neighboring nodes in the network. However, poor local sensor estimation caused by limited observability and network topology changes which interfere the global consensus are challenging issues. Motivated by this observation, we propose a novel randomized gossip-based distributed Kalman filtering algorithm. Information exchange and computation in the proposed algorithm can be carried out in an arbitrarily connected network of nodes. In addition, the computational burden can be distributed for a sensor which communicates with a stochastically selected neighbor at each clock step under schemes of gossip algorithm. In this case, the error covariance matrix changes stochastically at every clock step, thus the convergence is considered in a probabilistic sense. We provide the mean square convergence analysis of the proposed algorithm. Under a sufficient condition, we show that the proposed algorithm is quite appealing as it achieves better mean square error performance theoretically than the noncooperative decentralized Kalman filtering algorithm. Besides, considering the limited computation, communication, and energy resources in the wireless sensor networks, we propose an optimization problem which minimizes the average expected state estimation error based on the proposed algorithm. To solve the proposed problem efficiently, we transform it into a convex optimization problem. And a sub-optimal solution is attained. Examples and simulations are provided to illustrate the theoretical results. △ Less

Submitted 5 October, 2018; originally announced October 2018.

arXiv:1807.05855 [pdf]

A Fast-Converged Acoustic Modeling for Korean Speech Recognition: A Preliminary Study on Time Delay Neural Network

Authors: Hosung Park, Donghyun Lee, Minkyu Lim, Yoseb Kang, Juneseok Oh, Ji-Hwan Kim

Abstract: In this paper, a time delay neural network (TDNN) based acoustic model is proposed to implement a fast-converged acoustic modeling for Korean speech recognition. The TDNN has an advantage in fast-convergence where the amount of training data is limited, due to subsampling which excludes duplicated weights. The TDNN showed an absolute improvement of 2.12% in terms of character error rate compared t… ▽ More In this paper, a time delay neural network (TDNN) based acoustic model is proposed to implement a fast-converged acoustic modeling for Korean speech recognition. The TDNN has an advantage in fast-convergence where the amount of training data is limited, due to subsampling which excludes duplicated weights. The TDNN showed an absolute improvement of 2.12% in terms of character error rate compared to feed forward neural network (FFNN) based modelling for Korean speech corpora. The proposed model converged 1.67 times faster than a FFNN-based model did. △ Less

Submitted 11 July, 2018; originally announced July 2018.

Comments: 6 pages, 2 figures

arXiv:1806.09276 [pdf]

EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System

Authors: Hao Li, Yongguo Kang, Zhenyu Wang

Abstract: We present EMPHASIS, an emotional phoneme-based acoustic model for speech synthesis system. EMPHASIS includes a phoneme duration prediction model and an acoustic parameter prediction model. It uses a CBHG-based regression network to model the dependencies between linguistic features and acoustic features. We modify the input and output layer structures of the network to improve the performance. Fo… ▽ More We present EMPHASIS, an emotional phoneme-based acoustic model for speech synthesis system. EMPHASIS includes a phoneme duration prediction model and an acoustic parameter prediction model. It uses a CBHG-based regression network to model the dependencies between linguistic features and acoustic features. We modify the input and output layer structures of the network to improve the performance. For the linguistic features, we apply a feature grou** strategy to enhance emotional and prosodic features. The acoustic parameters are designed to be suitable for the regression task and waveform reconstruction. EMPHASIS can synthesize speech in real-time and generate expressive interrogative and exclamatory speech with high audio quality. EMPHASIS is designed to be a multi-lingual model and can synthesize Mandarin-English speech for now. In the experiment of emotional speech synthesis, it achieves better subjective results than other real-time speech synthesis systems. △ Less

Submitted 25 June, 2018; v1 submitted 24 June, 2018; originally announced June 2018.

Comments: Accepted by Interspeech 2018

arXiv:1806.08619 [pdf, other]

Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions

Authors: Yu Gu, Yongguo Kang

Abstract: This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Different from the original WaveNet model, the proposed Multi-task WaveNet employs the frame-level acoustic feature prediction as the secondary task and the external fundamental frequency prediction model for the original WaveNet can be remov… ▽ More This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Different from the original WaveNet model, the proposed Multi-task WaveNet employs the frame-level acoustic feature prediction as the secondary task and the external fundamental frequency prediction model for the original WaveNet can be removed. Therefore the improved WaveNet can generate high-quality speech waveforms only conditioned on linguistic features. Multi-task WaveNet can produce more natural and expressive speech by addressing the pitch prediction error accumulation issue and possesses more succinct inference procedures than the original WaveNet. Experimental results prove that the SPSS method proposed in this paper can achieve better performance than the state-of-the-art approach utilizing the original WaveNet in both objective and subjective preference tests. △ Less

Submitted 22 June, 2018; originally announced June 2018.

Comments: Accepted by Interspeech 2018

arXiv:1711.03536 [pdf, other]

Picasso, Matisse, or a Fake? Automated Analysis of Drawings at the Stroke Level for Attribution and Authentication

Authors: Ahmed Elgammal, Yan Kang, Milko Den Leeuw

Abstract: This paper proposes a computational approach for analysis of strokes in line drawings by artists. We aim at develo** an AI methodology that facilitates attribution of drawings of unknown authors in a way that is not easy to be deceived by forged art. The methodology used is based on quantifying the characteristics of individual strokes in drawings. We propose a novel algorithm for segmenting ind… ▽ More This paper proposes a computational approach for analysis of strokes in line drawings by artists. We aim at develo** an AI methodology that facilitates attribution of drawings of unknown authors in a way that is not easy to be deceived by forged art. The methodology used is based on quantifying the characteristics of individual strokes in drawings. We propose a novel algorithm for segmenting individual strokes. We designed and compared different hand-crafted and learned features for the task of quantifying stroke characteristics. We also propose and compare different classification methods at the drawing level. We experimented with a dataset of 300 digitized drawings with over 80 thousands strokes. The collection mainly consisted of drawings of Pablo Picasso, Henry Matisse, and Egon Schiele, besides a small number of representative works of other artists. The experiments shows that the proposed methodology can classify individual strokes with accuracy 70%-90%, and aggregate over drawings with accuracy above 80%, while being robust to be deceived by fakes (with accuracy 100% for detecting fakes in most settings). △ Less

Submitted 8 November, 2017; originally announced November 2017.

Showing 1–29 of 29 results for author: Kang, Y