Search | arXiv e-print repository

doi 10.1109/TASLP.2024.3350893

Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Authors: Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

Abstract: In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly due to the predictor network of C-T models not functioning as a pure language model. Instead, FNT shows its potential in utilizing long-content information, where we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on LibriSpeech and GigaSpeech corpora, and obtains relative 19% and 12% word error rate (WER) reduction, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, which is named SLongFNT, consisting of SLongFNT-Text and SLongFNT-Speech approaches to utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative 26% and 17% WER reduction on LibriSpeech and GigaSpeech respectively while keeping a good latency, compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.

Submitted 20 March, 2024; originally announced March 2024.

Comments: Accepted by TASLP 2024

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1803-1815, 2024
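
Note: the "relative" reductions quoted in the abstract above are fractions of the baseline error rate, not differences in percentage points. A minimal sketch of that arithmetic follows; the function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error-rate reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only; they are not results from the paper.
baseline = 10.0  # baseline WER, in percent
improved = 8.1   # WER after adding long-content context, in percent
print(f"relative WER reduction: {relative_reduction(baseline, improved):.1%}")
```

With these made-up values the absolute drop is 1.9 points, while the relative reduction is 1.9 / 10.0 of the baseline.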

arXiv:2402.17043 [pdf, other]

Traffic Control via Connected and Automated Vehicles: An Open-Road Field Experiment with 100 CAVs

Authors: Jonathan W. Lee, Han Wang, Kathy Jang, Amaury Hayat, Matthew Bunting, Arwa Alanqary, William Barbour, Zhe Fu, Xiaoqian Gong, George Gunter, Sharon Hornstein, Abdul Rahman Kreidieh, Nathan Lichtlé, Matthew W. Nice, William A. Richardson, Adit Shah, Eugene Vinitsky, Fangyu Wu, Shengquan Xiang, Sulaiman Almatrudi, Fahd Althukair, Rahul Bhadani, Joy Carpio, Raphael Chekroun, Eric Cheng , et al. (39 additional authors not shown)

Abstract: The CIRCLES project aims to reduce instabilities in traffic flow, which are naturally occurring phenomena due to human driving behavior. These "phantom jams" or "stop-and-go waves," are a significant source of wasted energy. Toward this goal, the CIRCLES project designed a control system referred to as the MegaController by the CIRCLES team, that could be deployed in real traffic. Our field experiment leveraged a heterogeneous fleet of 100 longitudinally-controlled vehicles as Lagrangian traffic actuators, each of which ran a controller with the architecture described in this paper. The MegaController is a hierarchical control architecture, which consists of two main layers. The upper layer is called Speed Planner, and is a centralized optimal control algorithm. It assigns speed targets to the vehicles, conveyed through the LTE cellular network. The lower layer is a control layer, running on each vehicle. It performs local actuation by overriding the stock adaptive cruise controller, using the stock on-board sensors. The Speed Planner ingests live data feeds provided by third parties, as well as data from our own control vehicles, and uses both to perform the speed assignment. The architecture of the speed planner allows for modular use of standard control techniques, such as optimal control, model predictive control, kernel methods and others, including Deep RL, model predictive control and explicit controllers. Depending on the vehicle architecture, all onboard sensing data can be accessed by the local controllers, or only some. Control inputs vary across different automakers, with inputs ranging from torque or acceleration requests for some cars, and electronic selection of ACC set points in others. The proposed architecture allows for the combination of all possible settings proposed above. Most configurations were tested throughout the ramp up to the MegaVandertest.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2308.03591 [pdf, other]

On Data-Driven Modeling and Control in Modern Power Grids Stability: Survey and Perspective

Authors: Xun Gong, Xiaozhe Wang, Bo Cao

Abstract: Modern power grids are fast evolving with increasingly volatile renewable generation, distributed energy resources (DERs) and time-varying operating conditions. The DERs include rooftop photovoltaic (PV), small wind turbines, energy storage, flexible loads, electric vehicles (EVs), etc. The grid control is confronted with low inertia, uncertainty and nonlinearity that challenge the operation security, efficacy and efficiency. The ongoing digitization of power grids provides opportunities to address the challenges with data-driven modeling and control. This paper provides a comprehensive review of emerging data-driven dynamical modeling and control methods and their various applications in power grids. Future trends are also discussed based on advances in data-driven control.

Submitted 7 August, 2023; originally announced August 2023.

Comments: To appear in Applied Energy

arXiv:2307.05383 [pdf]

Human Emotion Recognition Based On Galvanic Skin Response signal Feature Selection and SVM

Authors: Di Fan, Mingyang Liu, Xiaohan Zhang, Xiaopeng Gong

Abstract: A novel human emotion recognition method based on automatically selected Galvanic Skin Response (GSR) signal features and SVM is proposed in this paper. GSR signals were acquired by e-Health Sensor Platform V2.0. Then, the data is de-noised by a wavelet function and normalized to remove individual differences. 30 features are extracted from the normalized data; however, directly using these features leads to a low recognition rate. In order to obtain the optimized features, a covariance-based feature selection is employed in our method. Finally, an SVM with the optimized features as input is utilized to achieve the human emotion recognition. The experimental results indicate that the proposed method leads to good human emotion recognition, and the recognition accuracy is more than 66.67%.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2305.13947 [pdf, ps, other]

Deep-Learning-Aided Alternating Least Squares for Tensor CP Decomposition and Its Application to Massive MIMO Channel Estimation

Authors: Xiao Gong, Wei Chen, Bo Ai, Geert Leus

Abstract: CANDECOMP/PARAFAC (CP) decomposition is the most widely used model to formulate the received tensor signal in a multi-domain massive multiple-input multiple-output (MIMO) system, as the receiver generally sums the components from different paths or users. To achieve accurate and low-latency channel estimation, good and fast CP decomposition algorithms are desired. The CP alternating least squares (CPALS) is the workhorse algorithm for calculating the CP decomposition. However, its performance depends on the initializations, and good starting values can lead to more efficient solutions. Existing initialization strategies are decoupled from the CPALS and are not necessarily favorable for solving the CP decomposition. To enhance the algorithm's speed and accuracy, this paper proposes a deep-learning-aided CPALS (DL-CPALS) method that uses a deep neural network (DNN) to generate favorable initializations. The proposed DL-CPALS integrates the DNN and CPALS into a model-based deep learning paradigm, where it trains the DNN to generate an initialization that facilitates fast and accurate CP decomposition. Moreover, benefiting from the CP low-rankness, the proposed method is trained using noisy data and does not require paired clean data. The proposed DL-CPALS is applied to millimeter wave MIMO orthogonal frequency division multiplexing (mmWave MIMO-OFDM) channel estimation. Experimental results demonstrate the significant improvements of the proposed method in terms of both speed and accuracy for CP decomposition and channel estimation.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.10788 [pdf, other]

Whisper-KDQ: A Lightweight Whisper via Guided Knowledge Distillation and Quantization for Efficient ASR

Authors: Hang Shao, Wei Wang, Bei Liu, Xun Gong, Haoyu Wang, Yanmin Qian

Abstract: Due to the rapid development of computing hardware resources and the dramatic growth of data, pre-trained models in speech recognition, such as Whisper, have significantly improved the performance of speech recognition tasks. However, these models usually have a high computational overhead, making it difficult to execute effectively on resource-constrained devices. To speed up inference and reduce model size while maintaining performance, we propose a novel guided knowledge distillation and quantization method for the large pre-trained model Whisper. The student model selects distillation and quantization layers based on quantization loss and distillation loss, respectively. We compressed $\text{Whisper}_\text{small}$ to $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$ levels, making $\text{Whisper}_\text{small}$ 5.18x/10.48x smaller, respectively. Moreover, compared to the original $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$, there is also a relative character error rate (CER) reduction of 11.3% and 14.0% for the new compressed model respectively.

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2304.00974 [pdf, other]

Optimal Resource Allocation between Two Nonfully Cooperative Wireless Networks under Malicious Attacks: A Gestalt Game Perspective

Authors: Yukang Cui, Xinru Yang, Tingwen Huang, Xin Gong

Abstract: In this paper, the problem of seeking optimal distributed resource allocation (DRA) policies on cellular networks in the presence of an unknown malicious adding-edge attacker is investigated. This problem is described as the games of games (GoG) model. Specifically, two subnetwork policymakers constitute a Nash game, while the confrontation between each subnetwork policymaker and the attacker is captured by a Stackelberg game. First, we show that the communication resource allocation of cellular networks based on the Foschini-Miljanic (FM) algorithm can be transformed into a geometric program and be efficiently solved via convex optimization. Second, the upper limit of attack magnitude that can be tolerated by the network is calculated by the corresponding theory, and it is proved that the above geometric programming (GP) framework is solvable within the attack bound, that is, there exists a Gestalt Nash equilibrium (GNE) in our GoG. Third, a heuristic algorithm that iteratively uses GP is proposed to identify the optimal policy profiles of both subnetworks, for which asymptotic convergence is also confirmed. Fourth, a greedy heuristic adding-edge strategy is developed for the attacker to determine the set of the most vulnerable edges. Finally, simulation examples illustrate that the proposed theoretical results are robust and can achieve the GNE. It is verified that the transmission gains and interference gains of all channels are well tuned within a limited budget, despite the existence of malicious attacks.

Submitted 22 March, 2023; originally announced April 2023.
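
Note: the abstract above builds on the Foschini-Miljanic (FM) algorithm. As background only, here is a minimal sketch of the classical FM distributed power-control iteration, in which each link rescales its transmit power by the ratio of its target SINR to its measured SINR. It is not the paper's geometric-programming formulation or its game model. The convention that `gains[i][j]` is the gain from transmitter j to receiver i, and all numeric values, are assumptions made for illustration.

```python
def sinr(i, powers, gains, noise):
    """Signal-to-interference-plus-noise ratio seen by receiver i."""
    interference = sum(
        gains[i][j] * powers[j] for j in range(len(powers)) if j != i
    )
    return gains[i][i] * powers[i] / (interference + noise[i])


def foschini_miljanic(gains, noise, targets, iterations=200):
    """Classical FM update: p_i <- (target_i / SINR_i) * p_i, applied in parallel."""
    powers = [1.0] * len(targets)
    for _ in range(iterations):
        current = [sinr(i, powers, gains, noise) for i in range(len(powers))]
        powers = [targets[i] / current[i] * powers[i] for i in range(len(powers))]
    return powers


# Hypothetical two-link example (values chosen so the targets are feasible).
gains = [[1.0, 0.1],
         [0.2, 1.0]]
noise = [0.1, 0.1]
targets = [2.0, 2.0]

powers = foschini_miljanic(gains, noise, targets)
print([round(sinr(i, powers, gains, noise), 6) for i in range(2)])
```

When the SINR targets are jointly feasible, the iteration converges to the minimal power vector that meets them; for infeasible targets the powers grow without bound, so a practical version would cap them.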

arXiv:2303.15299 [pdf, other]

Resilient Output Consensus Control of Heterogeneous Multi-agent Systems against Byzantine Attacks: A Twin Layer Approach

Authors: Xin Gong, Yiwen Liang, Yukang Cui, Shi Liang, Tingwen Huang

Abstract: This paper studies the problem of cooperative control of heterogeneous multi-agent systems (MASs) against Byzantine attacks. The agent affected by Byzantine attacks sends different wrong values to all neighbors while applying wrong input signals for itself, which is aggressive and difficult to defend against. Inspired by the concept of Digital Twin, a new hierarchical protocol equipped with a virtual twin layer (TL) is proposed, which decouples the above problems into the defense scheme against Byzantine edge attacks on the TL and the defense scheme against Byzantine node attacks on the cyber-physical layer (CPL). On the TL, we propose a resilient topology reconfiguration strategy by adding a minimum number of key edges to improve network resilience. It is strictly proved that the control strategy is sufficient to achieve asymptotic consensus in finite time with the topology on the TL satisfying strongly $(2f+1)$-robustness. On the CPL, decentralized chattering-free controllers are proposed to guarantee the resilient output consensus for the heterogeneous MASs against Byzantine node attacks. Moreover, the obtained controller shows exponential convergence. The effectiveness and practicality of the theoretical results are verified by numerical examples.

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2303.12823 [pdf, other]

Data-Driven Leader-following Consensus for Nonlinear Multi-Agent Systems against Composite Attacks: A Twins Layer Approach

Authors: Xin Gong, Jintao Peng, Dong Yang, Zhan Shu, Tingwen Huang, Yukang Cui

Abstract: This paper studies the leader-following consensus of uncertain and nonlinear multi-agent systems against composite attacks (CAs), including Denial of Service (DoS) attacks and actuation attacks (AAs). A double-layer control framework is formulated, where a digital twin layer (TL) is added beside the traditional cyber-physical layer (CPL), inspired by the recent Digital Twin technology. Consequently, the resilient control task against CAs can be divided into two parts: one is distributed estimation against DoS attacks on the TL and the other is resilient decentralized tracking control against actuation attacks on the CPL. The data-driven scheme is used to deal with both model non-linearity and model uncertainty, in which only the input and output data of the system are employed throughout the whole control process. First, a distributed observer based on a switching estimation law against DoS is designed on the TL. Second, a distributed model free adaptive control (DMFAC) protocol based on attack compensation against AAs is designed on the CPL. Moreover, the uniformly ultimately bounded convergence of the consensus error of the proposed double-layer DMFAC algorithm is strictly proved. Finally, the simulation verifies the effectiveness of the resilient double-layer control scheme.

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2303.12693 [pdf, other]

Resilient Output Containment Control of Heterogeneous Multiagent Systems Against Composite Attacks: A Digital Twin Approach

Authors: Yukang Cui, Lingbo Cao, Michael V. Basin, Jun Shen, Tingwen Huang, Xin Gong

Abstract: This paper studies the distributed resilient output containment control of heterogeneous multiagent systems against composite attacks, including denial-of-services (DoS) attacks, false-data injection (FDI) attacks, camouflage attacks, and actuation attacks. Inspired by digital twins, a twin layer (TL) with higher security and privacy is used to decouple the above problem into two tasks: defense protocols against DoS attacks on TL and defense protocols against actuation attacks on cyber-physical layer (CPL). First, considering modeling errors of leader dynamics, we introduce distributed observers to reconstruct the leader dynamics for each follower on TL under DoS attacks. Second, distributed estimators are used to estimate follower states according to the reconstructed leader dynamics on the TL. Third, according to the reconstructed leader dynamics, we design decentralized solvers that calculate the output regulator equations on CPL. Fourth, decentralized adaptive attack-resilient control schemes that resist unbounded actuation attacks are provided on CPL. Furthermore, we apply the above control protocols to prove that the followers can achieve uniformly ultimately bounded (UUB) convergence, and the upper bound of the UUB convergence is determined explicitly. Finally, two simulation examples are provided to show the effectiveness of the proposed control protocols.

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2302.12434 [pdf, other]

Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion

Authors: Jiangyi Deng, Yanjiao Chen, Yinan Zhong, Qianhao Miao, Xueluan Gong, Wenyuan Xu

Abstract: Voice conversion (VC) techniques can be abused by malicious parties to transform their audios to sound like a target speaker, making it hard for a human being or a speaker verification/identification system to trace the source speaker. In this paper, we make the first attempt to restore the source voiceprint from audios synthesized by voice conversion methods with high credit. However, unveiling the features of the source speaker from a converted audio is challenging since the voice conversion operation intends to disentangle the original features and infuse the features of the target speaker. To fulfill our goal, we develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples. We equip Revelio with a carefully-designed differential rectification algorithm to eliminate the influence of the target speaker by removing the representation component that is parallel to the voiceprint of the target speaker. We have conducted extensive experiments to evaluate the capability of Revelio in restoring voiceprint from audios converted by VQVC, VQVC+, AGAIN, and BNE. The experiments verify that Revelio is able to rebuild voiceprints that can be traced to the source speaker by speaker verification and identification systems. Revelio also exhibits robust performance under inter-gender conversion, unseen languages, and telephony networks.

Submitted 23 February, 2023; originally announced February 2023.

Comments: Accepted by USENIX Security Symposium 2023. Please cite this paper as "Jiangyi Deng, Yanjiao Chen, Yinan Zhong, Qianhao Miao, Xueluan Gong, Wenyuan Xu. Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion. In 32nd USENIX Security Symposium (USENIX Security 23)."
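
Note: the abstract above describes removing the representation component that is parallel to the target speaker's voiceprint. The geometric idea behind that phrase is plain vector rejection, sketched below. This is an assumption-laden illustration only: the function name and the toy vectors are hypothetical, and the abstract does not specify how Revelio's differential rectification algorithm is actually implemented.

```python
def remove_parallel_component(embedding, target):
    """Subtract from `embedding` its projection onto `target`.

    The result is orthogonal to `target`. Both arguments are plain lists of
    floats with the same length.
    """
    dot = sum(e * t for e, t in zip(embedding, target))
    norm_sq = sum(t * t for t in target)
    if norm_sq == 0.0:
        # A zero target defines no direction, so there is nothing to remove.
        return list(embedding)
    scale = dot / norm_sq
    return [e - scale * t for e, t in zip(embedding, target)]


# Toy three-dimensional vectors, purely illustrative.
converted = [3.0, 4.0, 0.0]          # hypothetical embedding of converted audio
target_voiceprint = [1.0, 0.0, 0.0]  # hypothetical target-speaker voiceprint
residual = remove_parallel_component(converted, target_voiceprint)
print(residual)
```

The residual has zero dot product with the target vector, which is the sense in which the target speaker's influence is "removed" in this simplified picture.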

arXiv:2301.01461 [pdf, other]

A Novel Koopman-Inspired Method for the Secondary Control of Microgrids with Grid-Forming and Grid-Following Sources

Authors: Xun Gong, Xiaozhe Wang

Abstract: This paper proposes an online data-driven Koopman-inspired identification and control method for microgrid secondary voltage and frequency control. Unlike typical data-driven methods, the proposed method requires no warm-up training yet with guaranteed bounded-input-bounded-output (BIBO) stability and even asymptotic stability under some mild conditions. The proposed method estimates the Koopman state space model adaptively so as to perform effective secondary voltage and frequency control that can handle microgrid nonlinearity and uncertainty. Case studies in the 4-bus and 13-bus microgrid test systems (with grid-forming and grid-following sources) demonstrate the effectiveness and robustness of the proposed identification and control method subject to the change of operating conditions and large disturbances (e.g., microgrid mode transitions, generation/load variations) even with measurement noises and time delays.

Submitted 4 January, 2023; originally announced January 2023.

Comments: Accepted by Applied Energy for future publication

arXiv:2211.09412 [pdf, other]

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

Authors: Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

Abstract: Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to longer transcription history for a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, containing a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. Moreover, we propose the LongFNT architecture by extending the long-form speech to the original speech input and achieve the best performance. The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP2023

arXiv:2211.03789 [pdf]

doi 10.3389/fenrg.2021.708456

A Random Forest and Current Fault Texture Feature-Based Method for Current Sensor Fault Diagnosis in Three-Phase PWM VSR

Authors: Lei Kou, Xiao-dong Gong, Yi Zheng, Xiu-hui Ni, Yang Li, Quan-de Yuan, Ya-nan Dong

Abstract: Three-phase PWM voltage-source rectifier (VSR) systems have been widely used in various energy conversion systems, where current sensors are the key component for state monitoring and system control. The current sensor faults may bring hidden danger or damage to the whole system; therefore, this paper proposed a random forest (RF) and current fault texture feature-based method for current sensor fault diagnosis in three-phase PWM VSR systems. First, the three-phase alternating currents (ACs) of the three-phase PWM VSR are collected to extract the current fault texture features, and no additional hardware sensors are needed to avoid causing additional unstable factors. Then, the current fault texture features are adopted to train the random forest current sensor fault detection and diagnosis (CSFDD) classifier, which is a data-driven CSFDD classifier. Finally, the effectiveness of the proposed method is verified by simulation experiments. The result shows that the current sensor faults can be detected and located successfully and that it can effectively provide fault locations for maintenance personnel to keep the stable operation of the whole system.

Submitted 7 November, 2022; originally announced November 2022.

Comments: Frontiers in Energy Research

MSC Class: 68Q04; ACM Class: I.2

arXiv:2211.00221 [pdf]

doi 10.3390/s22082822

Review on Monitoring, Operation and Maintenance of Smart Offshore Wind Farms

Authors: Lei Kou, Yang Li, Fangfang Zhang, Xiaodong Gong, Yinghong Hu, Quande Yuan, Wende Ke

Abstract: In recent years, with the development of wind energy, the number and scale of wind farms are developing rapidly. Because offshore wind farms have the advantages of stable wind speed, clean, renewable and non-polluting generation, and no occupation of cultivated land, they have gradually become a new trend of the wind power industry all over the world. The operation and maintenance mode of offshore wind power is developing in the direction of digitization and intelligence. It is of great significance to carry out research on the monitoring, operation and maintenance of offshore wind farms, which will help to reduce the operation and maintenance cost, improve the power generation efficiency, improve the stability of offshore wind farm systems and build smart offshore wind farms. This paper will mainly analyze and summarize the monitoring, operation and maintenance of offshore wind farms, especially from the following points: monitoring of "offshore wind power engineering & biological & environment", the monitoring of power equipment and the operation & maintenance of smart offshore wind farms. Finally, the future research challenges about monitoring, operation and maintenance of smart offshore wind farms are proposed, and the future research directions in this field are prospected.

Submitted 31 October, 2022; originally announced November 2022.

Comments: accepted by Sensors

MSC Class: 90B25; ACM Class: I.2

Journal ref: Sensors 2022, 22, 2822

arXiv:2209.15329 [pdf, other]

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

Abstract: How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.

Submitted 15 June, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: We have corrected the errors in the pre-training data for SpeechLM-P Base models, new results are updated

arXiv:2207.10600 [pdf, other]

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

Authors: Xun Gong, Zhikai Zhou, Yanmin Qian

Abstract: Modern non-autoregressive (NAR) speech recognition systems aim to accelerate the inference speed; however, they suffer from performance degradation compared with autoregressive (AR) models as well as the huge model size issue. We propose a novel knowledge transfer and distillation architecture that leverages knowledge from AR models to improve the NAR performance while reducing the model's size. Frame- and sequence-level objectives are well-designed for transfer learning. To further boost the performance of NAR, a beam search method on Mask-CTC is developed to enlarge the search space during the inference stage. Experiments show that the proposed NAR beam search relatively reduces CER by over 5% on AISHELL-1 benchmark with a tolerable real-time-factor (RTF) increment. By knowledge transfer, the NAR student who has the same size as the AR teacher obtains relative CER reductions of 8/16% on AISHELL-1 dev/test sets, and over 25% relative WER reductions on LibriSpeech test-clean/other sets. Moreover, the ~9x smaller NAR models achieve ~25% relative CER/WER reductions on both AISHELL-1 and LibriSpeech benchmarks with the proposed knowledge transfer and distillation.

Submitted 15 July, 2022; originally announced July 2022.

Comments: Accepted to Interspeech 2022

arXiv:2207.05204 [pdf]

An Online Data-Driven Method for Microgrid Secondary Voltage and Frequency Control with Ensemble Koopman Modeling

Authors: Xun Gong, Xiaozhe Wang, Geza Joos

Abstract: Low inertia, nonlinearity and a high level of uncertainty (varying topologies and operating conditions) pose challenges to microgrid (MG) systemwide operation. This paper proposes an online adaptive Koopman operator optimal control (AKOOC) method for MG secondary voltage and frequency control. Unlike typical data-driven methods that are data-hungry and lack guaranteed stability, the proposed AKOOC requires no warm-up training yet with guaranteed bounded-input-bounded-output (BIBO) stability and even asymptotical stability under some mild conditions. The proposed AKOOC is developed based on an ensemble Koopman state space modeling with full basis functions that combines both linear and nonlinear bases without the need of event detection or switching. An iterative learning method is also developed to exploit model parameters, ensuring the effectiveness and the adaptiveness of the designed control. Simulation studies in the 4-bus (with detailed inner-loop control) MG system and the 34-bus MG system showed improved modeling accuracy and control, verifying the effectiveness of the proposed method subject to various changes of operating conditions even with time delay, measurement noise, and missing measurements.

Submitted 11 July, 2022; originally announced July 2022.

Comments: Accepted by IEEE Transactions on Smart Grid for future publication

arXiv:2205.06913 [pdf, other]

doi 10.1140/epjs/s11734-022-00580-z

A rigorous multi-population multi-lane hybrid traffic model and its mean-field limit for dissipation of waves via autonomous vehicles

Authors: Nicolas Kardous, Amaury Hayat, Sean T. McQuade, Xiaoqian Gong, Sydney Truong, Tinhinane Mezair, Paige Arnold, Ryan Delorenzo, Alexandre Bayen, Benedetto Piccoli

Abstract: In this paper, a multi-lane multi-population microscopic model, which presents stop and go waves, is proposed to simulate traffic on a ring-road. Vehicles are divided between human-driven and autonomous vehicles (AV). Control strategies are designed with the ultimate goal of using a small number of AVs (less than 5% penetration rate) to represent Lagrangian control actuators that can smooth the multilane traffic flow and dissipate the stop-and-go waves. This in turn may reduce fuel consumption and emissions. The lane-changing mechanism is based on three components that we treat as parameters in the model: safety, incentive and cool-down time. The choice of these parameters in the lane-change mechanism is critical to modeling traffic accurately, because different parameter values can lead to drastically different traffic behaviors. In particular, the number of lane-changes and the speed variance are highly affected by the choice of parameters. Despite this modeling issue, when using sufficiently simple and robust controllers for AVs, the stabilization of uniform flow steady-state is effective for any realistic value of the parameters, and ultimately bypasses the observed modeling issue. Our approach is based on accurate and rigorous mathematical models, which allows a limit procedure that is termed, in gas dynamic terminology, mean-field. In simple words, from increasing the human-driven population to infinity, a system of coupled ordinary and partial differential equations is obtained. Moreover, control problems also pass to the limit, allowing the design to be tackled at different scales.

Submitted 13 May, 2022; originally announced May 2022.

Comments: 24p. 6 figures

MSC Class: 90B20; 93C15

arXiv:2204.09883 [pdf, other]

doi 10.21437/Interspeech.2021-1075

Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

Authors: Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian

Abstract: Accent variability has posed a huge challenge to automatic speech recognition (ASR) modeling. Although one-hot accent vector based adaptation systems are commonly used, they require prior knowledge about the target accent and cannot handle unseen accents. Furthermore, simply concatenating accent embeddings does not make good use of accent knowledge, which has limited improvements. In this work, we aim to tackle these problems with a novel layer-wise adaptation structure injected into the E2E ASR model encoder. The adapter layer encodes an arbitrary accent in the accent space and assists the ASR model in recognizing accented speech. Given an utterance, the adaptation structure extracts the corresponding accent information and transforms the input acoustic feature into an accent-related feature through the linear combination of all accent bases. We further explore the injection position of the adaptation layer, the number of accent bases, and different types of accent bases to achieve better accent adaptation. Experimental results show that the proposed adaptation structure brings 12% and 10% relative word error rate (WER) reduction on the AESRC2020 accent dataset and the Librispeech dataset, respectively, compared to the baseline.

Submitted 21 April, 2022; originally announced April 2022.

Comments: Accepted by Interspeech 2021

Journal ref: Proc. Interspeech 2021

arXiv:2201.04498 [pdf, other]

Towards Integrated Sensing and Communications for 6G

Authors: Qi Wang, Anastasios Kakkavas, Xitao Gong, Richard A. Stirling-Gallacher

Abstract: For the next generation of mobile communications systems, the integration of sensing and communications promises benefits in terms of spectrum utilization, cost, latency, area and weight. In this paper, we categorize and summarize the key features and technical considerations for different integration approaches and discuss related waveform design issues for a future 6G system. We provide results on new candidate waveforms for monostatic sensing and finally highlight important open issues and directions that deserve future in-depth research.

Submitted 12 January, 2022; originally announced January 2022.

Comments: Accepted for publication at the 2nd IEEE International Symposium on Joint Communications & Sensing

arXiv:2112.06091 [pdf]

Continuous Human Action Detection Based on Wearable Inertial Data

Authors: Xia Gong, Yan Lu, Haoran Wei

Abstract: Human action detection is a hot topic, which is widely used in video surveillance, human machine interface, healthcare monitoring, gaming, dancing training and musical instrument teaching. Because inertial sensors are low cost, portable, and not restricted to a particular operating space, they are suitable for detecting human action. In real-world applications, actions that are of interest appear among actions of non-interest without pauses in between. Recognizing and detecting actions of interest from continuous action streams is more challenging and useful for real applications. Based on inertial sensors and the C-MHAD smart TV gesture recognition dataset, this paper utilized different inertial sensor feature formats, then compared the performance of different deep neural network structures according to these feature formats. Experimental results show the best performance was achieved by an image-based inertial feature with a convolutional neural network, which obtained a 51.1% F1 score.

Submitted 11 December, 2021; originally announced December 2021.

arXiv:2108.08470 [pdf]

ChMusic: A Traditional Chinese Music Dataset for Evaluation of Instrument Recognition

Authors: Xia Gong, Yuxiang Zhu, Haidi Zhu, Haoran Wei

Abstract: Musical instrument recognition is a widely used application for music information retrieval. As most previous musical instrument recognition datasets focus on western musical instruments, it is difficult for researchers to study and evaluate the area of traditional Chinese musical instrument recognition. This paper proposes a traditional Chinese music dataset for model training and performance evaluation, named ChMusic. This dataset is free and publicly available; 11 traditional Chinese musical instruments and 55 traditional Chinese music excerpts are recorded in it. Then an evaluation standard is proposed based on the ChMusic dataset. With this standard, researchers can compare their results following the same rule, and results from different researchers will become comparable.

Submitted 11 December, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

arXiv:2104.11267 [pdf, other]

Integrated Framework of Vehicle Dynamics, Instabilities, Energy Models, and Sparse Flow Smoothing Controllers

Authors: Jonathan W. Lee, George Gunter, Rabie Ramadan, Sulaiman Almatrudi, Paige Arnold, John Aquino, William Barbour, Rahul Bhadani, Joy Carpio, Fang-Chieh Chou, Marsalis Gibson, Xiaoqian Gong, Amaury Hayat, Nour Khoudari, Abdul Rahman Kreidieh, Maya Kumar, Nathan Lichtlé, Sean McQuade, Brian Nguyen, Megan Ross, Sydney Truong, Eugene Vinitsky, Yibo Zhao, Jonathan Sprinkle, Benedetto Piccoli , et al. (3 additional authors not shown)

Abstract: This work presents an integrated framework of: vehicle dynamics models, with a particular attention to instabilities and traffic waves; vehicle energy models, with particular attention to accurate energy values for strongly unsteady driving profiles; and sparse Lagrangian controls via automated vehicles, with a focus on controls that can be executed via existing technology such as adaptive cruise control systems. This framework serves as a key building block in developing control strategies for human-in-the-loop traffic flow smoothing on real highways. In this contribution, we outline the fundamental merits of integrating vehicle dynamics and energy modeling into a single framework, and we demonstrate the energy impact of sparse flow smoothing controllers via simulation results.

Submitted 22 April, 2021; originally announced April 2021.

arXiv:2104.02583 [pdf, other]

Limitations and Improvements of the Intelligent Driver Model (IDM)

Authors: Saleh Albeaik, Alexandre Bayen, Maria Teresa Chiri, Xiaoqian Gong, Amaury Hayat, Nicolas Kardous, Alexander Keimer, Sean T. McQuade, Benedetto Piccoli, Yiling You

Abstract: This contribution analyzes the widely used and well-known "intelligent driver model" (briefly IDM), which is a second order car-following model governed by a system of ordinary differential equations. Although this model was intensively studied in recent years for properly capturing traffic phenomena and driver braking behavior, a rigorous study of the well-posedness has, to our knowledge, never been performed. First it is shown that, for a specific class of initial data, the vehicles' velocities become negative or even diverge to $-\infty$ in finite time, both undesirable properties for a car-following model. Various modifications of the IDM are then proposed in order to avoid such ill-posedness. The theoretical remediation of the model, rather than post facto by ad-hoc modification of code implementations, allows a more sound numerical implementation and preservation of the model features. Indeed, to avoid inconsistencies and ensure dynamics close to the one of the original model, one may need to inspect and clean large input data, which may result in practically impossible scenarios for large-scale simulations. Although well-posedness issues occur only for specific initial data, this may happen frequently when different traffic scenarios are analyzed, and especially in presence of lane-changing, on ramps and other network components as it is the case for most commonly used micro-simulators. On the other side, it is shown that well-posedness can be guaranteed by straight-forward improvements, such as those obtained by slightly changing the acceleration to prevent the velocity from becoming negative.

Submitted 1 April, 2022; v1 submitted 2 April, 2021; originally announced April 2021.

Comments: 28 pages, 20 Figures

MSC Class: 34A12; 34A38; 65L05; 65L08

arXiv:2011.04254 [pdf, ps, other]

Enhanced Few-shot Learning for Intrusion Detection in Railway Video Surveillance

Authors: Xiao Gong, Xi Chen, Wei Chen

Abstract: Video surveillance has been gaining popularity in recent years as an aid to railway intrusion detection. However, efficient and accurate intrusion detection remains a challenging issue due to: (a) limited sample number: only a small sample size (or portion) of intrusive video frames is available; (b) low inter-scene dissimilarity: various railway track area scenes are captured by cameras installed in different landforms; (c) high intra-scene similarity: the video frames captured by an individual camera share the same background. In this paper, an efficient few-shot learning solution is developed to address the above issues. In particular, an enhanced model-agnostic meta-learner is trained using both the original video frames and segmented masks of track area extracted from the video. Moreover, theoretical analysis and engineering solutions are provided to cope with the highly similar video frames in the meta-model training phase. The proposed method is tested on a realistic railway video dataset. Numerical results show that the enhanced meta-learner successfully adapts to unseen scenes with only a few newly collected video frame samples, and its intrusion detection accuracy outperforms that of the standard randomly initialized supervised learning.

Submitted 9 November, 2020; originally announced November 2020.

Comments: 11 pages, submitted

arXiv:2007.04390 [pdf, other]

Achievable Rates of Opportunistic Cognitive Radio Systems Using Reconfigurable Antennas with Imperfect Sensing and Channel Estimation

Authors: Hassan Yazdani, Azadeh Vosoughi, Xun Gong

Abstract: We consider an opportunistic cognitive radio (CR) system in which secondary transmitter (SUtx) is equipped with a reconfigurable antenna (RA). Utilizing the beam steering capability of the RA, we regard a design framework for integrated sector-based spectrum sensing and data communication. In this framework, SUtx senses the spectrum and detects the beam corresponding to active primary user's (PU) location. SUtx also sends training symbols (prior to data symbols), to enable channel estimation at secondary receiver (SUrx) and selection of the strongest beam between SUtx-SUrx for data transmission. We establish a lower bound on the achievable rates of SUtx-SUrx link, in the presence of spectrum sensing and channel estimation errors, and errors due to incorrect detection of the beam corresponding to PU's location and incorrect selection of the strongest beam for data transmission. We formulate a novel constrained optimization problem, aiming at maximizing the derived achievable rate lower bound subject to average transmit and interference power constraints. We optimize the durations of spatial spectrum sensing and channel training as well as data symbol transmission power. Our numerical results demonstrate that between optimizing spectrum sensing and channel training durations, the latter is more important for providing higher achievable rates.

Submitted 8 July, 2020; originally announced July 2020.

Comments: This paper has been submitted to IEEE Transactions on Cognitive Communications and Networking

arXiv:2005.03215 [pdf, other]

AutoSpeech: Neural Architecture Search for Speaker Recognition

Authors: Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, Zhangyang Wang

Abstract: Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be naturally fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first neural architecture search approach for the speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. The final speaker recognition model can be obtained by training the derived CNN model through the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the derived CNN architectures from the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.

Submitted 31 August, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

arXiv:2004.05804 [pdf, other]

Multi-modal Datasets for Super-resolution

Authors: Haoran Li, Weihong Quan, Meijun Yan, ** zhang, Xiaoli Gong, ** Zhou

Abstract: Nowadays, most datasets used to train and evaluate super-resolution models are single-modal simulation datasets. However, due to the variety of image degradation types in the real world, models trained on single-modal simulation datasets do not always have good robustness and generalization ability in different degradation scenarios. Previous work tended to focus only on true-color images. In contrast, we first propose a real-world black-and-white old photo dataset for super-resolution (OID-RW), which is constructed using two methods: manually filling pixels and shooting with different cameras. The dataset contains 82 groups of images, including 22 groups of character type and 60 groups of landscape and architecture. At the same time, we also propose a multi-modal degradation dataset (MDD400) to solve the super-resolution reconstruction in real-life image degradation scenarios. We managed to simulate the process of generating degraded images by the following four methods: interpolation algorithm, CNN network, GAN network and capturing videos with different bit rates. Our experiments demonstrate that not only do the models trained on our dataset have better generalization capability and robustness, but the trained images can also maintain better edge contours and texture features.

Submitted 13 April, 2020; originally announced April 2020.

arXiv:1912.03449 [pdf, other]

Fully Dense Neural Network for the Automatic Modulation Recognition

Authors: Miao Du, Qin Yu, Shaomin Fei, Chen Wang, Xiaofeng Gong, Ruisen Luo

Abstract: Nowadays, we mainly use various convolutional neural network (CNN) structures to extract features from radio data or spectrograms in AMR. Based on expert experience and spectrograms, they not only increase the difficulty of preprocessing, but also consume a lot of memory. In order to directly use in-phase and quadrature (IQ) data obtained by the receiver and enhance the efficiency of network feature extraction to improve the recognition rate of the modulation mode, this paper proposes a new network structure called Fully Dense Neural Network (FDNN). This network uses residual blocks to extract features, dense connections to reduce model size, and adds an attention mechanism to recalibrate. Experiments on RML2016.10a show that this network has a higher recognition rate and lower model complexity. They also show that the FDNN model with dense connections can not only extract features effectively but also greatly reduce model parameters, which provides a significant contribution to the application of deep learning to the intelligent radio system.

Submitted 7 December, 2019; originally announced December 2019.

arXiv:1910.07895 [pdf]

A New Three-stage Curriculum Learning Approach to Deep Network Based Liver Tumor Segmentation

Authors: Huiyu Li, Xiabi Liu, Said Boumaraf, Weihua Liu, Xiaopeng Gong, Xiaohong Ma

Abstract: Automatic segmentation of liver tumors in medical images is crucial for the computer-aided diagnosis and therapy. It is a challenging task, since the tumors are notoriously small against the background voxels. This paper proposes a new three-stage curriculum learning approach for training deep networks to tackle this small object segmentation problem. The learning in the first stage is performed on the whole input to obtain an initial deep network for tumor segmentation. Then the second stage of learning focuses the strengthening of tumor specific features by continuing training the network on the tumor patches. Finally, we retrain the network on the whole input in the third stage, in order that the tumor specific features and the global context can be integrated ideally under the segmentation objective. Benefitting from the proposed learning approach, we only need to employ one single network to segment the tumors directly. We evaluated our approach on the 2017 MICCAI Liver Tumor Segmentation challenge dataset. In the experiments, our approach exhibits significant improvement compared with the commonly used cascaded counterpart.

Submitted 17 October, 2019; originally announced October 2019.

Comments: 5 pages, 3 figures, 1 table, conference

arXiv:1910.03928 [pdf]

A New Deep Learning Method for Image Deblurring in Optical Microscopic Systems

Authors: Huangxuan Zhao, Ziwen Ke, Ningbo Chen, Ke Li, Lidai Wang, Xiaojing Gong, Wei Zheng, Liang Song, Zhicheng Liu, Dong Liang, Chengbo Liu

Abstract: Deconvolution is the most commonly used image processing method to remove the blur caused by the point-spread-function (PSF) in optical imaging systems. While this method has been successful in deblurring, it suffers from several disadvantages, including being slow, since it takes many iterations, and suboptimal in cases where the experimental operator chosen to represent the PSF is not optimal. In this paper, we propose a deep-learning-based deblurring method applicable to optical microscopic imaging systems. We tested the proposed method on database data, simulated data, and experimental data (including 2D optical microscopic data and 3D photoacoustic microscopic data), all of which showed much improved deblurred results compared to deconvolution. To quantify the improved performance, we compared our results against several deconvolution methods. Our results are better than conventional techniques and do not require multiple iterations or a pre-determined experimental operator. Our method has the advantages of simple operation, short time to compute, good deblur results and wide application in all types of optical microscopic imaging systems. The deep learning approach opens up a new path for deblurring and can be applied in various biomedical imaging fields.

Submitted 8 October, 2019; originally announced October 2019.

arXiv:1909.12472 [pdf]

A Radio Signal Modulation Recognition Algorithm Based on Residual Networks and Attention Mechanisms

Authors: Ruisen Luo, Tao Hu, Zuodong Tang, Chen Wang, Xiaofeng Gong, Haiyan Tu

Abstract: To solve the problem of inaccurate recognition of types of communication signal modulation, an RNN neural network recognition algorithm combining a residual block network with an attention mechanism is proposed. In this method, 10 kinds of communication signals with Gaussian white noise are generated from standard data sets, such as MASK, MPSK, MFSK, OFDM, 16QAM, AM and FM. Based on the original RNN neural network, a residual block network is added to solve the problem of gradient disappearance caused by deep network layers. An attention mechanism is added to the network to accelerate the gradient descent. In the experiment, 16QAM, 2FSK and 4FSK are used as actual samples, IQ data frames of signals are used as input, and the RNN neural network combined with the residual block network and attention mechanism is trained. The final recognition results show that the average recognition rate of real-time signals is over 93%. The network has high robustness and good use value.

Submitted 26 September, 2019; originally announced September 2019.

arXiv:1908.03835 [pdf, other]

AutoGAN: Neural Architecture Search for Generative Adversarial Networks

Authors: Xinyu Gong, Shiyu Chang, Yifan Jiang, Zhangyang Wang

Abstract: Neural architecture search (NAS) has witnessed prevailing success in image classification and (very recently) segmentation tasks. In this paper, we present the first preliminary study on introducing the NAS algorithm to generative adversarial networks (GANs), dubbed AutoGAN. The marriage of NAS and GANs faces its unique challenges. We define the search space for the generator architectural variations and use an RNN controller to guide the search, with parameter sharing and dynamic-resetting to accelerate the process. Inception score is adopted as the reward, and a multi-level search strategy is introduced to perform NAS in a progressive way. Experiments validate the effectiveness of AutoGAN on the task of unconditional image generation. Specifically, our discovered architectures achieve highly competitive performance compared to current state-of-the-art hand-crafted GANs, e.g., setting new state-of-the-art FID scores of 12.42 on CIFAR-10, and 31.01 on STL-10, respectively. We also conclude with a discussion of the current limitations and future potential of AutoGAN. The code is available at https://github.com/TAMU-VITA/AutoGAN

Submitted 10 August, 2019; originally announced August 2019.

Comments: accepted by ICCV 2019

arXiv:1907.04536 [pdf]

Multi-layer Attention Mechanism for Speech Keyword Recognition

Authors: Ruisen Luo, Tianran Sun, Chen Wang, Miao Du, Zuodong Tang, Kai Zhou, Xiaofeng Gong, Xiaomei Yang

Abstract: As an important part of speech recognition technology, automatic speech keyword recognition has been intensively studied in recent years. Such technology becomes especially pivotal in situations with limited infrastructure and computational resources, such as voice command recognition in vehicles and robot interaction. At present, the mainstream methods in automatic speech keyword recognition are based on long short-term memory (LSTM) networks with an attention mechanism. However, due to inevitable information losses in the LSTM layer during feature extraction, the calculated attention weights are biased. In this paper, a novel approach, namely the Multi-layer Attention Mechanism, is proposed to handle the inaccurate attention weights problem. The key idea is that, in addition to the conventional attention mechanism, information from layers prior to feature extraction and the LSTM is introduced into the attention weight calculations. Therefore, the attention weights are more accurate because the overall model can have more precise and focused areas. We conduct a comprehensive comparison and analysis of the keyword spotting performance of a convolutional neural network, a bi-directional LSTM recurrent neural network, and a recurrent neural network with the proposed attention mechanism on the Google Speech Commands V2 dataset. Experimental results are favorable for the proposed method and demonstrate its validity. The proposed multi-layer attention methods can be useful for other research related to object spotting.

Submitted 10 July, 2019; originally announced July 2019.
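
The abstract above builds on the conventional attention mechanism applied to LSTM outputs. As a reminder of what that baseline computes, here is a minimal sketch of dot-product attention pooling over a sequence of hidden states. The function names, the dot-product scoring, and the toy vectors are assumptions for illustration; the abstract does not specify the scoring function, and this sketch does not implement the proposed multi-layer variant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each hidden state by its dot product with a query vector.

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Usage: three time steps of a 2-dimensional hidden state (toy values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_weights, uniform_context = attention_pool(states, query=[0.0, 0.0])
focused_weights, focused_context = attention_pool(states, query=[10.0, 0.0])
```

A zero query scores every step equally, so the pooling reduces to a plain average; a query aligned with the first coordinate shifts almost all weight onto the steps where that coordinate is active.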

arXiv:1906.06972 [pdf, other]

EnlightenGAN: Deep Light Enhancement without Paired Supervision

Authors: Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, Zhangyang Wang

Abstract: Deep learning-based methods have achieved remarkable success in image restoration and enhancement, but are they still competitive when there is a lack of paired training data? As one such example, this paper explores the low-light image enhancement problem, where in practice it is extremely challenging to simultaneously take a low-light and a normal-light photo of the same visual scene. We propose a highly effective unsupervised generative adversarial network, dubbed EnlightenGAN, that can be trained without low/normal-light image pairs, yet proves to generalize very well on various real-world test images. Instead of supervising the learning using ground truth data, we propose to regularize the unpaired training using the information extracted from the input itself, and benchmark a series of innovations for the low-light image enhancement problem, including a global-local discriminator structure, a self-regularized perceptual loss fusion, and an attention mechanism. Through extensive experiments, our proposed approach outperforms recent methods under a variety of metrics in terms of visual quality and subjective user study. Thanks to the great flexibility brought by unpaired training, EnlightenGAN is demonstrated to be easily adaptable to enhancing real-world images from various domains. The code is available at https://github.com/yueruchen/EnlightenGAN

Submitted 24 January, 2021; v1 submitted 17 June, 2019; originally announced June 2019.

arXiv:1906.01177 [pdf, other]

Integrated Optimization of Power Split, Engine Thermal Management, and Cabin Heating for Hybrid Electric Vehicles

Authors: Xun Gong, Hao Wang, Mohammad Reza Amini, Ilya Kolmanovsky, Jing Sun

Abstract: Cabin heating demand and engine efficiency degradation in cold weather lead to a considerable increase in the fuel consumption of hybrid electric vehicles (HEVs), especially in congested traffic conditions. This paper presents an integrated power and thermal management (i-PTM) scheme for the optimization of power split, engine thermal management, and cabin heating of HEVs. A control-oriented model of a power split HEV, including power and thermal loops, is developed and experimentally validated against data collected from a 2017 Toyota Prius HEV. Based on this model, the dynamic programming (DP) technique is adopted to derive a benchmark for minimal fuel consumption, using 2-dimensional (power split and engine thermal management) and 3-dimensional (power split, engine thermal management, and cabin heating) formulations. Simulation results for a real-world congested driving cycle show that the engine thermal effect and the cabin heating requirement can significantly influence the optimal behavior of the power management, and that substantial fuel-saving potential can be achieved by the i-PTM optimization as compared to conventional power and thermal management strategies.

Submitted 3 June, 2019; originally announced June 2019.

Comments: 6 pages, 10 figures, 2 tables, The 3rd IEEE Conference on Control Technology and Applications (CCTA), August 19-21, 2019, Hong Kong, China

arXiv:1903.10482 [pdf, ps, other]

Beam Selection and Discrete Power Allocation in Opportunistic Cognitive Radio Systems with Limited Feedback Using ESPAR Antennas

Authors: Hassan Yazdani, Azadeh Vosoughi, Xun Gong

Abstract: We consider an opportunistic cognitive radio (CR) system consisting of a primary user (PU), secondary transmitter (SUtx), and secondary receiver (SUrx), where SUtx is equipped with an electrically steerable parasitic array radiator (ESPAR) antenna with beam steering capability for sensing and communication, and there is a limited feedback channel from SUrx to SUtx. Taking a holistic approach, we develop a framework for integrated sector-based spectrum sensing and sector-based data communication. Upon sensing the channel busy, SUtx determines the beam corresponding to PU's orientation. Upon sensing the channel idle, SUtx transmits data to SUrx, using the selected beam corresponding to the strongest channel between SUtx and SUrx. We formulate a constrained optimization problem, where SUtx-SUrx link ergodic capacity is maximized, subject to average transmit power and interference constraints, and the optimization variables are sensing duration, thresholds of channel quantizer at SUrx, and transmit power levels at SUtx. Since this problem is non-convex, we develop a suboptimal, computationally efficient iterative algorithm to find the solution. Our numerical results quantify the capacity improvement provided by the ESPAR antenna and demonstrate that our CR system yields lower outage and symbol error probabilities, compared with a CR system whose SUtx has an omnidirectional antenna.

Submitted 13 July, 2019; v1 submitted 25 March, 2019; originally announced March 2019.

Comments: This paper has been submitted to IEEE Transactions on Cognitive Communications and Networking

arXiv:1903.08561 [pdf, other]

Sequential Optimization of Speed, Thermal Load, and Power Split in Connected HEVs

Authors: Mohammad Reza Amini, Xun Gong, Yiheng Feng, Hao Wang, Ilya Kolmanovsky, Jing Sun

Abstract: The emergence of connected and automated vehicles (CAVs) provides an unprecedented opportunity to capitalize on these technologies well beyond their original designed intents. While abundant evidence has been accumulated showing substantial fuel economy improvement benefits achieved through advanced powertrain control, the implications of the CAV operation on power and thermal management have not been fully investigated. In this paper, in order to explore the opportunities for the coordination between the onboard thermal management and the power split control, we present a sequential optimization solution for eco-driving speed trajectory planning, air conditioning (A/C) thermal load planning (eco-cooling), and powertrain control in hybrid electric CAVs to evaluate the individual as well as the collective energy savings through proactive usage of traffic data for vehicle speed prediction. Simulation results over a real-world driving cycle show that compared to a baseline non-CAV, 11.9%, 14.2%, and 18.8% energy savings can be accumulated sequentially through speed, thermal load, and power split optimizations, respectively.

Submitted 20 March, 2019; originally announced March 2019.

Comments: 2019 Annual American Control Conference (ACC), July 10-12, 2019, Philadelphia, PA, USA, 7 pages, 11 figures

arXiv:1812.02871 [pdf, other]

A Low-rank Tensor Dictionary Learning Method for Multi-spectral Images Denoising

Authors: Xiao Gong, Wei Chen

Abstract: As a 3-order tensor, a multi-spectral image (MSI) has dozens of spectral bands, which can deliver more information for real scenes. However, real MSIs are often corrupted by noise in the sensing process, which will further deteriorate the performance of higher-level classification and recognition tasks. In this paper, we propose a Low-rank Tensor Dictionary Learning (LTDL) method for MSI denoising. Firstly, we extract blocks from the MSI and cluster them into groups. Then, instead of using the exactly low-rank model, we consider a nearly low-rank approximation, which is closer to the latent low-rank structure of the clean groups of real MSIs. In addition, we propose to learn a spatial dictionary and a spectral dictionary, which contain the spatial features and spectral features respectively of the whole MSI and are shared among different groups. Hence the LTDL method utilizes both the latent low-rank prior of each group and the correlation of different groups via the shared dictionaries. Experiments on synthetic data validate the effectiveness of dictionary learning by the LTDL. Experiments on real MSIs demonstrate the superior denoising performance of the proposed method in comparison to state-of-the-art methods.

Submitted 6 December, 2018; originally announced December 2018.

arXiv:1806.00589 [pdf, ps, other]

Efficient Entropy for Policy Gradient with Multidimensional Action Space

Authors: Yiming Zhang, Quan Ho Vuong, Kenny Song, Xiao-Yue Gong, Keith W. Ross

Abstract: In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces such as in the Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. This paper considers the entropy bonus, which is used to encourage exploration in policy gradient. In the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. Finally, we test our algorithms on two environments: a multi-hunter multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results show that our entropy estimators substantially improve performance with marginal additional computational cost.

Submitted 2 June, 2018; originally announced June 2018.
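
The abstract above notes that exact entropy needs an enumeration of every joint action, which grows exponentially with the number of action dimensions. The sketch below contrasts that enumeration with the plain sample-mean estimate of -log pi(a), which is unbiased for the entropy. It assumes a policy that factorizes into independent categorical distributions per dimension, and the names `exact_entropy` and `sampled_entropy` are hypothetical. This is a textbook Monte Carlo estimator used for illustration, not a reproduction of the estimators the paper develops.

```python
import itertools
import math
import random


def exact_entropy(marginals):
    """Entropy by enumerating every joint action (exponential in dimensions)."""
    entropy = 0.0
    index_ranges = [range(len(m)) for m in marginals]
    for joint in itertools.product(*index_ranges):
        prob = math.prod(m[a] for m, a in zip(marginals, joint))
        if prob > 0.0:
            entropy -= prob * math.log(prob)
    return entropy


def sampled_entropy(marginals, n_samples, rng):
    """Unbiased Monte Carlo estimate: average of -log pi(a) over sampled actions."""
    total = 0.0
    for _ in range(n_samples):
        log_prob = 0.0
        for m in marginals:
            action = rng.choices(range(len(m)), weights=m)[0]
            log_prob += math.log(m[action])
        total -= log_prob
    return total / n_samples


# Usage: a 3-dimensional action with 2, 2, and 3 choices per dimension.
policy = [[0.5, 0.5], [0.25, 0.75], [0.1, 0.2, 0.7]]
exact = exact_entropy(policy)
estimate = sampled_entropy(policy, n_samples=20000, rng=random.Random(0))
```

The enumeration visits the product of all per-dimension choice counts, while the sampled version costs one pass per sample regardless of how many joint actions exist.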

arXiv:1806.00377 [pdf, ps, other]

Evaluation of the Energy Efficiency in a Mixed Traffic with Automated Vehicles and Human Controlled Vehicles

Authors: Xun Gong, Yaohui Guo, Yiheng Feng, Jing Sun, Ding Zhao

Abstract: The energy efficiency of Connected and Automated Vehicles (CAVs) is significantly influenced by surrounding road users. This paper presents an evaluation of the energy efficiency of CAVs in mixed traffic interacting with human-controlled vehicles. To simulate the interaction between the CAVs and the cut-in vehicles controlled by human drivers near the intersection, a lane changing model is proposed to emulate the politeness and patience characteristics of the human driver. The proposed lane changing model is then calibrated based on over 100,000 naturalistic lane changing events collected by the University of Michigan Safety Pilot Model Deployment Program. A case study simulating the cut-in scenario is carried out to demonstrate the human driver's lane changing sensitivity under different driving trajectories of a frontal CAV, and the influence of the cut-in vehicle on the energy consumption of the CAV is evaluated. The simulation results indicate that the fuel economy of the CAV can be substantially improved if its surrounding cut-in vehicles can be well handled.

Submitted 1 June, 2018; originally announced June 2018.

arXiv:1803.02099 [pdf]

A Hybrid Method for Traffic Flow Forecasting Using Multimodal Deep Learning

Authors: Shengdong Du, Tianrui Li, Xun Gong, Shi-Jinn Horng

Abstract: Traffic flow forecasting has been regarded as a key problem of intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting, which can jointly and adaptively learn the spatial-temporal correlation features and long temporal interdependence of multi-modality traffic data by an attention auxiliary multimodal deep learning architecture. Given the highly nonlinear characteristics of multi-modality traffic data, the base module of our method consists of one-dimensional Convolutional Neural Networks (1D CNN) and Gated Recurrent Units (GRU) with the attention mechanism. The former captures the local trend features and the latter captures the long temporal dependencies. Then, we design a hybrid multimodal deep learning framework (HMDLF) for fusing shared representation features of different modality traffic data by multiple CNN-GRU-Attention modules. The experimental results indicate that the proposed multimodal deep learning model is capable of dealing with complex nonlinear urban traffic flow forecasting with satisfying accuracy and effectiveness.

Submitted 19 March, 2019; v1 submitted 6 March, 2018; originally announced March 2018.

Showing 1–43 of 43 results for author: Gong, X