Search | arXiv e-print repository

Text-Queried Target Sound Event Localization

Authors: Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang

Abstract: Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the text to describe the sound event, and the SEL model can predict the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire the interest and additional research for text-queried sound source localization.

Submitted 23 June, 2024; originally announced June 2024.

Comments: Accepted by EUSIPCO 2024

arXiv:2406.11401 [pdf, other]

An Exploration of Length Generalization in Transformer-Based Speech Enhancement

Authors: Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li

Abstract: The use of Transformer architectures has facilitated remarkable progress in speech enhancement. Training Transformers using substantially long speech utterances is often infeasible as self-attention suffers from quadratic complexity. It is a critical and unexplored challenge for a Transformer-based speech enhancement model to learn from short speech utterances and generalize to longer ones. In this paper, we conduct comprehensive experiments to explore the length generalization problem in speech enhancement with Transformer. Our findings first establish that position embedding provides an effective instrument to alleviate the impact of utterance length on Transformer-based speech enhancement. Specifically, we explore four different position embedding schemes to enable length generalization. The results confirm the superiority of relative position embeddings (RPEs) over absolute position embeddings (APEs) in length generalization.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2405.12609 [pdf, other]

Mamba in Speech: Towards an Alternative to Self-Attention

Authors: Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps

Abstract: Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results exhibit the superiority of bidirectional Mamba (BiMamba) for speech processing to vanilla Mamba. Moreover, experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in Transformer and its derivatives, particularly for the semantic-aware task. The crucial technologies for transferring Mamba to speech are then summarized in ablation studies and the discussion section to offer insights for future research.

Submitted 30 June, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.01104 [pdf, other]

Multi-user ISAC through Stacked Intelligent Metasurfaces: New Algorithms and Experiments

Authors: Ziqing Wang, Hongzheng Liu, Jianan Zhang, Rujing Xiong, Kai Wan, Xuewen Qian, Marco Di Renzo, Robert Caiming Qiu

Abstract: This paper investigates a Stacked Intelligent Metasurfaces (SIM)-assisted Integrated Sensing and Communications (ISAC) system. An extended target model is considered, where the base station (BS) aims to estimate the complete target response matrix relative to the SIM. Under the constraints of minimum Signal-to-Interference-plus-Noise Ratio (SINR) for the communication users (CUs) and maximum transmit power, we jointly optimize the transmit beamforming at the BS and the end-to-end transmission matrix of the SIM, to minimize the Cramér-Rao Bound (CRB) for target estimation. Effective algorithms such as the alternating optimization (AO) and semidefinite relaxation (SDR) are employed to solve the non-convex SINR-constrained CRB minimization problem. Finally, we design and build an experimental platform for SIM, and evaluate the performance of the proposed algorithms for communication and sensing tasks.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.18501 [pdf, other]

Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Authors: Ruijie Tao, Xinyuan Qian, Yidi Jiang, Junjie Li, Jiadong Wang, Haizhou Li

Abstract: Audio-visual target speaker extraction (AV-TSE) aims to extract the specific person's speech from the audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring the variations of the noise characteristics. That may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel reverse selective auditory attention mechanism, which can suppress interference speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct abundant experiments by re-implementing three popular AV-TSE methods as the baselines and involving nine metrics for evaluation. The experimental results show that our proposed SEANet achieves state-of-the-art results and performs well for all five datasets. We will release the codes, the models and data logs.

Submitted 8 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.13153 [pdf, other]

Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring

Authors: Chengxu Liu, Xuan Wang, Xiangyu Xu, Ruhao Tian, Shuai Li, Xueming Qian, Ming-Hsuan Yang

Abstract: Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper, we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular, we use a motion estimation network to capture motion information from neighborhoods, thereby adaptively estimating spatially-variant motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction, and then collaboratively filters the aligned image through the predicted kernels, weights, and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore, we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at https://github.com/ChengxuLiu/MISCFilter

Submitted 19 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.00861 [pdf, other]

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Authors: Ruijie Tao, Xinyuan Qian, Rohan Kumar Das, Xiaoxue Gao, Jiadong Wang, Haizhou Li

Abstract: Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges from real-world AV-ASD scenarios. Due to the presence of low-quality noisy videos in such cases, AV-ASD systems without a selective listening ability fall short of effectively filtering out disruptive voice components from mixed audio inputs. In this paper, we propose a Multi-modal Speaker Extraction-to-Detection framework named "MuSED", which is pre-trained with audio-visual target speaker extraction to learn the denoising ability and then fine-tuned with the AV-ASD task. Meanwhile, to better capture the multi-modal information and deal with real-world problems such as missing modality, MuSED is modelled on the time domain directly and integrates the multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms the state-of-the-art AV-ASD methods and achieves 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset, respectively. We will publicly release the code in due course.

Submitted 31 March, 2024; originally announced April 2024.

Comments: 10 pages

arXiv:2403.10012 [pdf, other]

Real-World Computational Aberration Correction via Quantized Domain-Mixing Representation

Authors: Qi Jiang, Zhonghua Yi, Shaohua Gao, Yao Gao, Xiaolong Qian, Hao Shi, Lei Sun, Zhijie Xu, Kailun Yang, Kaiwei Wang

Abstract: Relying on paired synthetic data, existing learning-based Computational Aberration Correction (CAC) methods are confronted with the intricate and multifaceted synthetic-to-real domain gap, which leads to suboptimal performance in real-world applications. In this paper, in contrast to improving the simulation pipeline, we deliver a novel insight into real-world CAC from the perspective of Unsupervised Domain Adaptation (UDA). By incorporating readily accessible unpaired real-world data into training, we formalize the Domain Adaptive CAC (DACAC) task, and then introduce a comprehensive Real-world aberrated images (Realab) dataset to benchmark it. The setup task presents a formidable challenge due to the intricacy of understanding the target aberration domain. To this end, we propose a novel Quantized Domain-Mixing Representation (QDMR) framework as a potent solution to the issue. QDMR adapts the CAC model to the target domain from three key aspects: (1) reconstructing aberrated images of both domains by a VQGAN to learn a Domain-Mixing Codebook (DMC) which characterizes the degradation-aware priors; (2) modulating the deep features in CAC model with DMC to transfer the target domain knowledge; and (3) leveraging the trained VQGAN to generate pseudo target aberrated images from the source ones for convincing target domain supervision. Extensive experiments on both synthetic and real-world benchmarks reveal that the models with QDMR consistently surpass the competitive methods in mitigating the synthetic-to-real gap, which produces visually pleasant real-world CAC results with fewer artifacts. Codes and datasets will be made publicly available.

Submitted 15 March, 2024; originally announced March 2024.

Comments: Codes and datasets will be made publicly available at https://github.com/zju-jiangqi/QDMR

arXiv:2311.12070 [pdf, other]

FDDM: Unsupervised Medical Image Translation with a Frequency-Decoupled Diffusion Model

Authors: Yunxiang Li, Hua-Chieh Shao, Xiaoxue Qian, You Zhang

Abstract: Diffusion models have demonstrated significant potential in producing high-quality images in medical image translation to aid disease diagnosis, localization, and treatment. Nevertheless, current diffusion models have limited success in achieving faithful image translations that can accurately preserve the anatomical structures of medical images, especially for unpaired datasets. The preservation of structural and anatomical details is essential to reliable medical diagnosis and treatment planning, as structural mismatches can lead to disease misidentification and treatment errors. In this study, we introduce the Frequency Decoupled Diffusion Model (FDDM) for MR-to-CT conversion. FDDM first obtains the anatomical information of the CT image from the MR image through an initial conversion module. This anatomical information then guides a subsequent diffusion model to generate high-quality CT images. Our diffusion model uses a dual-path reverse diffusion process for low-frequency and high-frequency information, achieving a better balance between image quality and anatomical accuracy. We extensively evaluated FDDM using public datasets for brain MR-to-CT and pelvis MR-to-CT translations, demonstrating its superior performance to other GAN-based, VAE-based, and diffusion-based models. The evaluation metrics included Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). FDDM achieved the best scores on all metrics for both datasets, particularly excelling in FID, with scores of 25.9 for brain data and 29.2 for pelvis data, significantly outperforming other methods. These results demonstrate that FDDM can generate high-quality target domain images while maintaining the accuracy of translated anatomical structures.

Submitted 26 June, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2310.14778 [pdf, other]

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

Authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang

Abstract: Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic value and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we conduct a comprehensive overview of audio-visual speaker tracking. To our knowledge, this is the first extensive survey over the past five years. We introduce the family of Bayesian filters and summarize the methods for obtaining audio-visual measurements. In addition, the existing trackers and their performance on the AV16.3 dataset are summarized. In the past few years, deep learning techniques have thrived, which has also boosted the development of audio-visual speaker tracking. The influence of deep learning techniques in terms of measurement extraction and state estimation is also discussed. Finally, we discuss the connections between audio-visual speaker tracking and other areas such as speech separation and distributed speaker tracking.

Submitted 17 December, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.10497 [pdf, other]

LocSelect: Target Speaker Localization with an Auditory Selective Hearing Mechanism

Authors: Yu Chen, Xinyuan Qian, Zexu Pan, Kainan Chen, Haizhou Li

Abstract: The prevailing noise-resistant and reverberation-resistant localization algorithms primarily emphasize separating and providing directional output for each speaker in multi-speaker scenarios, without association with the identity of speakers. In this paper, we present a target speaker localization algorithm with a selective hearing mechanism. Given a reference speech of the target speaker, we first produce a speaker-dependent spectrogram mask to eliminate interfering speakers' speech. Subsequently, a Long short-term memory (LSTM) network is employed to extract the target speaker's location from the filtered spectrogram. Experiments validate the superiority of our proposed method over the existing algorithms for different scale invariant signal-to-noise ratios (SNR) conditions. Specifically, at SNR = -10 dB, our proposed network LocSelect achieves a mean absolute error (MAE) of 3.55 and an accuracy (ACC) of 87.40%.

Submitted 17 October, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.16308 [pdf, other]

Audio Visual Speaker Localization from EgoCentric Views

Authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Wenwu Wang

Abstract: The use of audio and visual modality for speaker localization has been well studied in the literature by exploiting their complementary characteristics. However, most previous works employ the setting of static sensors mounted at fixed positions. Unlike them, in this work, we explore the ego-centric setting, where the heterogeneous sensors are embodied and could be moving with a human to facilitate speaker localization. Compared to the static scenario, the ego-centric setting is more realistic for smart-home applications e.g., a service robot. However, this also brings new challenges such as blurred images, frequent speaker disappearance from the field of view of the wearer, and occlusions. In this paper, we study egocentric audio-visual speaker DOA estimation and deal with the challenges mentioned above. Specifically, we propose a transformer-based audio-visual fusion method to estimate the relative DOA of the speaker to the wearer, and design a training strategy to mitigate the problem of the speaker disappearing from the camera's view. We also develop a new dataset for simulating the out-of-view scenarios, by creating a scene with a camera wearer walking around while a speaker is moving at the same time. The experimental results show that our proposed method offers promising performance in this new dataset in terms of tracking accuracy. Finally, we adapt the proposed method for the multi-speaker scenario. Experiments on EasyCom show the effectiveness of the proposed model for multiple speakers in real scenarios, which achieves state-of-the-art results in the sphere active speaker detection task and the wearer activity prediction task. The simulated dataset and related code are available at https://github.com/KawhiZhao/Egocentric-Audio-Visual-Speaker-Localization.

Submitted 28 September, 2023; originally announced September 2023.

arXiv:2306.09480 [pdf, ps, other]

Optimization of RIS-Aided MIMO -- A Mutually Coupled Loaded Wire Dipole Model

Authors: H. El Hassani, X. Qian, S. Jeong, N. S. Perović, M. Di Renzo, P. Mursia, V. Sciancalepore, X. Costa-Pérez

Abstract: We consider a reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) system in the presence of scattering objects. The MIMO transmitter and receiver, the RIS, and the scattering objects are modeled as mutually coupled thin wires connected to load impedances. We introduce a novel numerical algorithm for optimizing the tunable loads connected to the RIS, which does not utilize the Neumann series approximation. The algorithm is provably convergent, has polynomial complexity with the number of RIS elements, and outperforms the most relevant benchmark algorithms while requiring fewer iterations and converging in a shorter time.

Submitted 18 September, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.04915 [pdf, ps, other]

Sensing-based Beamforming Design for Joint Performance Enhancement of RIS-Aided ISAC Systems

Authors: Xiaowei Qian, Xiaoling Hu, Chenxi Liu, Mugen Peng, Caijun Zhong

Abstract: Reconfigurable intelligent surface (RIS) has shown its great potential in facilitating device-based integrated sensing and communication (ISAC), where sensing and communication tasks are mostly conducted on different time-frequency resources. However, the more challenging scenarios of simultaneous sensing and communication (SSC) have so far drawn little attention. In this paper, we propose a novel RIS-aided ISAC framework where the inherent location information in the received communication signals from a blind-zone user equipment is exploited to enable SSC. We first design a two-phase ISAC transmission protocol. In the first phase, communication and coarse-grained location sensing are performed concurrently by exploiting the very limited channel state information, while in the second phase, by using the coarse-grained sensing information obtained from the first phase, simple-yet-efficient sensing-based beamforming designs are proposed to realize both higher-rate communication and fine-grained location sensing. We demonstrate that our proposed framework can achieve almost the same performance as the communication-only frameworks, while providing up to millimeter-level positioning accuracy. In addition, we show how the communication and sensing performance can be simultaneously boosted through our proposed sensing-based beamforming designs. The results presented in this work provide valuable insights into the design and implementation of other ISAC systems considering SSC.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2305.16342 [pdf, other]

InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition

Authors: Zhi-Hao Lai, Tian-Hao Zhang, Qi Liu, Xinyuan Qian, Li-Fang Wei, Song-Lu Chen, Feng Chen, Xu-Cheng Yin

Abstract: The local and global features are both essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further promote ASR performance. However, these methods pay less attention to the interaction of local and global features, and their series architectures are rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. Besides, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of our proposed InterFormer and its superior performance over the other Transformer and Conformer models.

Submitted 29 May, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.14049 [pdf, other]

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

Authors: Tian-Hao Zhang, Hai-Bo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng Chen, Xinyuan Qian, Xu-Cheng Yin

Abstract: Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant Semi-ASCD is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as encoders, respectively. The experimental results show that ASCD significantly improves the performance by leveraging both the acoustic and semantic information cooperatively.

Submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.08541 [pdf, other]

Ripple sparse self-attention for monaural speech enhancement

Authors: Qiquan Zhang, Hongxu Zhu, Qi Song, Xinyuan Qian, Zhaoheng Ni, Haizhou Li

Abstract: The use of Transformer represents a recent success in speech enhancement. However, as its core component, self-attention suffers from quadratic complexity, which is computationally prohibitive for long speech recordings. Moreover, it allows each time frame to attend to all time frames, neglecting the strong local correlations of speech signals. This study presents a simple yet effective sparse self-attention for speech enhancement, called ripple attention, which simultaneously performs fine- and coarse-grained modeling for local and global dependencies, respectively. Specifically, we employ local band attention to enable each frame to attend to its closest neighbor frames in a window at fine granularity, while employing dilated attention outside the window to model the global dependencies at a coarse granularity. We evaluate the efficacy of our ripple attention for speech enhancement on two commonly used training objectives. Extensive experimental results consistently confirm the superior performance of the ripple attention design over standard full self-attention, blockwise attention, and dual-path attention (Sep-Former) in terms of speech quality and intelligibility.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 5 pages, ICASSP 2023 published

arXiv:2303.17480 [pdf, other]

Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert

Authors: Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T. Tan, Haizhou Li

Abstract: Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input. The previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing the incorrect generation results. Moreover, to compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. With a lip-reading expert, we propose a novel contrastive learning to enhance lip-speech synchronization, and a transformer to encode audio synchronically with video, while considering global temporal dependency of audio. For evaluation, we propose a new strategy with two different lip-reading experts to measure intelligibility of the generated videos. Rigorous experiments show that our proposal is superior to other State-of-the-art (SOTA) methods, such as Wav2Lip, in reading intelligibility i.e., over 38% Word Error Rate (WER) on LRS2 dataset and 27.8% accuracy on LRW dataset. We also achieve the SOTA performance in lip-speech synchronization and comparable performances in visual quality.

Submitted 29 March, 2023; originally announced March 2023.

Comments: accepted by CVPR 2023

arXiv:2303.03093 [pdf, other]

A Miniaturised Camera-based Multi-Modal Tactile Sensor

Authors: Kaspar Althoefer, Yonggen Ling, Wanlin Li, Xinyuan Qian, Wang Wei Lee, Peng Qi

Abstract: In conjunction with huge recent progress in camera and computer vision technology, camera-based sensors have increasingly shown considerable promise in relation to tactile sensing. In comparison to competing technologies (be they resistive, capacitive or magnetic based), they offer super-high-resolution, while suffering from fewer wiring problems. The human tactile system is composed of various types of mechanoreceptors, each able to perceive and process distinct information such as force, pressure, texture, etc. Camera-based tactile sensors such as GelSight mainly focus on high-resolution geometric sensing on a flat surface, and their force measurement capabilities are limited by the hysteresis and non-linearity of the silicone material. In this paper, we present a miniaturised dome-shaped camera-based tactile sensor that allows accurate force and tactile sensing in a single coherent system. The key novelty of the sensor design is as follows. First, we demonstrate how to build a smooth silicone hemispheric sensing medium with uniform markers on its curved surface. Second, we enhance the illumination of the rounded silicone with diffused LEDs. Third, we construct a force-sensitive mechanical structure in a compact form factor with usage of springs to accurately perceive forces. Our multi-modal sensor is able to acquire tactile information from multi-axis forces, local force distribution, and contact geometry, all in real-time. We apply an end-to-end deep learning method to process all the information.

Submitted 6 March, 2023; originally announced March 2023.

arXiv:2302.01972 [pdf, other]

doi 10.1109/TITS.2023.3287792

DCA: Delayed Charging Attack on the Electric Shared Mobility System

Authors: Shuocheng Guo, Hanlin Chen, Mizanur Rahman, Xinwu Qian

Abstract: An efficient operation of the electric shared mobility system (ESMS) relies heavily on seamless interconnections among shared electric vehicles (SEV), electric vehicle supply equipment (EVSE), and the grid. Nevertheless, this interconnectivity also makes the ESMS vulnerable to cyberattacks that may cause short-term breakdowns or long-term degradation of the ESMS. This study focuses on one such attack with long-lasting effects, the Delayed Charge Attack (DCA), that stealthily delays the charging service by exploiting the physical and communication vulnerabilities. To begin, we present the ESMS threat model by highlighting the assets, information flow, and access points. We next identify a linked sequence of vulnerabilities as a viable attack vector for launching DCA. Then, we detail the implementation of DCA, which can effectively bypass the detection in the SEV's battery management system and the cross-verification in the cloud environment. We test the DCA model against various Anomaly Detection (AD) algorithms by simulating the DCA dynamics in a Susceptible-Infectious-Removed-Susceptible process, where the EVSE can be compromised by the DCA or detected for repair. Using real-world taxi trip data and EVSE locations in New York City, the DCA model allows us to explore the long-term impacts and validate the system consequences. The results show that a 10-min delay results in 12-min longer queuing times and 8% more unfulfilled requests, leading to a 10.7% ($311.7) weekly revenue loss per driver. With the AD algorithms, the weekly revenue loss remains at least 3.8% ($111.8) with increased repair costs of $36,000, suggesting the DCA's robustness against the AD.

Submitted 13 June, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Journal ref: IEEE Transactions on Intelligent Transportation Systems, 2023

arXiv:2301.07968 [pdf, other]

On the Degrees of Freedom of RIS-Aided Holographic MIMO Systems

Authors: Juan Carlos Ruiz-Sicilia, Xuewen Qian, Marco Di Renzo, Vincenzo Sciancalepore, Merouane Debbah, Xavier Costa-Perez

Abstract: In this paper, we study surface-based communication systems based on different levels of channel state information for system optimization. We analyze the system performance in terms of rate and degrees of freedom (DoF). We show that the deployment of a reconfigurable intelligent surface (RIS) results in increasing the number of DoF, by extending the near-field region. Over Rician fading channels, we show that an RIS can be efficiently optimized only based on the positions of the transmitting and receiving surfaces, while providing good performance if the Rician fading factor is not too small.

Submitted 27 January, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

arXiv:2212.00661 [pdf, other]

Hybrid Gate-Pulse Model for Variational Quantum Algorithms

Authors: Zhiding Liang, Zhixin Song, Jinglei Cheng, Zichang He, Ji Liu, Hanrui Wang, Ruiyang Qin, Yiru Wang, Song Han, Xuehai Qian, Yiyu Shi

Abstract: Current quantum programs are mostly synthesized and compiled on the gate-level, where quantum circuits are composed of quantum gates. The gate-level workflow, however, introduces significant redundancy when quantum gates are eventually transformed into control signals and applied on quantum devices. For superconducting quantum computers, the control signals are microwave pulses. Therefore, pulse-level optimization has gained more attention from researchers due to their advantages in terms of circuit duration. Recent works, however, are limited by their poor scalability brought by the large parameter space of control signals. In addition, the lack of gate-level "knowledge" also affects the performance of pure pulse-level frameworks. We present a hybrid gate-pulse model that can mitigate these problems. We propose to use gate-level compilation and optimization for "fixed" part of the quantum circuits and to use pulse-level methods for problem-agnostic parts. Experimental results demonstrate the efficiency of the proposed framework in discrete optimization tasks. We achieve a performance boost at most 8% with 60% shorter pulse duration in the problem-agnostic layer.

Submitted 1 December, 2022; originally announced December 2022.

Comments: 8 pages, 6 figures

arXiv:2209.15415 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747062

DynImp: Dynamic Imputation for Wearable Sensing Data Through Sensory and Temporal Relatedness

Authors: Zepeng Huo, Taowei Ji, Yifei Liang, Shuai Huang, Zhangyang Wang, Xiaoning Qian, Bobak Mortazavi

Abstract: In wearable sensing applications, data is inevitable to be irregularly sampled or partially missing, which pose challenges for any downstream application. An unique aspect of wearable data is that it is time-series data and each channel can be correlated to another one, such as x, y, z axis of accelerometer. We argue that traditional methods have rarely made use of both times-series dynamics of the data as well as the relatedness of the features from different sensors. We propose a model, termed as DynImp, to handle different time point's missingness with nearest neighbors along feature axis and then feeding the data into a LSTM-based denoising autoencoder which can reconstruct missingness along the time axis. We experiment the model on the extreme missingness scenario (>50% missing rate) which has not been widely tested in wearable data. Our experiments on activity recognition show that the method can exploit the multi-modality features from related sensors and also learn from history time-series dynamics to reconstruct the data under extreme missingness.

Submitted 26 September, 2022; originally announced September 2022.

Comments: 5 pages, 2 figures, accepted in ICASSP'2022

arXiv:2209.01768 [pdf, other]

Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

Authors: Jiadong Wang, Xinyuan Qian, Haizhou Li

Abstract: Audio and visual signals complement each other in human speech perception, so do they in speech recognition. The visual hint is less evident than the acoustic hint, but more robust in a complex acoustic environment, as far as speech perception is concerned. It remains a challenge how we effectively exploit the interaction between audio and visual signals for automatic speech recognition. There have been studies to exploit visual signals as redundant or complementary information to audio input in a synchronous manner. Human studies suggest that visual signal primes the listener in advance as to when and on which frequency to attend to. We propose a Predict-and-Update Network (P&U net), to simulate such a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR). In particular, we first predict the character posteriors of the spoken words, i.e. the visual embedding, based on the visual signals. The audio signal is then conditioned on the visual embedding via a novel cross-modal Conformer, that updates the character posteriors. We validate the effectiveness of the visual cueing mechanism through extensive experiments. The proposed P&U net outperforms the state-of-the-art AVSR methods on both LRS2-BBC and LRS3-BBC datasets, with the relative reduced Word Error Rate (WER)s exceeding 10% and 40% under clean and noisy conditions, respectively.

Submitted 5 September, 2022; originally announced September 2022.

arXiv:2209.01749 [pdf, other]

4D LUT: Learnable Context-Aware 4D Lookup Table for Image Enhancement

Authors: Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Abstract: Image enhancement aims at improving the aesthetic visual quality of photos by retouching the color and tone, and is an essential technology for professional digital photography. Recent years deep learning-based image enhancement algorithms have achieved promising performance and attracted increasing popularity. However, typical efforts attempt to construct a uniform enhancer for all pixels' color transformation. It ignores the pixel differences between different content (e.g., sky, ocean, etc.) that are significant for photographs, causing unsatisfactory results. In this paper, we propose a novel learnable context-aware 4-dimensional lookup table (4D LUT), which achieves content-dependent enhancement of different contents in each image via adaptively learning of photo context. In particular, we first introduce a lightweight context encoder and a parameter encoder to learn a context map for the pixel-level category and a group of image-adaptive coefficients, respectively. Then, the context-aware 4D LUT is generated by integrating multiple basis 4D LUTs via the coefficients. Finally, the enhanced image can be obtained by feeding the source image and context map into fused context-aware 4D LUT via quadrilinear interpolation. Compared with traditional 3D LUT, i.e., RGB mapping to RGB, which is usually used in camera imaging pipeline systems or tools, 4D LUT, i.e., RGBC(RGB+Context) mapping to RGB, enables finer control of color transformations for pixels with different content in each image, even though they have the same RGB values. Experimental results demonstrate that our method outperforms other state-of-the-art methods in widely-used benchmarks.

Submitted 5 September, 2022; originally announced September 2022.

arXiv:2206.12273 [pdf, other]

Iterative Sound Source Localization for Unknown Number of Sources

Authors: Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang

Abstract: Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these threshold-based algorithms are not stable since they are limited by the careful choice of threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which can iteratively extract each source's DOA without threshold until the termination criterion is met. Unlike threshold-based algorithms, ISSL designs an active source detector network based on binary classifier to accept residual spatial spectrum and decide whether to stop the iteration. By doing so, our ISSL can deal with an arbitrary number of sources, even more than the number of sources seen during the training stage. The experimental results show that our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.

Submitted 24 June, 2022; originally announced June 2022.

Comments: Accepted by Interspeech 2022

arXiv:2204.10513 [pdf]

MIPR:Automatic Annotation of Medical Images with Pixel Rearrangement

Authors: **** Dai, Haiming Zhu, Shuang Ge, Ruihan Zhang, Xiang Qian, Xi Li, Kehong Yuan

Abstract: Most of the state-of-the-art semantic segmentation reported in recent years is based on fully supervised deep learning in the medical domain. However, the high-quality annotated datasets require intense labor and domain knowledge, consuming enormous time and cost. Previous works that adopt semi-supervised and unsupervised learning are proposed to address the lack of annotated data through assisted training with unlabeled data and achieve good performance. Still, these methods can not directly get the image annotation as doctors do. In this paper, inspired by self-training of semi-supervised learning, we propose a novel approach to solve the lack of annotated data from another angle, called medical image pixel rearrangement (short in MIPR). The MIPR combines image-editing and pseudo-label technology to obtain labeled data. As the number of iterations increases, the edited image is similar to the original image, and the labeled result is similar to the doctor annotation. Therefore, the MIPR is to get labeled pairs of data directly from amounts of unlabeled data with pixel rearrangement, which is implemented with a designed conditional Generative Adversarial Networks and a segmentation network. Experiments on the ISIC18 show that the effect of the data annotated by our method for segmentation task is equal to or even better than that of doctors' annotations.

Submitted 22 April, 2022; originally announced April 2022.

arXiv:2204.04216 [pdf, other]

Learning Trajectory-Aware Transformer for Video Super-Resolution

Authors: Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Abstract: Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, there are grand challenges to effectively utilize temporal dependency in entire video sequences. Existing approaches usually align and aggregate video frames from limited adjacent frames (e.g., 5 or 7 frames), which prevents these approaches from satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR). In particular, we formulate video frames into several pre-aligned trajectories which consist of continuous visual tokens. For a query token, self-attention is only learned on relevant visual tokens along spatio-temporal trajectories. Compared with vanilla vision Transformers, such a design significantly reduces the computational cost and enables Transformers to model long-range features. We further propose a cross-scale feature tokenization module to overcome scale-changing problems that often occur in long-range videos. Experimental results demonstrate the superiority of the proposed TTVSR over state-of-the-art models, by extensive quantitative and qualitative evaluations in four widely-used video super-resolution benchmarks. Both code and pre-trained models can be downloaded at https://github.com/researchmm/TTVSR.

Submitted 20 April, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: CVPR 2022 Oral

arXiv:2203.16840 [pdf, other]

doi 10.1109/LSP.2022.3175130

Speaker Extraction with Co-Speech Gestures Cue

Authors: Zexu Pan, Xinyuan Qian, Haizhou Li

Abstract: Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker mixture speech. There have been studies to use a pre-recorded speech sample or face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of co-speech gestures sequence, e.g. hand and body movements, as the speaker cue for speaker extraction, which could be easily obtained from low-resolution video recordings, thus more available than face recordings. We propose two networks using the co-speech gestures cue to perform attentive listening on the target speaker, one that implicitly fuses the co-speech gestures cue in the speaker extraction process, the other performs speech separation first, followed by explicitly using the co-speech gestures cue to associate a separated speech to the target speaker. The experimental results show that the co-speech gestures cue is informative in associating with the target speaker.

Submitted 10 May, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted by IEEE Signal Processing Letters

arXiv:2110.02265 [pdf, other]

Adaptive Group Testing with Mismatched Models

Authors: Mingzhou Fan, Byung-Jun Yoon, Francis J. Alexander, Edward R. Dougherty, Xiaoning Qian

Abstract: Accurate detection of infected individuals is one of the critical steps in stopping any pandemic. When the underlying infection rate of the disease is low, testing people in groups, instead of testing each individual in the population, can be more efficient. In this work, we consider noisy adaptive group testing design with specific test sensitivity and specificity that select the optimal group given previous test results based on pre-selected utility function. As in prior studies on group testing, we model this problem as a sequential Bayesian Optimal Experimental Design (BOED) to adaptively design the groups for each test. We analyze the required number of group tests when using the updated posterior on the infection status and the corresponding Mutual Information (MI) as our utility function for selecting new groups. More importantly, we study how the potential bias on the ground-truth noise of group tests may affect the group testing sample complexity.

Submitted 5 October, 2021; originally announced October 2021.

Comments: full length version for ICASSP

arXiv:2109.03181 [pdf, other]

doi 10.1109/BigData52589.2021.9671769

IEEE BigData 2021 Cup: Soft Sensing at Scale

Authors: Sergei Petrov, Chao Zhang, Jaswanth Yella, Yu Huang, Xiaoye Qian, Sthitie Bom

Abstract: IEEE BigData 2021 Cup: Soft Sensing at Scale is a data mining competition organized by Seagate Technology, in association with the IEEE BigData 2021 conference. The scope of this challenge is to tackle the task of classifying soft sensing data with machine learning techniques. In this paper we go into the details of the challenge and describe the data set provided to participants. We define the metrics of interest, baseline models, and describe approaches we found meaningful which may be a good starting point for further analysis. We discuss the results obtained with our approaches and give insights on what potential challenges participants may run into. Students, researchers, and anyone interested in working on a major industrial problem are welcome to participate in the challenge!

Submitted 7 September, 2021; originally announced September 2021.

Comments: 4 pages, 4 figures, for IEEE Big Data Cup challenge 2021

MSC Class: 68T01

arXiv:2108.02539 [pdf, other]

SLoClas: A Database for Joint Sound Localization and Classification

Authors: Xinyuan Qian, Bidisha Sharma, Amine El Abridi, Haizhou Li

Abstract: In this work, we present the development of a new database, namely Sound Localization and Classification (SLoClas) corpus, for studying and analyzing sound localization and classification. The corpus contains a total of 23.27 hours of data recorded using a 4-channel microphone array. 10 classes of sounds are played over a loudspeaker at 1.5 meters distance from the array by varying the Direction-of-Arrival (DoA) from 1 degree to 360 degree at an interval of 5 degree. To facilitate the study of noise robustness, 6 types of outdoor noise are recorded at 4 DoAs, using the same devices. Moreover, we propose a baseline method, namely Sound Localization and Classification Network (SLCnet) and present the experimental results and analysis conducted on the collected SLoClas database. We achieve the accuracy of 95.21% and 80.01% for sound localization and classification, respectively. We publicly release this database and the source code for research purpose.

Submitted 5 August, 2021; originally announced August 2021.

Comments: Submitted to O-COCOSDA 2021

arXiv:2107.06592 [pdf, other]

doi 10.1145/3474085.3475587

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Authors: Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li

Abstract: Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. The successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike the prior work where systems make decision instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decision by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.

Submitted 25 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

Comments: ACM Multimedia 2021

arXiv:2105.06107 [pdf, other]

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Authors: Xinyuan Qian, Maulik Madhavi, Zexu Pan, Jiadong Wang, Haizhou Li

Abstract: Most of the prior studies in the spatial direction-of-arrival (DoA) domain focus on a single modality. However, humans use auditory and visual senses to detect the presence of sound sources. With this motivation, we propose to use neural networks with audio and visual signals for multi-speaker localization. The use of heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these issues by introducing an adaptive weighting mechanism for audio-visual fusion. We also propose a novel video simulation method that generates visual features from noisy target 3D annotations that are synchronized with acoustic features. Experimental results confirm that audio-visual fusion consistently improves the performance of speaker DoA estimation, while the adaptive weighting mechanism shows clear benefits.

Submitted 13 May, 2021; originally announced May 2021.

Comments: ICASSP 2021 accepted

arXiv:2103.04235 [pdf]

Graph-based Pyramid Global Context Reasoning with a Saliency-aware Projection for COVID-19 Lung Infections Segmentation

Authors: Huimin Huang, Ming Cai, Lanfen Lin, Jing Zheng, Xiongwei Mao, Xiaohan Qian, Zhiyi Peng, Jianying Zhou, Yutaro Iwamoto, Xian-Hua Han, Yen-Wei Chen, Ruofeng Tong

Abstract: Coronavirus Disease 2019 (COVID-19) spread rapidly in 2020, giving rise to a large number of studies on lung infection segmentation from CT images. Though many methods have been proposed for this issue, it is a challenging task because infections of various sizes appear in different lobe zones. To tackle these issues, we propose a Graph-based Pyramid Global Context Reasoning (Graph-PGCR) module, which is capable of modeling long-range dependencies among disjoint infections as well as adapting to size variation. We first incorporate graph convolution to exploit long-term contextual information from multiple lobe zones. Different from previous average pooling or maximum object probability, we propose a saliency-aware projection mechanism to pick up infection-related pixels as a set of graph nodes. After graph reasoning, the relation-aware features are projected back to the original coordinate space for the downstream tasks. We further construct multiple graphs with different sampling rates to handle the size variation problem. In this way, distinct multi-scale long-range contextual patterns can be captured. Our Graph-PGCR module is plug-and-play and can be integrated into any architecture to improve its performance. Experiments demonstrate that the proposed method consistently boosts the performance of state-of-the-art backbone architectures on both public and our private COVID-19 datasets.

Submitted 6 March, 2021; originally announced March 2021.

arXiv:2010.11630 [pdf, other]

doi 10.1109/DLS51937.2020.00012

DeepGalaxy: Deducing the Properties of Galaxy Mergers from Images Using Deep Neural Networks

Authors: Maxwell X. Cai, Jeroen Bédorf, Vikram A. Saletore, Valeriu Codreanu, Damian Podareanu, Adel Chaibi, Penny X. Qian

Abstract: Galaxy mergers, the dynamical process during which two galaxies collide, are among the most spectacular phenomena in the Universe. During this process, the two colliding galaxies are tidally disrupted, producing significant visual features that evolve as a function of time. These visual features contain valuable clues for deducing the physical properties of the galaxy mergers. In this work, we propose DeepGalaxy, a visual analysis framework trained to predict the physical properties of galaxy mergers based on their morphology. Based on an encoder-decoder architecture, DeepGalaxy encodes the input images to a compressed latent space $z$, and determines the similarity of images according to the latent-space distance. DeepGalaxy consists of a fully convolutional autoencoder (FCAE) which generates activation maps at its 3D latent-space, and a variational autoencoder (VAE) which compresses the activation maps into a 1D vector, and a classifier that generates labels from the activation maps. The backbone of the FCAE can be fully customized according to the complexity of the images. DeepGalaxy demonstrates excellent scaling performance on parallel machines. On the Endeavour supercomputer, the scaling efficiency exceeds 0.93 when trained on 128 workers, and it maintains above 0.73 when trained with 512 workers. Without having to carry out expensive numerical simulations, DeepGalaxy makes inferences of the physical properties of galaxy mergers directly from images, and thereby achieves a speedup factor of $\sim 10^5$.

Submitted 22 October, 2020; originally announced October 2020.

Comments: 7 pages, 7 figures. Accepted for publication at the 2020 IEEE/ACM Fifth Workshop on Deep Learning on Supercomputers (DLS)

arXiv:2010.04653 [pdf, other]

doi 10.1109/ACCESS.2021.3085486

Quantifying the multi-objective cost of uncertainty

Authors: Byung-Jun Yoon, Xiaoning Qian, Edward R. Dougherty

Abstract: Various real-world applications involve modeling complex systems with immense uncertainty and optimizing multiple objectives based on the uncertain model. Quantifying the impact of the model uncertainty on the given operational objectives is critical for designing optimal experiments that can most effectively reduce the uncertainty that affects the objectives pertinent to the application at hand. In this paper, we propose the concept of mean multi-objective cost of uncertainty (multi-objective MOCU) that can be used for objective-based quantification of uncertainty for complex uncertain systems considering multiple operational objectives. We provide several illustrative examples that demonstrate the concept and strengths of the proposed multi-objective MOCU. Furthermore, we present a real-world example based on the mammalian cell cycle network to demonstrate how the multi-objective MOCU can be used for quantifying the operational impact of model uncertainty when there are multiple, possibly competing, objectives.

Submitted 30 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

arXiv:2010.03201 [pdf, other]

M3Lung-Sys: A Deep Learning System for Multi-Class Lung Pneumonia Screening from CT Imaging

Authors: Xuelin Qian, Huazhu Fu, Weiya Shi, Tao Chen, Yanwei Fu, Fei Shan, Xiangyang Xue

Abstract: To counter the outbreak of COVID-19, the accurate diagnosis of suspected cases plays a crucial role in timely quarantine, medical treatment, and preventing the spread of the pandemic. Considering the limited training cases and resources (e.g., time and budget), we propose a Multi-task Multi-slice Deep Learning System (M3Lung-Sys) for multi-class lung pneumonia screening from CT imaging, which consists of only two 2D CNN networks, i.e., slice- and patient-level classification networks. The former aims to seek the feature representations from abundant CT slices instead of limited CT volumes, and for the overall pneumonia screening, the latter can recover the temporal information by feature refinement and aggregation between different slices. In addition to distinguishing COVID-19 from Healthy, H1N1, and CAP cases, our M3Lung-Sys is also able to locate the areas of relevant lesions, without any pixel-level annotation. To further demonstrate the effectiveness of our model, we conduct extensive experiments on a chest CT imaging dataset with a total of 734 patients (251 healthy people, 245 COVID-19 patients, 105 H1N1 patients, and 133 CAP patients). The quantitative results with plenty of metrics indicate the superiority of our proposed model on both slice- and patient-level classification tasks. More importantly, the generated lesion location maps make our system interpretable and more valuable to clinicians.

Submitted 7 October, 2020; originally announced October 2020.

Comments: IEEE Journal of Biomedical and Health Informatics (JBHI), 2020

arXiv:2009.03184 [pdf]

A New Screening Method for COVID-19 based on Ocular Feature Recognition by Machine Learning Tools

Authors: Yanwei Fu, Feng Li, Wenxuan Wang, Haicheng Tang, Xuelin Qian, Mengwei Gu, Xiangyang Xue

Abstract: The Coronavirus disease 2019 (COVID-19) has affected several million people. With the outbreak of the epidemic, many researchers are devoting themselves to COVID-19 screening systems. The standard practices for rapid risk screening of COVID-19 are CT imaging and RT-PCR (real-time polymerase chain reaction). However, these methods demand professional effort for the acquisition of CT images and saliva samples, a certain amount of waiting time, and, most importantly, a prohibitive examination fee in some countries. Recently, some studies have shown that COVID-19 patients usually present ocular manifestations consistent with conjunctivitis, including conjunctival hyperemia, chemosis, epiphora, or increased secretions. After more than four months of study, we found that the confirmed cases of COVID-19 present consistent ocular pathological symbols, and we propose a new screening method that analyzes eye-region images, captured by common CCD and CMOS cameras, and could reliably perform rapid risk screening of COVID-19 with very high accuracy. We believe a system implementing such an algorithm should assist triage management or clinical diagnosis. To further evaluate our algorithm, and with the approval of the Ethics Committee of Shanghai public health clinic center of Fudan University, we conduct a study analyzing the eye-region images of 303 patients (104 COVID-19, 131 pulmonary, and 68 ocular patients), as well as 136 healthy people. Remarkably, our results for COVID-19 patients in the testing set consistently present similar ocular pathological symbols, and very high testing results have been achieved in terms of sensitivity and specificity. We hope this study can be inspiring and helpful for encouraging more research on this topic.

Submitted 3 September, 2020; originally announced September 2020.

Comments: technical report

arXiv:2008.08278 [pdf, other]

DONet: Dual Objective Networks for Skin Lesion Segmentation

Authors: Yaxiong Wang, Yunchao Wei, Xueming Qian, Li Zhu, Yi Yang

Abstract: Skin lesion segmentation is a crucial step in the computer-aided diagnosis of dermoscopic images. In the last few years, deep learning based semantic segmentation methods have significantly advanced the skin lesion segmentation results. However, the current performance is still unsatisfactory due to some challenging factors such as the large variety of lesion scales and the ambiguous difference between lesion region and background. In this paper, we propose a simple yet effective framework, named Dual Objective Networks (DONet), to improve skin lesion segmentation. Our DONet adopts two symmetric decoders to produce different predictions for approaching different objectives. Concretely, the two objectives are defined by different loss functions. In this way, the two decoders are encouraged to produce differentiated probability maps to match different optimization targets, resulting in complementary predictions. The complementary information learned by these two objectives is further aggregated to make the final prediction, by which the uncertainty existing in segmentation maps can be significantly alleviated. Besides, to address the challenge of the large variety of lesion scales and shapes in dermoscopic images, we additionally propose a recurrent context encoding module (RCEM) to model the complex correlation among skin lesions, where the features with different scale contexts are efficiently integrated to form a more robust representation. Extensive experiments on two popular benchmarks demonstrate the effectiveness of the proposed DONet. In particular, our DONet achieves 0.881 and 0.931 dice score on ISIC 2018 and $\text{PH}^2$, respectively. Code will be made publicly available.

Submitted 19 August, 2020; originally announced August 2020.

Comments: 10 pages

arXiv:1909.08029 [pdf, other]

Heterogeneity-Aware Asynchronous Decentralized Training

Authors: Qinyi Luo, Jiaao He, Youwei Zhuo, Xuehai Qian

Abstract: Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data parallel algorithms due to its high performance in homogeneous environment. However, its performance is bounded by the slowest worker among all workers, and is significantly slower in heterogeneous situations. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds - designing a distributed training method that has both high performance as All-Reduce in homogeneous environment and good heterogeneity tolerance as AD-PSGD? In this paper, we propose Ripples, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve the above goal with intensive synchronization optimization, emphasizing the interplay between algorithm and system implementation. To reduce synchronization cost, we propose a novel communication primitive Partial All-Reduce that allows a large group of workers to synchronize quickly. To reduce synchronization conflict, we propose static group scheduling in homogeneous environment and simple techniques (Group Buffer and Group Division) to avoid conflicts with slightly reduced randomness. Our experiments show that in homogeneous environment, Ripples is 1.1 times faster than the state-of-the-art implementation of All-Reduce, 5.1 times faster than Parameter Server and 4.3 times faster than AD-PSGD. In a heterogeneous setting, Ripples shows 2 times speedup over All-Reduce, and still obtains 3 times speedup over the Parameter Server baseline.

Submitted 17 September, 2019; originally announced September 2019.

arXiv:1908.08747 [pdf, other]

Reconfigurable Intelligent Surfaces vs. Relaying: Differences, Similarities, and Performance Comparison

Authors: M. Di Renzo, K. Ntontin, J. Song, F. H. Danufane, X. Qian, F. Lazarakis, J. de Rosny, D. -T. Phan-Huy, O. Simeone, R. Zhang, M. Debbah, G. Lerosey, M. Fink, S. Tretyakov, S. Shamai

Abstract: Reconfigurable intelligent surfaces (RISs) have the potential of realizing the emerging concept of smart radio environments by leveraging the unique properties of meta-surfaces. In this article, we discuss the potential applications of RISs in wireless networks that operate at high-frequency bands, e.g., millimeter wave (30-100 GHz) and sub-millimeter wave (greater than 100 GHz) frequencies. When used in wireless networks, RISs may operate in a manner similar to relays. This paper elaborates on the key differences and similarities between RISs that are configured to operate as anomalous reflectors and relays. In particular, we illustrate numerical results that highlight the spectral efficiency gains of RISs when their size is sufficiently large as compared with the wavelength of the radio waves. In addition, we discuss key open issues that need to be addressed for unlocking the potential benefits of RISs.

Submitted 21 February, 2020; v1 submitted 23 August, 2019; originally announced August 2019.

Comments: Submitted for journal publication (revised version)

arXiv:1907.09077 [pdf, other]

doi 10.1145/3307650.3322270

A Stochastic-Computing based Deep Learning Framework using Adiabatic Quantum-Flux-Parametron Superconducting Technology

Authors: Ruizhe Cai, Ao Ren, Olivia Chen, Ning Liu, Caiwen Ding, Xuehai Qian, Jie Han, Wenhui Luo, Nobuyuki Yoshikawa, Yanzhi Wang

Abstract: The Adiabatic Quantum-Flux-Parametron (AQFP) superconducting technology has been recently developed, which achieves the highest energy efficiency among superconducting logic families, with a potentially huge gain compared with state-of-the-art CMOS. In 2016, the successful fabrication and testing of AQFP-based circuits with the scale of 83,000 JJs demonstrated the scalability and potential of implementing large-scale systems using AQFP. As a result, AQFP is promising for high-performance computing and deep space applications, with Deep Neural Network (DNN) inference acceleration as an important example. Besides ultra-high energy efficiency, AQFP exhibits two unique characteristics: the first is the deep pipelining nature, since each AQFP logic gate is connected with an AC clock signal, which increases the difficulty of avoiding RAW hazards; the second is the unique opportunity of true random number generation (RNG) using a single AQFP buffer, far more efficient than RNG in CMOS. We point out that these two characteristics make AQFP especially compatible with the stochastic computing (SC) technique, which uses a time-independent bit sequence for value representation, and is compatible with the deep pipelining nature. Further, the application of SC has been investigated in DNNs in prior work, and the suitability has been illustrated as SC is more compatible with approximate computations. This work is the first to develop an SC-based DNN acceleration framework using AQFP technology.

Submitted 21 July, 2019; originally announced July 2019.

arXiv:1902.06085 [pdf]

doi 10.1002/mp.14003

DC-AL GAN: Pseudoprogression and True Tumor Progression of Glioblastoma Multiform Image Classification Based on DCGAN and AlexNet

Authors: Meiyu Li, Hailiang Tang, Michael D. Chan, Xiaobo Zhou, Xiaohua Qian

Abstract: Pseudoprogression (PsP) occurs in 20-30% of patients with glioblastoma multiforme (GBM) after receiving the standard treatment. In the course of post-treatment magnetic resonance imaging (MRI), PsP exhibits similarities in shape and intensity to the true tumor progression (TTP) of GBM. So, these similarities pose challenges on the differentiation of these types of progression and hence the selection of the appropriate clinical treatment strategy. In this paper, we introduce DC-AL GAN, a novel feature learning method based on deep convolutional generative adversarial network (DCGAN) and AlexNet, to discriminate between PsP and TTP in MRI images. Due to the adversarial relationship between the generator and the discriminator of DCGAN, high-level discriminative features of PsP and TTP can be derived for the discriminator with AlexNet. Also, a feature fusion scheme is used to combine higher-layer features with lower-layer information, leading to more powerful features that are used for effectively discriminating between PsP and TTP. The experimental results show that DC-AL GAN achieves desirable PsP and TTP classification performance that is superior to other state-of-the-art methods.

Submitted 18 May, 2019; v1 submitted 16 February, 2019; originally announced February 2019.

arXiv:1902.01064 [pdf, other]

doi 10.1145/3297858.3304009

Hop: Heterogeneity-Aware Decentralized Training

Authors: Qinyi Luo, Jinkun Lin, Youwei Zhuo, Xuehai Qian

Abstract: Recent work has shown that decentralized algorithms can deliver superior performance over centralized ones in the context of machine learning. The two approaches, with the main difference residing in their distinct communication patterns, are both susceptible to performance degradation in heterogeneous environments. Although vigorous efforts have been devoted to supporting centralized algorithms against heterogeneity, little has been explored in decentralized algorithms regarding this problem. This paper proposes Hop, the first heterogeneity-aware decentralized training protocol. Based on a unique characteristic of decentralized training that we have identified, the iteration gap, we propose a queue-based synchronization mechanism that can efficiently implement backup workers and bounded staleness in the decentralized setting. To cope with deterministic slowdown, we propose skipping iterations so that the effect of slower workers is further mitigated. We build a prototype implementation of Hop on TensorFlow. The experiment results on CNN and SVM show significant speedup over standard decentralized training in heterogeneous settings.

Submitted 7 February, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

arXiv:1901.08983 [pdf, other]

LOCATA challenge: speaker localization with a planar array

Authors: Xinyuan Qian, Andrea Cavallaro, Alessio Brutti, Maurizio Omologo

Abstract: This document describes our submission to the 2018 LOCalization And TrAcking (LOCATA) challenge (Tasks 1, 3, 5). We estimate the 3D position of a speaker using the Global Coherence Field (GCF) computed from multiple microphone pairs of a DICIT planar array. One of the main challenges when using such an array with omnidirectional microphones is the front-back ambiguity, which is particularly evident in Task 5. We address this challenge by post-processing the peaks of the GCF and exploiting the attenuation introduced by the frame of the array. Moreover, the intermittent nature of speech and the changing orientation of the speaker make localization difficult. For Tasks 3 and 5, we also employ a Particle Filter (PF) that favors the spatio-temporal continuity of the localization results.

Submitted 25 January, 2019; originally announced January 2019.

Comments: In Proceedings of the LOCATA Challenge Workshop - a satellite event of IWAENC 2018 (arXiv:1811.08482)

Report number: LOCATAchallenge/2018/05

arXiv:1812.07106 [pdf, other]

E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs

Authors: Zhe Li, Caiwen Ding, Siyue Wang, Wujie Wen, Youwei Zhuo, Chang Liu, Qinru Qiu, Wenyao Xu, Xue Lin, Xuehai Qian, Yanzhi Wang

Abstract: Recurrent Neural Networks (RNNs) are becoming increasingly important for time series-related applications which require efficient and real-time implementations. The two major types are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. It is a challenging task to have real-time, efficient, and accurate hardware RNN implementations because of the high sensitivity to imprecision accumulation and the requirement of special activation function implementations. A key limitation of the prior works is the lack of a systematic design optimization framework of RNN model and hardware implementations, especially when the block size (or compression ratio) should be jointly optimized with RNN type, layer size, etc. In this paper, we adopt the block-circulant matrix-based framework, and present the Efficient RNN (E-RNN) framework for FPGA implementations of the Automatic Speech Recognition (ASR) application. The overall goal is to improve performance/energy efficiency under accuracy requirement. We use the alternating direction method of multipliers (ADMM) technique for more accurate block-circulant training, and present two design explorations providing guidance on block size and reducing RNN training trials. Based on the two observations, we decompose E-RNN in two phases: Phase I on determining RNN model to reduce computation and storage subject to accuracy requirement, and Phase II on hardware implementations given RNN model, including processing element design/optimization, quantization, activation implementation, etc. Experimental results on actual FPGA deployments show that E-RNN achieves a maximum energy efficiency improvement of 37.4$\times$ compared with ESE, and more than 2$\times$ compared with C-LSTM, under the same accuracy.

Submitted 12 December, 2018; originally announced December 2018.

Comments: In The 25th International Symposium on High-Performance Computer Architecture (HPCA 2019)

arXiv:1808.01672 [pdf, other]

Model-Aided Wireless Artificial Intelligence: Embedding Expert Knowledge in Deep Neural Networks Towards Wireless Systems Optimization

Authors: Alessio Zappone, Marco Di Renzo, Mérouane Debbah, Thanh Tu Lam, Xuewen Qian

Abstract: Deep learning based on artificial neural networks is a powerful machine learning method that, in the last few years, has been successfully used to realize tasks, e.g., image classification, speech recognition, translation of languages, etc., that are usually simple to execute by human beings but extremely difficult to perform by machines. This is one of the reasons why deep learning is considered to be one of the main enablers to realize the notion of artificial intelligence. In order to identify the best architecture of an artificial neural network that allows one to fit input-output data pairs, the current methodology in deep learning methods consists of employing a data-driven approach. Once the artificial neural network is trained, it is capable of responding to never-observed inputs by providing the optimum output based on past acquired knowledge. In this context, a recent trend in the deep learning community is to complement pure data-driven approaches with prior information based on expert knowledge. In this work, we describe two methods that implement this strategy, which aim at optimizing wireless communication networks. In addition, we illustrate numerical results in order to assess the performance of the proposed approaches compared with pure data-driven implementations.

Submitted 15 June, 2019; v1 submitted 5 August, 2018; originally announced August 2018.

Comments: Accepted for publication in the IEEE Vehicular Technology Magazine

arXiv:1805.01143 [pdf, ps, other]

Experimental Design via Generalized Mean Objective Cost of Uncertainty

Authors: Shahin Boluki, Xiaoning Qian, Edward R. Dougherty

Abstract: The mean objective cost of uncertainty (MOCU) quantifies the performance cost of using an operator that is optimal across an uncertainty class of systems as opposed to using an operator that is optimal for a particular system. MOCU-based experimental design selects an experiment to maximally reduce MOCU, thereby gaining the greatest reduction of uncertainty impacting the operational objective. The original formulation applied to finding optimal system operators, where optimality is with respect to a cost function, such as mean-square error; and the prior distribution governing the uncertainty class relates directly to the underlying physical system. Here we provide a generalized MOCU and the corresponding experimental design. We then demonstrate how this new formulation includes as special cases MOCU-based experimental design methods developed for materials science and genomic networks when there is experimental error. Most importantly, we show that the classical Knowledge Gradient and Efficient Global Optimization experimental design procedures are actually implementations of MOCU-based experimental design under their modeling assumptions.

Submitted 3 May, 2018; originally announced May 2018.

arXiv:1407.5813

Priority-based coordination of autonomous and legacy vehicles at intersection

Authors: Xiangjun Qian, Jean Gregoire, Fabien Moutarde, Arnaud De La Fortelle

Abstract: Recently, researchers have proposed various autonomous intersection management techniques that enable autonomous vehicles to cross the intersection without traffic lights or stop signs. In particular, a priority-based coordination system with provable collision-free and deadlock-free features has been presented. In this paper, we extend the priority-based approach to support legacy vehicles without compromising the above-mentioned features. We make the hypothesis that legacy vehicles are able to keep a safe distance from their leading vehicles. Then we explore some special configurations of the system that ensure the safe crossing of legacy vehicles. We implement the extended system in SUMO, a realistic traffic simulator. Simulations are performed to demonstrate the safety of the system.

Submitted 26 September, 2014; v1 submitted 22 July, 2014; originally announced July 2014.

Comments: put in other preprint server

Showing 1–50 of 52 results for author: Qian, X