-
Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression
Authors:
Li Wan,
Tansu Alpcan,
Margreta Kuijper,
Emanuele Viterbo
Abstract:
We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm initially employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. Subsequently, dictionaries are refined considering label data, opti…
▽ More
We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm initially employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. Subsequently, dictionaries are refined considering label data, optimizing dictionary atoms to enhance discriminative power based on mutual information and class distribution. This process generates discriminative numerical representations, facilitating the training of simple classifiers such as SVMs and neural networks. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance. Tested on six benchmark text datasets, our algorithm competes closely with top models, especially in limited-vocabulary contexts, using significantly fewer parameters. \review{Our algorithm closely matches top-performing models, deviating by only ~2\% on limited-vocabulary datasets, using just 10\% of their parameters. However, it falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data. This contrast highlights its efficiency and limitations across different dataset types.
△ Less
Submitted 28 April, 2024;
originally announced May 2024.
-
FADI-AEC: Fast Score Based Diffusion Model Guided by Far-end Signal for Acoustic Echo Cancellation
Authors:
Yang Liu,
Li Wan,
Yun Li,
Yiteng Huang,
Ming Sun,
James Luan,
Yangyang Shi,
Xin Lei
Abstract:
Despite the potential of diffusion models in speech enhancement, their deployment in Acoustic Echo Cancellation (AEC) has been restricted. In this paper, we propose DI-AEC, pioneering a diffusion-based stochastic regeneration approach dedicated to AEC. Further, we propose FADI-AEC, fast score-based diffusion AEC framework to save computational demands, making it favorable for edge devices. It stan…
▽ More
Despite the potential of diffusion models in speech enhancement, their deployment in Acoustic Echo Cancellation (AEC) has been restricted. In this paper, we propose DI-AEC, pioneering a diffusion-based stochastic regeneration approach dedicated to AEC. Further, we propose FADI-AEC, fast score-based diffusion AEC framework to save computational demands, making it favorable for edge devices. It stands out by running the score model once per frame, achieving a significant surge in processing efficiency. Apart from that, we introduce a novel noise generation technique where far-end signals are utilized, incorporating both far-end and near-end signals to refine the score model's accuracy. We test our proposed method on the ICASSP2023 Microsoft deep echo cancellation challenge evaluation dataset, where our method outperforms some of the end-to-end methods and other diffusion based echo cancellation methods.
△ Less
Submitted 8 January, 2024;
originally announced January 2024.
-
Directional Source Separation for Robust Speech Recognition on Smart Glasses
Authors:
Tiantian Feng,
Ju Lin,
Yiteng Huang,
Weipeng He,
Kaustubh Kalgaonkar,
Niko Moritz,
Li Wan,
Xin Lei,
Ming Sun,
Frank Seide
Abstract:
Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality,…
▽ More
Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noises, resulting in degradation to speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using the multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatic learning directional characteristics effectively improves separation quality. We further compare the ASR performance leveraging separated outputs to noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform the joint training of the directional source separation and ASR model, achieving the best overall ASR performance.
△ Less
Submitted 19 September, 2023;
originally announced September 2023.
-
Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer
Authors:
Zhi** Ge,
Fanhua Shang,
Hongying Liu,
Yuanyuan Liu,
Liang Wan,
Wei Feng,
Xiaosen Wang
Abstract:
Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation…
▽ More
Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation is one of the most effective methods. In this work, we notice that existing input transformation-based works mainly adopt the transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve the transferability using the data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix up the generated images added by random noise with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that our proposed method can significantly improve the adversarial transferability on either normally trained models or adversarially trained models than state-of-the-art input transformation-based attacks. Code is available at: https://github.com/Zhi**-Ge/STM.
△ Less
Submitted 21 August, 2023;
originally announced August 2023.
-
High-performance Data Management for Whole Slide Image Analysis in Digital Pathology
Authors:
Haoju Leng,
Ruining Deng,
Shunxing Bao,
Dazheng Fang,
Bryan A. Millis,
Yucheng Tang,
Haichun Yang,
Xiao Wang,
Yifan Peng,
Lipeng Wan,
Yuankai Huo
Abstract:
When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of data records holds relevance during each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system. This is particularly notable as patch-level processing introduces a considerable I/O…
▽ More
When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of data records holds relevance during each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system. This is particularly notable as patch-level processing introduces a considerable I/O load onto the computer system. However, this data management process could be further paralleled, given the typical independence of patch-level image processes across different patches. This paper details our endeavors in tackling this data access challenge by implementing the Adaptable IO System version 2 (ADIOS2). Our focus has been constructing and releasing a digital pathology-centric pipeline using ADIOS2, which facilitates streamlined data management across WSIs. Additionally, we've developed strategies aimed at curtailing data retrieval times. The performance evaluation encompasses two key scenarios: (1) a pure CPU-based image analysis scenario ("CPU scenario"), and (2) a GPU-based deep learning framework scenario ("GPU scenario"). Our findings reveal noteworthy outcomes. Under the CPU scenario, ADIOS2 showcases an impressive two-fold speed-up compared to the brute-force approach. In the GPU scenario, its performance stands on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS). From what we know, this appears to be among the initial instances, if any, of utilizing ADIOS2 within the field of digital pathology. The source code has been made publicly available at https://github.com/hrlblab/adios.
△ Less
Submitted 20 August, 2023; v1 submitted 10 August, 2023;
originally announced August 2023.
-
Semi-automated Thermal Envelope Model Setup for Adaptive Model Predictive Control with Event-triggered System Identification
Authors:
Lu Wan,
Xiaobing Dai,
Torsten Welfonder,
Ekaterina Petrova,
Pieter Pauwels
Abstract:
To reach carbon neutrality in the middle of this century, smart controls for building energy systems are urgently required. Model predictive control (MPC) demonstrates great potential in improving the performance of heating ventilation and air-conditioning (HVAC) systems, whereas its wide application in the building sector is impeded by the considerable manual efforts involved in setting up the co…
▽ More
To reach carbon neutrality in the middle of this century, smart controls for building energy systems are urgently required. Model predictive control (MPC) demonstrates great potential in improving the performance of heating ventilation and air-conditioning (HVAC) systems, whereas its wide application in the building sector is impeded by the considerable manual efforts involved in setting up the control-oriented model. To facilitate the system identification (SI) of the building envelope as well as the configuration of the MPC algorithms with less human intervention, a semantic-assisted control framework is proposed in this paper. We first integrate different data sources required by the MPC algorithms such as the building topology, HVAC systems, sensor data stream and control settings in the form of a knowledge graph and then employ the data to set up the MPC algorithm automatically. Moreover, an event-triggered SI scheme is designed, to ensure the computational efficiency and accuracy of the MPC algorithm simultaneously. The proposed method is validated via simulations. The results demonstrate the practical relevance and effectiveness of the proposed semantics-assisted MPC framework with event-triggered learning of system dynamics.
△ Less
Submitted 11 July, 2023; v1 submitted 2 July, 2023;
originally announced July 2023.
-
Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement
Authors:
Liang Wan,
Hongqing Liu,
Yi Zhou,
Jie Ji
Abstract:
The Dual-Path Convolution Recurrent Network (DPCRN) was proposed to effectively exploit time-frequency domain information. By combining the DPRNN module with Convolution Recurrent Network (CRN), the DPCRN obtained a promising performance in speech separation with a limited model size. In this paper, we explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Ne…
▽ More
The Dual-Path Convolution Recurrent Network (DPCRN) was proposed to effectively exploit time-frequency domain information. By combining the DPRNN module with Convolution Recurrent Network (CRN), the DPCRN obtained a promising performance in speech separation with a limited model size. In this paper, we explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention(MNTFA) for speech enhancement. We use self-attention modules to exploit the long-time information, where the intra-chunk self-attentions are used to model the spectrum pattern and the inter-chunk self-attention are used to model the dependence between consecutive frames. Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation, which is more suitable for long sequences of speech signals. In addition, we propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network. Experiments show that with only 0.23M parameters, the proposed model achieves a better performance than DPCRN.
△ Less
Submitted 15 June, 2023;
originally announced June 2023.
-
An Accelerated Pipeline for Multi-label Renal Pathology Image Segmentation at the Whole Slide Image Level
Authors:
Haoju Leng,
Ruining Deng,
Zuhayr Asad,
R. Michael Womick,
Haichun Yang,
Lipeng Wan,
Yuankai Huo
Abstract:
Deep-learning techniques have been used widely to alleviate the labour-intensive and time-consuming manual annotation required for pixel-level tissue characterization. Our previous study introduced an efficient single dynamic network - Omni-Seg - that achieved multi-class multi-scale pathological segmentation with less computational complexity. However, the patch-wise segmentation paradigm still a…
▽ More
Deep-learning techniques have been used widely to alleviate the labour-intensive and time-consuming manual annotation required for pixel-level tissue characterization. Our previous study introduced an efficient single dynamic network - Omni-Seg - that achieved multi-class multi-scale pathological segmentation with less computational complexity. However, the patch-wise segmentation paradigm still applies to Omni-Seg, and the pipeline is time-consuming when providing segmentation for Whole Slide Images (WSIs). In this paper, we propose an enhanced version of the Omni-Seg pipeline in order to reduce the repetitive computing processes and utilize a GPU to accelerate the model's prediction for both better model performance and faster speed. Our proposed method's innovative contribution is two-fold: (1) a Docker is released for an end-to-end slide-wise multi-tissue segmentation for WSIs; and (2) the pipeline is deployed on a GPU to accelerate the prediction, achieving better segmentation quality in less time. The proposed accelerated implementation reduced the average processing time (at the testing stage) on a standard needle biopsy WSI from 2.3 hours to 22 minutes, using 35 WSIs from the Kidney Tissue Atlas (KPMP) Datasets. The source code and the Docker have been made publicly available at https://github.com/ddrrnn123/Omni-Seg.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
Diff-UNet: A Diffusion Embedded Network for Volumetric Segmentation
Authors:
Zhaohu Xing,
Liang Wan,
Huazhu Fu,
Guang Yang,
Lei Zhu
Abstract:
In recent years, Denoising Diffusion Models have demonstrated remarkable success in generating semantically valuable pixel-wise representations for image generative modeling. In this study, we propose a novel end-to-end framework, called Diff-UNet, for medical volumetric segmentation. Our approach integrates the diffusion model into a standard U-shaped architecture to extract semantic information…
▽ More
In recent years, Denoising Diffusion Models have demonstrated remarkable success in generating semantically valuable pixel-wise representations for image generative modeling. In this study, we propose a novel end-to-end framework, called Diff-UNet, for medical volumetric segmentation. Our approach integrates the diffusion model into a standard U-shaped architecture to extract semantic information from the input volume effectively, resulting in excellent pixel-level representations for medical volumetric segmentation. To enhance the robustness of the diffusion model's prediction results, we also introduce a Step-Uncertainty based Fusion (SUF) module during inference to combine the outputs of the diffusion models at each step. We evaluate our method on three datasets, including multimodal brain tumors in MRI, liver tumors, and multi-organ CT volumes, and demonstrate that Diff-UNet outperforms other state-of-the-art methods significantly. Our experimental results also indicate the universality and effectiveness of the proposed model. The proposed framework has the potential to facilitate the accurate diagnosis and treatment of medical conditions by enabling more precise segmentation of anatomical structures. The codes of Diff-UNet are available at https://github.com/ge-xing/Diff-UNet
△ Less
Submitted 18 March, 2023;
originally announced March 2023.
-
Handling the Alignment for Wake Word Detection: A Comparison Between Alignment-Based, Alignment-Free and Hybrid Approaches
Authors:
Vinicius Ribeiro,
Yiteng Huang,
Yuan Shangguan,
Zhaojun Yang,
Li Wan,
Ming Sun
Abstract:
Wake word detection exists in most intelligent homes and portable devices. It offers these devices the ability to "wake up" when summoned at a low cost of power and computing. This paper focuses on understanding alignment's role in develo** a wake-word system that answers a generic phrase. We discuss three approaches. The first is alignment-based, where the model is trained with frame-wise cross…
▽ More
Wake word detection exists in most intelligent homes and portable devices. It offers these devices the ability to "wake up" when summoned at a low cost of power and computing. This paper focuses on understanding alignment's role in develo** a wake-word system that answers a generic phrase. We discuss three approaches. The first is alignment-based, where the model is trained with frame-wise cross-entropy. The second is alignment-free, where the model is trained with CTC. The third, proposed by us, is a hybrid solution in which the model is trained with a small set of aligned data and then tuned with a sizeable unaligned dataset. We compare the three approaches and evaluate the impact of the different aligned-to-unaligned ratios for hybrid training. Our results show that the alignment-free system performs better than the alignment-based for the target operating point, and with a small fraction of the data (20%), we can train a model that complies with our initial constraints.
△ Less
Submitted 7 June, 2023; v1 submitted 17 February, 2023;
originally announced February 2023.
-
LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting
Authors:
Haichuan Yang,
Zhaojun Yang,
Li Wan,
Biqiao Zhang,
Yangyang Shi,
Yiteng Huang,
Ivaylo Enchev,
Limin Tang,
Raziel Alvarez,
Ming Sun,
Xin Lei,
Raghuraman Krishnamoorthi,
Vikas Chandra
Abstract:
This paper proposes a hardware-efficient architecture, Linearized Convolution Network (LiCo-Net) for keyword spotting. It is optimized specifically for low-power processor units like microcontrollers. ML operators exhibit heterogeneous efficiency profiles on power-efficient hardware. Given the exact theoretical computation cost, int8 operators are more computation-effective than float operators, a…
▽ More
This paper proposes a hardware-efficient architecture, Linearized Convolution Network (LiCo-Net) for keyword spotting. It is optimized specifically for low-power processor units like microcontrollers. ML operators exhibit heterogeneous efficiency profiles on power-efficient hardware. Given the exact theoretical computation cost, int8 operators are more computation-effective than float operators, and linear layers are often more efficient than other layers. The proposed LiCo-Net is a dual-phase system that uses the efficient int8 linear operators at the inference phase and applies streaming convolutions at the training phase to maintain a high model capacity. The experimental results show that LiCo-Net outperforms single-value decomposition filter (SVDF) on hardware efficiency with on-par detection performance. Compared to SVDF, LiCo-Net reduces cycles by 40% on HiFi4 DSP.
△ Less
Submitted 8 November, 2022;
originally announced November 2022.
-
NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation
Authors:
Zhaohu Xing,
Lequan Yu,
Liang Wan,
Tong Han,
Lei Zhu
Abstract:
Multi-modal MR imaging is routinely used in clinical practice to diagnose and investigate brain tumors by providing rich complementary information. Previous multi-modal MRI segmentation methods usually perform modal fusion by concatenating multi-modal MRIs at an early/middle stage of the network, which hardly explores non-linear dependencies between modalities. In this work, we propose a novel Nes…
▽ More
Multi-modal MR imaging is routinely used in clinical practice to diagnose and investigate brain tumors by providing rich complementary information. Previous multi-modal MRI segmentation methods usually perform modal fusion by concatenating multi-modal MRIs at an early/middle stage of the network, which hardly explores non-linear dependencies between modalities. In this work, we propose a novel Nested Modality-Aware Transformer (NestedFormer) to explicitly explore the intra-modality and inter-modality relationships of multi-modal MRIs for brain tumor segmentation. Built on the transformer-based multi-encoder and single-decoder structure, we perform nested multi-modal fusion for high-level representations of different modalities and apply modality-sensitive gating (MSG) at lower scales for more effective skip connections. Specifically, the multi-modal fusion is conducted in our proposed Nested Modality-aware Feature Aggregation (NMaFA) module, which enhances long-term dependencies within individual modalities via a tri-orientated spatial-attention transformer, and further complements key contextual information among modalities via a cross-modality attention transformer. Extensive experiments on BraTS2020 benchmark and a private meningiomas segmentation (MeniSeg) dataset show that the NestedFormer clearly outperforms the state-of-the-arts. The code is available at https://github.com/920232796/NestedFormer.
△ Less
Submitted 31 August, 2022;
originally announced August 2022.
-
Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization
Authors:
Mufan Sang,
Haoqi Li,
Fang Liu,
Andrew O. Arnold,
Li Wan
Abstract:
Training speaker-discriminative and robust speaker verification systems without speaker labels is still challenging and worthwhile to explore. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Different from contrastive learning-based self-supervised learning methods, the prop…
▽ More
Training speaker-discriminative and robust speaker verification systems without speaker labels is still challenging and worthwhile to explore. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Different from contrastive learning-based self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data augmentation strategies on both the time domain and frequency domain. With our strong online data augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without using negative pairs and it can significantly improve the performance of self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement by adding the effective self-supervised regularization and outperforms other previous works.
△ Less
Submitted 1 February, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Joint CFO, Gridless Channel Estimation and Data Detection for Underwater Acoustic OFDM Systems
Authors:
Lei Wan,
Jiang Zhu,
En Cheng,
Zhiwei Xu
Abstract:
In this paper, we propose an iterative receiver based on gridless variational Bayesian line spectra estimation (VALSE) named JCCD-VALSE that \emph{j}ointly estimates the \emph{c}arrier frequency offset (CFO), the \emph{c}hannel with high resolution and carries out \emph{d}ata decoding. Based on a modularized point of view and motivated by the high resolution and low complexity gridless VALSE algor…
▽ More
In this paper, we propose an iterative receiver based on gridless variational Bayesian line spectra estimation (VALSE) named JCCD-VALSE that \emph{j}ointly estimates the \emph{c}arrier frequency offset (CFO), the \emph{c}hannel with high resolution and carries out \emph{d}ata decoding. Based on a modularized point of view and motivated by the high resolution and low complexity gridless VALSE algorithm, three modules named the VALSE module, the minimum mean squared error (MMSE) module and the decoder module are built. Soft information is exchanged between the modules to progressively improve the channel estimation and data decoding accuracy. Since the delays of multipaths of the channel are treated as continuous parameters, instead of on a grid, the leakage effect is avoided. Besides, the proposed approach is a more complete Bayesian approach as all the nuisance parameters such as the noise variance, the parameters of the prior distribution of the channel, the number of paths are automatically estimated. Numerical simulations and sea test data are utilized to demonstrate that the proposed approach performs significantly better than the existing grid-based generalized approximate message passing (GAMP) based \emph{j}oint \emph{c}hannel and \emph{d}ata decoding approach (JCD-GAMP). Furthermore, it is also verified that joint processing including CFO estimation provides performance gain.
△ Less
Submitted 14 July, 2021;
originally announced July 2021.
-
Potential Advantages of Peak Picking Multi-Voltage Threshold Digitizer in Energy Determination in Radiation Measurement
Authors:
Kezhang Zhu,
Junhua Mei,
Yuming Su,
**** Dai,
Nicola D'Ascenzo,
Hao Wang,
Peng Xiao,
Lin Wan,
Qingguo Xie
Abstract:
The Multi-voltage Threshold (MVT) method, which samples the signal by certain reference voltages, has been well developed as being adopted in pre-clinical and clinical digital positron emission tomography(PET) system. To improve its energy measurement performance, we propose a Peak Picking MVT(PP-MVT) Digitizer in this paper. Firstly, a sampled Peak Point(the highest point in pulse signal), which…
▽ More
The Multi-voltage Threshold (MVT) method, which samples the signal by certain reference voltages, has been well developed as being adopted in pre-clinical and clinical digital positron emission tomography(PET) system. To improve its energy measurement performance, we propose a Peak Picking MVT(PP-MVT) Digitizer in this paper. Firstly, a sampled Peak Point(the highest point in pulse signal), which carries the values of amplitude feature voltage and amplitude arriving time, is added to traditional MVT with a simple peak sampling circuit. Secondly, an amplitude deviation statistical analysis, which compares the energy deviation of various reconstruction models, is used to select adaptive reconstruction models for signal pulses with different amplitudes. After processing 30,000 randomly-chosen pulses sampled by the oscilloscope with a 22Na point source, our method achieves an energy resolution of 17.50% within a 450-650 KeV energy window, which is 2.44% better than the result of traditional MVT with same thresholds; and we get a count number at 15225 in the same energy window while the result of MVT is at 14678. When the PP-MVT involves less thresholds than traditional MVT, the advantages of better energy resolution and larger count number can still be maintained, which shows the robustness and the flexibility of PP-MVT Digitizer. This improved method indicates that adding feature peak information could improve the performance on signal sampling and reconstruction, which canbe proved by the better performance in energy determination in radiation measurement.
△ Less
Submitted 8 March, 2021;
originally announced March 2021.
-
Signal Combination for Language Identification
Authors:
Shengye Wang,
Li Wan,
Yang Yu,
Ignacio Lopez Moreno
Abstract:
Google's multilingual speech recognition system combines low-level acoustic signals with language-specific recognizer signals to better predict the language of an utterance. This paper presents our experience with different signal combination methods to improve overall language identification accuracy. We compare the performance of a lattice-based ensemble model and a deep neural network model to…
▽ More
Google's multilingual speech recognition system combines low-level acoustic signals with language-specific recognizer signals to better predict the language of an utterance. This paper presents our experience with different signal combination methods to improve overall language identification accuracy. We compare the performance of a lattice-based ensemble model and a deep neural network model to combine signals from recognizers with that of a baseline that only uses low-level acoustic signals. Experimental results show that the deep neural network model outperforms the lattice-based ensemble model, and it reduced the error rate from 5.5% in the baseline to 4.3%, which is a 21.8% relative reduction.
△ Less
Submitted 4 November, 2019; v1 submitted 21 October, 2019;
originally announced October 2019.
-
Personal VAD: Speaker-Conditioned Voice Activity Detection
Authors:
Shao** Ding,
Quan Wang,
Shuo-yiin Chang,
Li Wan,
Ignacio Lopez Moreno
Abstract:
In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We…
▽ More
In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
△ Less
Submitted 8 April, 2020; v1 submitted 12 August, 2019;
originally announced August 2019.
-
Multiple Human Association between Top and Horizontal Views by Matching Subjects' Spatial Distributions
Authors:
Ruize Han,
Yujun Zhang,
Wei Feng,
Chenxing Gong,
Xiaoyu Zhang,
Jiewen Zhao,
Liang Wan,
Song Wang
Abstract:
Video surveillance can be significantly enhanced by using both top-view data, e.g., those from drone-mounted cameras in the air, and horizontal-view data, e.g., those from wearable cameras on the ground. Collaborative analysis of different-view data can facilitate various kinds of applications, such as human tracking, person identification, and human activity recognition. However, for such collabo…
▽ More
Video surveillance can be significantly enhanced by using both top-view data, e.g., those from drone-mounted cameras in the air, and horizontal-view data, e.g., those from wearable cameras on the ground. Collaborative analysis of different-view data can facilitate various kinds of applications, such as human tracking, person identification, and human activity recognition. However, for such collaborative analysis, the first step is to associate people, referred to as subjects in this paper, across these two views. This is a very challenging problem due to large human-appearance difference between top and horizontal views. In this paper, we present a new approach to address this problem by exploring and matching the subjects' spatial distributions between the two views. More specifically, on the top-view image, we model and match subjects' relative positions to the horizontal-view camera in both views and define a matching cost to decide the actual location of horizontal-view camera and its view angle in the top-view image. We collect a new dataset consisting of top-view and horizontal-view image pairs for performance evaluation and the experimental results show the effectiveness of the proposed method.
△ Less
Submitted 26 July, 2019;
originally announced July 2019.
-
Tuplemax Loss for Language Identification
Authors:
Li Wan,
Prashant Sridhar,
Yang Yu,
Quan Wang,
Ignacio Lopez Moreno
Abstract:
In many scenarios of a language identification task, the user will specify a small set of languages which he/she can speak instead of a large set of all possible languages. We want to model such prior knowledge into the way we train our neural networks, by replacing the commonly used softmax loss function with a novel loss function named tuplemax loss. As a matter of fact, a typical language ident…
▽ More
In many scenarios of a language identification task, the user will specify a small set of languages which he/she can speak instead of a large set of all possible languages. We want to model such prior knowledge into the way we train our neural networks, by replacing the commonly used softmax loss function with a novel loss function named tuplemax loss. As a matter of fact, a typical language identification system launched in North America has about 95% users who could speak no more than two languages. Using the tuplemax loss, our system achieved a 2.33% error rate, which is a relative 39.4% improvement over the 3.85% error rate of standard softmax loss method.
△ Less
Submitted 17 February, 2019; v1 submitted 29 November, 2018;
originally announced November 2018.
-
DOA and Polarization Estimation for Non-Circular Signals in 3-D Millimeter Wave Polarized Massive MIMO Systems
Authors:
Liangtian Wan,
Kaihui Liu,
Ying-Chang Liang,
Tong Zhu
Abstract:
In this paper, an algorithm of multiple signal classification (MUSIC) is proposed for two-dimensional (2-D) direction of- arrival (DOA) and polarization estimation of non-circular signal in three-dimensional (3-D) millimeter wave polarized largescale/ massive multiple-input-multiple-output (MIMO) systems. The traditional MUSIC-based algorithms can estimate either the DOA and polarization for circu…
▽ More
In this paper, an algorithm of multiple signal classification (MUSIC) is proposed for two-dimensional (2-D) direction of- arrival (DOA) and polarization estimation of non-circular signal in three-dimensional (3-D) millimeter wave polarized largescale/ massive multiple-input-multiple-output (MIMO) systems. The traditional MUSIC-based algorithms can estimate either the DOA and polarization for circular signal or the DOA for non-circular signal by using spectrum search. By contrast, in the proposed algorithm only the DOA estimation needs spectrum search, and the polarization estimation has a closedform expression. First, a novel dimension-reduced MUSIC (DRMUSIC) is proposed for DOA and polarization estimation of circular signal with low computational complexity. Next, based on the quaternion theory, a novel algorithm named quaternion non-circular MUSIC (QNC-MUSIC) is proposed for parameter estimation of non-circular signal with high estimation accuracy. Then based on the DOA estimation result using QNC-MUSIC, the polarization estimation of non-circular signal is acquired by using the closed-form expression of the polarization estimation in DR-MUSIC. In addition, the computational complexity analysis shows that compared with the conventional DOA and polarization estimation algorithms, our proposed QNC-MUSIC and DRMUSIC have much lower computational complexity, especially when the source number is large. The stochastic Cramer-Rao Bound (CRB) for the estimation of the 2-D DOA and polarization parameters of the non-circular signals is derived as well. Finally, numerical examples are provided to demonstrate that the proposed algorithms can improve the parameter estimation performance when the large-scale/massive MIMO systems are employed.
△ Less
Submitted 15 December, 2017;
originally announced December 2017.
-
Attention-Based Models for Text-Dependent Speaker Verification
Authors:
F A Rezaur Rahman Chowdhury,
Quan Wang,
Ignacio Lopez Moreno,
Li Wan
Abstract:
Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the usage of attention mechanisms to the problem of sequence summarization in our end-to-end text-dependen…
▽ More
Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the usage of attention mechanisms to the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and their variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improves the Equal Error Rate (EER) of our speaker verification system by relatively 14% compared to our non-attention LSTM baseline model.
△ Less
Submitted 31 January, 2018; v1 submitted 28 October, 2017;
originally announced October 2017.
-
Speaker Diarization with LSTM
Authors:
Quan Wang,
Carlton Downey,
Li Wan,
Philip Andrew Mansfield,
Ignacio Lopez Moreno
Abstract:
For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vecto…
▽ More
For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.
△ Less
Submitted 23 January, 2022; v1 submitted 28 October, 2017;
originally announced October 2017.
-
Generalized End-to-End Loss for Speaker Verification
Authors:
Li Wan,
Quan Wang,
Alan Papir,
Ignacio Lopez Moreno
Abstract:
In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. Additionally, the GE…
▽ More
In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. Additionally, the GE2E loss does not require an initial stage of example selection. With these properties, our model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60% at the same time. We also introduce the MultiReader technique, which allows us to do domain adaptation - training a more accurate model that supports multiple keywords (i.e. "OK Google" and "Hey Google") as well as multiple dialects.
△ Less
Submitted 9 November, 2020; v1 submitted 28 October, 2017;
originally announced October 2017.