Search | arXiv e-print repository

Understanding What Affects Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence

Authors: Jiafei Lyu, Le Wan, Xiu Li, Zongqing Lu

Abstract: Recently, there are many efforts attempting to learn useful policies for continuous control in visual reinforcement learning (RL). In this scenario, it is important to learn a generalizable policy, as the testing environment may differ from the training environment, e.g., there exist distractors during deployment. Many practical algorithms are proposed to handle this problem. However, to the best of our knowledge, none of them provide a theoretical understanding of what affects the generalization gap and why their proposed methods work. In this paper, we bridge this issue by theoretically answering the key factors that contribute to the generalization gap when the testing environment has distractors. Our theories indicate that minimizing the representation distance between training and testing environments, which aligns with human intuition, is the most critical for the benefit of reducing the generalization gap. Our theoretical results are supported by the empirical evidence in the DMControl Generalization Benchmark (DMC-GB).

Submitted 4 February, 2024; originally announced February 2024.

Comments: Part of this work is accepted as AAMAS 2024 extended abstract

arXiv:2310.04367 [pdf]

A Marketplace Price Anomaly Detection System at Scale

Authors: Akshit Sarpal, Qiwen Kang, Fangping Huang, Yang Song, Lijie Wan

Abstract: Online marketplaces execute large volume of price updates that are initiated by individual marketplace sellers each day on the platform. This price democratization comes with increasing challenges with data quality. Lack of centralized guardrails that are available for a traditional online retailer causes a higher likelihood for inaccurate prices to get published on the website, leading to poor customer experience and potential for revenue loss. We present MoatPlus (Masked Optimal Anchors using Trees, Proximity-based Labeling and Unsupervised Statistical-features), a scalable price anomaly detection framework for a growing marketplace platform. The goal is to leverage proximity and historical price trends from unsupervised statistical features to generate an upper price bound. We build an ensemble of models to detect irregularities in price-based features, exclude irregular features and use optimized weighting scheme to build a reliable price bound in real-time pricing pipeline. We observed that our approach improves precise anchor coverage by up to 46.6% in high-vulnerability item subsets.

Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: 10 pages, 4 figures, 7 tables

arXiv:2306.08956 [pdf, other]

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Authors: Liang Wan, Hongqing Liu, Yi Zhou, Jie Ji

Abstract: The Dual-Path Convolution Recurrent Network (DPCRN) was proposed to effectively exploit time-frequency domain information. By combining the DPRNN module with Convolution Recurrent Network (CRN), the DPCRN obtained a promising performance in speech separation with a limited model size. In this paper, we explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement. We use self-attention modules to exploit the long-time information, where the intra-chunk self-attentions are used to model the spectrum pattern and the inter-chunk self-attention are used to model the dependence between consecutive frames. Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation, which is more suitable for long sequences of speech signals. In addition, we propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network. Experiments show that with only 0.23M parameters, the proposed model achieves a better performance than DPCRN.

Submitted 15 June, 2023; originally announced June 2023.

arXiv:1910.09687 [pdf, other]

Signal Combination for Language Identification

Authors: Shengye Wang, Li Wan, Yang Yu, Ignacio Lopez Moreno

Abstract: Google's multilingual speech recognition system combines low-level acoustic signals with language-specific recognizer signals to better predict the language of an utterance. This paper presents our experience with different signal combination methods to improve overall language identification accuracy. We compare the performance of a lattice-based ensemble model and a deep neural network model to combine signals from recognizers with that of a baseline that only uses low-level acoustic signals. Experimental results show that the deep neural network model outperforms the lattice-based ensemble model, and it reduced the error rate from 5.5% in the baseline to 4.3%, which is a 21.8% relative reduction.

Submitted 4 November, 2019; v1 submitted 21 October, 2019; originally announced October 2019.

arXiv:1909.11532 [pdf, other]

Deep Neural Network Framework Based on Backward Stochastic Differential Equations for Pricing and Hedging American Options in High Dimensions

Authors: Yangang Chen, Justin W. L. Wan

Abstract: We propose a deep neural network framework for computing prices and deltas of American options in high dimensions. The architecture of the framework is a sequence of neural networks, where each network learns the difference of the price functions between adjacent timesteps. We introduce the least squares residual of the associated backward stochastic differential equation as the loss function. Our proposed framework yields prices and deltas on the entire spacetime, not only at a given point. The computational cost of the proposed approach is quadratic in dimension, which addresses the curse of dimensionality issue that state-of-the-art approaches suffer. Our numerical simulations demonstrate these contributions, and show that the proposed neural network framework outperforms state-of-the-art approaches in high dimensions.

Submitted 25 September, 2019; originally announced September 2019.

Comments: 35 pages, 11 figures, 15 tables

arXiv:1908.04284 [pdf, other]

Personal VAD: Speaker-Conditioned Voice Activity Detection

Authors: Shaojin Ding, Quan Wang, Shuo-yiin Chang, Li Wan, Ignacio Lopez Moreno

Abstract: In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is unpreferable. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.

Submitted 8 April, 2020; v1 submitted 12 August, 2019; originally announced August 2019.

Comments: Speaker Odyssey 2020

arXiv:1811.12290 [pdf, other]

Tuplemax Loss for Language Identification

Authors: Li Wan, Prashant Sridhar, Yang Yu, Quan Wang, Ignacio Lopez Moreno

Abstract: In many scenarios of a language identification task, the user will specify a small set of languages which he/she can speak instead of a large set of all possible languages. We want to model such prior knowledge into the way we train our neural networks, by replacing the commonly used softmax loss function with a novel loss function named tuplemax loss. As a matter of fact, a typical language identification system launched in North America has about 95% users who could speak no more than two languages. Using the tuplemax loss, our system achieved a 2.33% error rate, which is a relative 39.4% improvement over the 3.85% error rate of standard softmax loss method.

Submitted 17 February, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

Comments: Submitted to ICASSP 2019

arXiv:1801.10123 [pdf, ps, other]

Links: A High-Dimensional Online Clustering Method

Authors: Philip Andrew Mansfield, Quan Wang, Carlton Downey, Li Wan, Ignacio Lopez Moreno

Abstract: We present a novel algorithm, called Links, designed to perform online clustering on unit vectors in a high-dimensional Euclidean space. The algorithm is appropriate when it is necessary to cluster data efficiently as it streams in, and is to be contrasted with traditional batch clustering algorithms that have access to all data at once. For example, Links has been successfully applied to embedding vectors generated from face images or voice recordings for the purpose of recognizing people, thereby providing real-time identification during video or audio capture.

Submitted 30 January, 2018; originally announced January 2018.

arXiv:1710.10470 [pdf, other]

Attention-Based Models for Text-Dependent Speaker Verification

Authors: F A Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, Li Wan

Abstract: Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the usage of attention mechanisms to the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and their variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by relatively 14% compared to our non-attention LSTM baseline model.

Submitted 31 January, 2018; v1 submitted 28 October, 2017; originally announced October 2017.

Comments: Submitted to ICASSP 2018

arXiv:1710.10468 [pdf, other]

Speaker Diarization with LSTM

Authors: Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, Ignacio Lopez Moreno

Abstract: For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.

Submitted 23 January, 2022; v1 submitted 28 October, 2017; originally announced October 2017.

Comments: Published at ICASSP 2018

arXiv:1710.10467 [pdf, other]

Generalized End-to-End Loss for Speaker Verification

Authors: Li Wan, Quan Wang, Alan Papir, Ignacio Lopez Moreno

Abstract: In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. Additionally, the GE2E loss does not require an initial stage of example selection. With these properties, our model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60% at the same time. We also introduce the MultiReader technique, which allows us to do domain adaptation - training a more accurate model that supports multiple keywords (i.e. "OK Google" and "Hey Google") as well as multiple dialects.

Submitted 9 November, 2020; v1 submitted 28 October, 2017; originally announced October 2017.

Comments: Published at ICASSP 2018

Showing 1–11 of 11 results for author: Wan, L