Search | arXiv e-print repository

arXiv:2308.14763 [pdf, other]

VoiceBank-2023: A Multi-Speaker Mandarin Speech Corpus for Constructing Personalized TTS Systems for the Speech Impaired

Authors: Jia-Jyu Su, Pang-Chen Liao, Yen-Ting Lin, Wu-Hao Li, Guan-Ting Liou, Cheng-Che Kao, Wei-Cheng Chen, Jen-Chieh Chiang, Wen-Yang Chang, Pin-Han Lin, Chen-Yu Chiang

Abstract: Services of personalized TTS systems for the Mandarin-speaking speech impaired are rarely mentioned. Taiwan started the VoiceBanking project in 2020, aiming to build a complete set of services to deliver personalized Mandarin TTS systems to amyotrophic lateral sclerosis patients. This paper reports the corpus design, corpus recording, data purging and correction for the corpus, and evaluations of… ▽ More Services of personalized TTS systems for the Mandarin-speaking speech impaired are rarely mentioned. Taiwan started the VoiceBanking project in 2020, aiming to build a complete set of services to deliver personalized Mandarin TTS systems to amyotrophic lateral sclerosis patients. This paper reports the corpus design, corpus recording, data purging and correction for the corpus, and evaluations of the developed personalized TTS systems, for the VoiceBanking project. The developed corpus is named after the VoiceBank-2023 speech corpus because of its release year. The corpus contains 29.78 hours of utterances with prompts of short paragraphs and common phrases spoken by 111 native Mandarin speakers. The corpus is labeled with information about gender, degree of speech impairment, types of users, transcription, SNRs, and speaking rates. The VoiceBank-2023 is available by request for non-commercial use and welcomes all parties to join the VoiceBanking project to improve the services for the speech impaired. △ Less

Submitted 27 August, 2023; originally announced August 2023.

Comments: submitted to 26th International Conference of the ORIENTAL-COCOSDA

arXiv:2306.05085 [pdf, other]

TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception

Authors: Yi-Hsin Chen, Ying-Chieh Weng, Chia-Hao Kao, Cheng Chien, Wei-Chen Chiu, Wen-Hsiao Peng

Abstract: This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific pr… ▽ More This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific prompts to the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and outperforms the competing methods significantly. To our best knowledge, this work is the first attempt to utilize prompting on the low-level image compression task. △ Less

Submitted 18 August, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Accepted to ICCV 2023

arXiv:2305.10807 [pdf, other]

Transformer-based Variable-rate Image Compression with Region-of-interest Control

Authors: Chia-Hao Kao, Ying-Chieh Weng, Yi-Hsin Chen, Wei-Chen Chiu, Wen-Hsiao Peng

Abstract: This paper proposes a transformer-based learned image compression system. It is capable of achieving variable-rate compression with a single model while supporting the region-of-interest (ROI) functionality. Inspired by prompt tuning, we introduce prompt generation networks to condition the transformer-based autoencoder of compression. Our prompt generation networks generate content-adaptive token… ▽ More This paper proposes a transformer-based learned image compression system. It is capable of achieving variable-rate compression with a single model while supporting the region-of-interest (ROI) functionality. Inspired by prompt tuning, we introduce prompt generation networks to condition the transformer-based autoencoder of compression. Our prompt generation networks generate content-adaptive tokens according to the input image, an ROI mask, and a rate parameter. The separation of the ROI mask and the rate parameter allows an intuitive way to achieve variable-rate and ROI coding simultaneously. Extensive experiments validate the effectiveness of our proposed method and confirm its superiority over the other competing methods. △ Less

Submitted 1 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted to IEEE ICIP 2023

arXiv:2303.10351 [pdf, other]

Weight-sharing Supernet for Searching Specialized Acoustic Event Classification Networks Across Device Constraints

Authors: Guan-Ting Lin, Qingming Tang, Chieh-Chi Kao, Viktor Rozgic, Chao Wang

Abstract: Acoustic Event Classification (AEC) has been widely used in devices such as smart speakers and mobile phones for home safety or accessibility support. As AEC models run on more and more devices with diverse computation resource constraints, it became increasingly expensive to develop models that are tuned to achieve optimal accuracy/computation trade-off for each given computation resource constra… ▽ More Acoustic Event Classification (AEC) has been widely used in devices such as smart speakers and mobile phones for home safety or accessibility support. As AEC models run on more and more devices with diverse computation resource constraints, it became increasingly expensive to develop models that are tuned to achieve optimal accuracy/computation trade-off for each given computation resource constraint. In this paper, we introduce a Once-For-All (OFA) Neural Architecture Search (NAS) framework for AEC. Specifically, we first train a weight-sharing supernet that supports different model architectures, followed by automatically searching for a model given specific computational resource constraints. Our experimental results showed that by just training once, the resulting model from NAS significantly outperforms both models trained individually from scratch and knowledge distillation (25.4% and 7.3% relative improvement). We also found that the benefit of weight-sharing supernet training of ultra-small models comes not only from searching but from optimization. △ Less

Submitted 18 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2301.04775 [pdf, ps, other]

On Phase Change Rate Maximization with Practical Applications

Authors: Chung-Yao Kao, Shinji Hara, Yutaka Hori, Tetsuya Iwasaki, Sei Zhen Khong

Abstract: We recapitulate the notion of phase change rate maximization and demonstrate the usefulness of its solution on analyzing the robust instability of a cyclic network of multi-agent systems subject to a homogenous multiplicative perturbation. Subsequently, we apply the phase change rate maximization result to two practical applications. The first is a magnetic levitation system, while the second is a… ▽ More We recapitulate the notion of phase change rate maximization and demonstrate the usefulness of its solution on analyzing the robust instability of a cyclic network of multi-agent systems subject to a homogenous multiplicative perturbation. Subsequently, we apply the phase change rate maximization result to two practical applications. The first is a magnetic levitation system, while the second is a repressilator with time-delay in synthetic biology. △ Less

Submitted 4 April, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

arXiv:2209.13210 [pdf, other]

Neural Frank-Wolfe Policy Optimization for Region-of-Interest Intra-Frame Coding with HEVC/H.265

Authors: Yung-Han Ho, Chia-Hao Kao, Wen-Hsiao Peng, **-Chun Hsieh

Abstract: This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently,… ▽ More This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently, the dual-critic design is proposed to update the actor by alternating the rate and distortion critics. However, its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the CTU-level bit allocation as an action-constrained RL problem. In this new framework, we exploit a rate critic to predict a feasible set of actions. With this feasible set, a distortion critic is invoked to update the actor to maximize the ROI-weighted image quality subject to a rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method to the other baselines. △ Less

Submitted 27 September, 2022; originally announced September 2022.

Comments: Accepted by VCIP 2022. arXiv admin note: text overlap with arXiv:2203.05127

arXiv:2205.09658 [pdf, other]

Image-Based Conditioning for Action Policy Smoothness in Autonomous Miniature Car Racing with Reinforcement Learning

Authors: Bo-Jiun Hsu, Hoang-Giang Cao, I Lee, Chih-Yu Kao, **-Bo Huang, I-Chen Wu

Abstract: In recent years, deep reinforcement learning has achieved significant results in low-level controlling tasks. However, the problem of control smoothness has less attention. In autonomous driving, unstable control is inevitable since the vehicle might suddenly change its actions. This problem will lower the controlling system's efficiency, induces excessive mechanical wear, and causes uncontrollabl… ▽ More In recent years, deep reinforcement learning has achieved significant results in low-level controlling tasks. However, the problem of control smoothness has less attention. In autonomous driving, unstable control is inevitable since the vehicle might suddenly change its actions. This problem will lower the controlling system's efficiency, induces excessive mechanical wear, and causes uncontrollable, dangerous behavior to the vehicle. In this paper, we apply the Conditioning for Action Policy Smoothness (CAPS) with image-based input to smooth the control of an autonomous miniature car racing. Applying CAPS and sim-to-real transfer methods helps to stabilize the control at a higher speed. Especially, the agent with CAPS and CycleGAN reduces 21.80% of the average finishing lap time. Moreover, we also conduct extensive experiments to analyze the impact of CAPS components. △ Less

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2203.11997 [pdf, other]

Federated Self-Supervised Learning for Acoustic Event Classification

Authors: Meng Feng, Chieh-Chi Kao, Qingming Tang, Ming Sun, Viktor Rozgic, Spyros Matsoukas, Chao Wang

Abstract: Standard acoustic event classification (AEC) solutions require large-scale collection of data from client devices for model optimization. Federated learning (FL) is a compelling framework that decouples data collection and model training to enhance customer privacy. In this work, we investigate the feasibility of applying FL to improve AEC performance while no customer data can be directly uploade… ▽ More Standard acoustic event classification (AEC) solutions require large-scale collection of data from client devices for model optimization. Federated learning (FL) is a compelling framework that decouples data collection and model training to enhance customer privacy. In this work, we investigate the feasibility of applying FL to improve AEC performance while no customer data can be directly uploaded to the server. We assume no pseudo labels can be inferred from on-device user inputs, aligning with the typical use cases of AEC. We adapt self-supervised learning to the FL framework for on-device continual learning of representations, and it results in improved performance of the downstream AEC classifiers without labeled/pseudo-labeled data available. Compared to the baseline w/o FL, the proposed method improves precision up to 20.3\% relatively while maintaining the recall. Our work differs from prior work in FL in that our approach does not require user-generated learning targets, and the data we use is collected from our Beta program and is de-identified, to maximally simulate the production settings. △ Less

Submitted 22 March, 2022; originally announced March 2022.

arXiv:2203.05127 [pdf, other]

Action-Constrained Reinforcement Learning for Frame-Level Bit Allocation in HEVC/H.265 through Frank-Wolfe Policy Optimization

Authors: Yung-Han Ho, Yun Liang, Chia-Hao Kao, Wen-Hsiao Peng

Abstract: This paper presents a reinforcement learning (RL) framework that leverages Frank-Wolfe policy optimization to address frame-level bit allocation for HEVC/H.265. Most previous RL-based approaches adopt the single-critic design, which weights the rewards for distortion minimization and rate regularization by an empirically chosen hyper-parameter. More recently, the dual-critic design is proposed to… ▽ More This paper presents a reinforcement learning (RL) framework that leverages Frank-Wolfe policy optimization to address frame-level bit allocation for HEVC/H.265. Most previous RL-based approaches adopt the single-critic design, which weights the rewards for distortion minimization and rate regularization by an empirically chosen hyper-parameter. More recently, the dual-critic design is proposed to update the actor network by alternating the rate and distortion critics. However, the convergence of training is not guaranteed. To address this issue, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the frame-level bit allocation as an action-constrained RL problem. In this new framework, the rate critic serves to specify a feasible action set, and the distortion critic updates the actor network towards maximizing the reconstruction quality while conforming to the action constraint. Experimental results show that when trained to optimize the video multi-method assessment fusion (VMAF) metric, our NFWPO-based model outperforms both the single-critic and the dual-critic methods. It also demonstrates comparable rate-distortion performance to the 2-pass average bit rate control of x265. △ Less

Submitted 9 March, 2022; originally announced March 2022.

arXiv:2202.09500 [pdf, ps, other]

Exact Instability Margin Analysis and Minimum-Norm Strong Stabilization -- phase change rate maximization --

Authors: Shinji Hara, Chung-Yao Kao, Sei Zhen Khong, Tetsuya Iwasaki, Yutaka Hori

Abstract: This paper is concerned with a new optimization problem named "phase change rate maximization" for single-input-single-output linear time-invariant systems. The problem relates to two control problems, namely robust instability analysis against stable perturbations and minimum-norm strong stabilization. We define an index of the instability margin called "robust instability radius (RIR)" as the sm… ▽ More This paper is concerned with a new optimization problem named "phase change rate maximization" for single-input-single-output linear time-invariant systems. The problem relates to two control problems, namely robust instability analysis against stable perturbations and minimum-norm strong stabilization. We define an index of the instability margin called "robust instability radius (RIR)" as the smallest $H_\infty$-norm of a stable perturbation that stabilizes a given unstable system. This paper has two main contributions. It is first shown that the problem of finding the exact RIR via the small-gain condition can be transformed into the problem of maximizing the phase change rate at the peak frequency with a phase constraint. Then, we show that the maximum is attained by a constant or a first-order all-pass function and derive conditions, under which the RIR can be exactly characterized, in terms of the phase change rate. Two practical applications are provided to illustrate the utility of our results. △ Less

Submitted 8 October, 2023; v1 submitted 18 February, 2022; originally announced February 2022.

arXiv:2202.00673 [pdf, other]

doi 10.21437/SPSC.2021-4

Visualizing Automatic Speech Recognition -- Means for a Better Understanding?

Authors: Karla Markert, Romain Parracone, Mykhailo Kulakov, Philip Sperl, Ching-Yu Kao, Konstantin Böttinger

Abstract: Automatic speech recognition (ASR) is improving ever more at mimicking human speech processing. The functioning of ASR, however, remains to a large extent obfuscated by the complex structure of the deep neural networks (DNNs) they are based on. In this paper, we show how so-called attribution methods, that we import from image recognition and suitably adapt to handle audio data, can help to clarif… ▽ More Automatic speech recognition (ASR) is improving ever more at mimicking human speech processing. The functioning of ASR, however, remains to a large extent obfuscated by the complex structure of the deep neural networks (DNNs) they are based on. In this paper, we show how so-called attribution methods, that we import from image recognition and suitably adapt to handle audio data, can help to clarify the working of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as in the detection of adversarial examples. △ Less

Submitted 1 February, 2022; originally announced February 2022.

Comments: Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication

arXiv:2102.03229 [pdf, other]

Multi-Task Self-Supervised Pre-Training for Music Classification

Authors: Ho-Hsiang Wu, Chieh-Chi Kao, Qingming Tang, Ming Sun, Brian McFee, Juan Pablo Bello, Chao Wang

Abstract: Deep learning is very data hungry, and supervised learning especially requires massive labeled data to work well. Machine listening research often suffers from limited labeled data problem, as human annotations are costly to acquire, and annotations for audio are time consuming and less intuitive. Besides, models learned from labeled dataset often embed biases specific to that particular dataset.… ▽ More Deep learning is very data hungry, and supervised learning especially requires massive labeled data to work well. Machine listening research often suffers from limited labeled data problem, as human annotations are costly to acquire, and annotations for audio are time consuming and less intuitive. Besides, models learned from labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques become popular approaches in solving machine listening problems. Particularly, a self-supervised learning technique utilizing reconstructions of multiple hand-crafted audio features has shown promising results when it is applied to speech domain such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods for pre-training music encoders, and explore various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks. We investigate how these design choices interact with various downstream music classification tasks. We find that using various music specific workers altogether with weighting mechanisms to balance the losses during pre-training helps improve and generalize to the downstream tasks. △ Less

Submitted 5 February, 2021; originally announced February 2021.

Comments: Copyright 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2101.00796 [pdf, other]

A Reduced Codebook and Re-Interpolation Approach for Enhancing Quality in Chroma Subsampling

Authors: Kuo-Liang Chung, Chen-Wei Kao

Abstract: Prior to encoding RGB full-color images or Bayer color filter array (CFA) images, chroma subsampling is a necessary and crucial step at the server side. In this paper, we first propose a flow diagram approach to analyze the coordinate-inconsistency (CI) problem and the upsampling process-inconsistency (UPI) problem existing in the traditional and state-of-the-art chroma subsampling methods under t… ▽ More Prior to encoding RGB full-color images or Bayer color filter array (CFA) images, chroma subsampling is a necessary and crucial step at the server side. In this paper, we first propose a flow diagram approach to analyze the coordinate-inconsistency (CI) problem and the upsampling process-inconsistency (UPI) problem existing in the traditional and state-of-the-art chroma subsampling methods under the current coding environment. In addition, we explain why the two problems degrade the quality of the reconstructed images. Next, we propose a reduced codebook and re-interpolation (RCRI) approach to solve the two problems for enhancing the quality of the reconstructed images. Based on the testing RGB full-color images and Bayer CFA images, the comprehensive experimental results demonstrated at least 1.4 dB and 2.4 dB quality improvement effects, respectively, of our RCRI approach against the CI and UPI problems for the traditional and state-of-the-art chroma subsampling methods. △ Less

Submitted 4 January, 2021; originally announced January 2021.

arXiv:2010.06676 [pdf, other]

On Front-end Gain Invariant Modeling for Wake Word Spotting

Authors: Yixin Gao, Noah D. Stein, Chieh-Chi Kao, Yunliang Cai, Ming Sun, Tao Zhang, Shiv Vitaladevuni

Abstract: Wake word (WW) spotting is challenging in far-field due to the complexities and variations in acoustic conditions and the environmental interference in signal transmission. A suite of carefully designed and optimized audio front-end (AFE) algorithms help mitigate these challenges and provide better quality audio signals to the downstream modules such as WW spotter. Since the WW model is trained wi… ▽ More Wake word (WW) spotting is challenging in far-field due to the complexities and variations in acoustic conditions and the environmental interference in signal transmission. A suite of carefully designed and optimized audio front-end (AFE) algorithms help mitigate these challenges and provide better quality audio signals to the downstream modules such as WW spotter. Since the WW model is trained with the AFE-processed audio data, its performance is sensitive to AFE variations, such as gain changes. In addition, when deploying to new devices, the WW performance is not guaranteed because the AFE is unknown to the WW model. To address these issues, we propose a novel approach to use a new feature called $Δ$LFBE to decouple the AFE gain variations from the WW model. We modified the neural network architectures to accommodate the delta computation, with the feature extraction module unchanged. We evaluate our WW models using data collected from real household settings and showed the models with the $Δ$LFBE is robust to AFE gain changes. Specifically, when AFE gain changes up to $\pm$12dB, the baseline CNN model lost up to relative 19.0% in false alarm rate or 34.3% in false reject rate, while the model with $Δ$LFBE demonstrates no performance loss. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: In Proc. Interspeech 2020

arXiv:2009.01759 [pdf, other]

Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging

Authors: Chun-Chieh Chang, Chieh-Chi Kao, Ming Sun, Chao Wang

Abstract: Knowledge Distillation (KD) is a popular area of research for reducing the size of large models while still maintaining good performance. The outputs of larger teacher models are used to guide the training of smaller student models. Given the repetitive nature of acoustic events, we propose to leverage this information to regulate the KD training for Audio Tagging. This novel KD method, "Intra-Utt… ▽ More Knowledge Distillation (KD) is a popular area of research for reducing the size of large models while still maintaining good performance. The outputs of larger teacher models are used to guide the training of smaller student models. Given the repetitive nature of acoustic events, we propose to leverage this information to regulate the KD training for Audio Tagging. This novel KD method, "Intra-Utterance Similarity Preserving KD" (IUSP), shows promising results for the audio tagging task. It is motivated by the previously published KD method: "Similarity Preserving KD" (SP). However, instead of preserving the pairwise similarities between inputs within a mini-batch, our method preserves the pairwise similarities between the frames of a single input utterance. Our proposed KD method, IUSP, shows consistent improvements over SP across student models of different sizes on the DCASE 2019 Task 5 dataset for audio tagging. There is a 27.1% to 122.4% percent increase in improvement of micro AUPRC over the baseline relative to SP's improvement of over the baseline. △ Less

Submitted 3 September, 2020; originally announced September 2020.

Comments: Accepted to Interspeech 2020

arXiv:2008.03350 [pdf, other]

A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling

Authors: Chieh-Chi Kao, Bowen Shi, Ming Sun, Chao Wang

Abstract: This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-kn… ▽ More This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set, and 6.0% on the eval set in terms of absolute values. For weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set, and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data. △ Less

Submitted 7 August, 2020; originally announced August 2020.

Comments: Accepted by Interspeech 2020

arXiv:2002.09143 [pdf, other]

Few-shot acoustic event detection via meta-learning

Authors: Bowen Shi, Ming Sun, Krishna C. Puvvada, Chieh-Chi Kao, Spyros Matsoukas, Chao Wang

Abstract: We study few-shot acoustic event detection (AED) in this paper. Few-shot learning enables detection of new events with very limited labeled data. Compared to other research areas like computer vision, few-shot learning for audio recognition has been under-studied. We formulate few-shot AED problem and explore different ways of utilizing traditional supervised methods for this setting as well as a… ▽ More We study few-shot acoustic event detection (AED) in this paper. Few-shot learning enables detection of new events with very limited labeled data. Compared to other research areas like computer vision, few-shot learning for audio recognition has been under-studied. We formulate few-shot AED problem and explore different ways of utilizing traditional supervised methods for this setting as well as a variety of meta-learning approaches, which are conventionally used to solve few-shot classification problem. Compared to supervised baselines, meta-learning models achieve superior performance, thus showing its effectiveness on generalization to new audio events. Our analysis including impact of initialization and domain discrepancy further validate the advantage of meta-learning approaches in few-shot AED. △ Less

Submitted 21 February, 2020; originally announced February 2020.

Comments: ICASSP 2020

arXiv:2002.06279 [pdf, other]

A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification

Authors: Chieh-Chi Kao, Ming Sun, Weiran Wang, Chao Wang

Abstract: Acoustic event classification (AEC) and acoustic event detection (AED) refer to the task of detecting whether specific target events occur in audios. As long short-term memory (LSTM) leads to state-of-the-art results in various speech related tasks, it is employed as a popular solution for AEC as well. This paper focuses on investigating the dynamics of LSTM model on AEC tasks. It includes a detai… ▽ More Acoustic event classification (AEC) and acoustic event detection (AED) refer to the task of detecting whether specific target events occur in audios. As long short-term memory (LSTM) leads to state-of-the-art results in various speech related tasks, it is employed as a popular solution for AEC as well. This paper focuses on investigating the dynamics of LSTM model on AEC tasks. It includes a detailed analysis on LSTM memory retaining, and a benchmarking of nine different pooling methods on LSTM models using 1.7M generated mixture clips of multiple events with different signal-to-noise ratios. This paper focuses on understanding: 1) utterance-level classification accuracy; 2) sensitivity to event position within an utterance. The analysis is done on the dataset for the detection of rare sound events from DCASE 2017 Challenge. We find max pooling on the prediction level to perform the best among the nine pooling approaches in terms of classification accuracy and insensitivity to event position within an utterance. To authors' best knowledge, this is the first kind of such work focused on LSTM dynamics for AEC tasks. △ Less

Submitted 14 February, 2020; originally announced February 2020.

Comments: Accepted to ICASSP 2020

arXiv:2002.04500 [pdf]

doi 10.1038/s41379-020-0640-y

Artificial Intelligence Assistance Significantly Improves Gleason Grading of Prostate Biopsies by Pathologists

Authors: Wouter Bulten, Maschenka Balkenhol, Jean-Joël Awoumou Belinga, Américo Brilhante, Aslı Çakır, Xavier Farré, Katerina Geronatsiou, Vincent Molinié, Guilherme Pereira, Paromita Roy, Günter Saile, Paulo Salles, Ewout Schaafsma, Joëlle Tschui, Anne-Marie Vos, Hester van Boven, Robert Vink, Jeroen van der Laak, Christina Hulsbergen-van de Kaa, Geert Litjens

Abstract: While the Gleason score is the most important prognostic marker for prostate cancer patients, it suffers from significant observer variability. Artificial Intelligence (AI) systems, based on deep learning, have proven to achieve pathologist-level performance at Gleason grading. However, the performance of such systems can degrade in the presence of artifacts, foreign tissue, or other anomalies. Pa… ▽ More While the Gleason score is the most important prognostic marker for prostate cancer patients, it suffers from significant observer variability. Artificial Intelligence (AI) systems, based on deep learning, have proven to achieve pathologist-level performance at Gleason grading. However, the performance of such systems can degrade in the presence of artifacts, foreign tissue, or other anomalies. Pathologists integrating their expertise with feedback from an AI system could result in a synergy that outperforms both the individual pathologist and the system. Despite the hype around AI assistance, existing literature on this topic within the pathology domain is limited. We investigated the value of AI assistance for grading prostate biopsies. A panel of fourteen observers graded 160 biopsies with and without AI assistance. Using AI, the agreement of the panel with an expert reference standard significantly increased (quadratically weighted Cohen's kappa, 0.799 vs 0.872; p=0.018). Our results show the added value of AI systems for Gleason grading, but more importantly, show the benefits of pathologist-AI synergy. △ Less

Submitted 11 February, 2020; originally announced February 2020.

Comments: 21 pages, 5 figures

Journal ref: Modern Pathology, Available online 5 August 2020

arXiv:1912.10323 [pdf, ps, other]

Integral quadratic constraints for asynchronous sample-and-hold links

Authors: Michael Cantoni, Chung-Yao Kao, Mark A. Fabbro

Abstract: A model is proposed for a class of asynchronous sample-and-hold operators that is relevant in the analysis of embedded and networked systems. The model is parametrized by characteristics of the corresponding time-varying input-output delay. Uncertainty in the relationship between the timing of zero-order-hold update events at the output and the possibly aperiodic sampling events at the input means… ▽ More A model is proposed for a class of asynchronous sample-and-hold operators that is relevant in the analysis of embedded and networked systems. The model is parametrized by characteristics of the corresponding time-varying input-output delay. Uncertainty in the relationship between the timing of zero-order-hold update events at the output and the possibly aperiodic sampling events at the input means that the delay does not always reset to a fixed value. This is distinct from the well-studied synchronous case in which the delay intermittently resets to zero at output update times. The main result provides a family of integral quadratic constraints that covers the proposed model. To demonstrate an application of this result, robust $\mathbf{L}_2$ stability and performance certificates are devised for an asynchronous sampled-data implementation of a feedback loop around given linear time-invariant continuous-time open-loop dynamics. Numerical examples are also presented. △ Less

Submitted 1 January, 2020; v1 submitted 21 December, 2019; originally announced December 2019.

arXiv:1907.07980 [pdf, other]

doi 10.1016/S1470-2045(19)30739-9

Automated Gleason Grading of Prostate Biopsies using Deep Learning

Authors: Wouter Bulten, Hans Pinckaers, Hester van Boven, Robert Vink, Thomas de Bel, Bram van Ginneken, Jeroen van der Laak, Christina Hulsbergen-van de Kaa, Geert Litjens

Abstract: The Gleason score is the most important prognostic marker for prostate cancer patients but suffers from significant inter-observer variability. We developed a fully automated deep learning system to grade prostate biopsies. The system was developed using 5834 biopsies from 1243 patients. A semi-automatic labeling technique was used to circumvent the need for full manual annotation by pathologists.… ▽ More The Gleason score is the most important prognostic marker for prostate cancer patients but suffers from significant inter-observer variability. We developed a fully automated deep learning system to grade prostate biopsies. The system was developed using 5834 biopsies from 1243 patients. A semi-automatic labeling technique was used to circumvent the need for full manual annotation by pathologists. The developed system achieved a high agreement with the reference standard. In a separate observer experiment, the deep learning system outperformed 10 out of 15 pathologists. The system has the potential to improve prostate cancer prognostics by acting as a first or second reader. △ Less

Submitted 18 July, 2019; originally announced July 2019.

Comments: 13 pages, 6 figures

Journal ref: The Lancet Oncology, Available online 8 January 2020

arXiv:1907.01448 [pdf, other]

Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification

Authors: Chieh-Chi Kao, Ming Sun, Yixin Gao, Shiv Vitaladevuni, Chao Wang

Abstract: This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does… ▽ More This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does not fit acoustic applications well since the meaning of a specific 2D kernel varies a lot along the feature axis in an input feature map. We propose a sub-band CNN architecture to apply different convolutional kernels on each feature sub-band, which makes the overall computation more efficient. Experimental results show that the computational efficiency brought by sub-band CNN is more beneficial for small-footprint models. Compared to a baseline full band CNN for spoken term classification on a publicly available Speech Commands dataset, the proposed sub-band CNN architecture reduces the computation by 39.7% on commands classification, and 49.3% on digits classification with accuracy maintained. △ Less

Submitted 2 July, 2019; originally announced July 2019.

Comments: Accepted by Interspeech 2019

arXiv:1907.00873 [pdf, ps, other]

Compression of Acoustic Event Detection Models With Quantized Distillation

Authors: Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

Abstract: Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems. Recently deep neural network significantly advances this field and reduces detection errors to a large scale. However how to efficiently execute deep models in AED has received much less attention. Meanwhile state-of-the-art AED models are based on lar… ▽ More Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems. Recently deep neural network significantly advances this field and reduces detection errors to a large scale. However how to efficiently execute deep models in AED has received much less attention. Meanwhile state-of-the-art AED models are based on large deep models, which are computational demanding and challenging to deploy on devices with constrained computational resources. In this paper, we present a simple yet effective compression approach which jointly leverages knowledge distillation and quantization to compress larger network (teacher model) into compact network (student model). Experimental results show proposed technique not only lowers error rate of original compact network by 15% through distillation but also further reduces its model size to a large extent (2% of teacher, 12% of full-precision student) through quantization. △ Less

Submitted 1 July, 2019; originally announced July 2019.

Comments: Interspeech 2019

arXiv:1905.00855 [pdf, ps, other]

Compression of Acoustic Event Detection Models with Low-rank Matrix Factorization and Quantization Training

Authors: Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

Abstract: In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models. Our experimental results show this combined compression approach is very effective. For a three-layer long short-term memory (LSTM) based AED model, the original model size can be r… ▽ More In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models. Our experimental results show this combined compression approach is very effective. For a three-layer long short-term memory (LSTM) based AED model, the original model size can be reduced to 1% with negligible loss of accuracy. Our approach enables the feasibility of deploying AED for resource-constraint applications. △ Less

Submitted 2 May, 2019; originally announced May 2019.

Comments: NeuralPS 2018 CDNNRIA workshop

arXiv:1904.12926 [pdf, other]

Semi-supervised Acoustic Event Detection based on tri-training

Authors: Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

Abstract: This paper presents our work of training acoustic event detection (AED) models using unlabeled dataset. Recent acoustic event detectors are based on large-scale neural networks, which are typically trained with huge amounts of labeled data. Labels for acoustic events are expensive to obtain, and relevant acoustic event audios can be limited, especially for rare events. In this paper we leverage an… ▽ More This paper presents our work of training acoustic event detection (AED) models using unlabeled dataset. Recent acoustic event detectors are based on large-scale neural networks, which are typically trained with huge amounts of labeled data. Labels for acoustic events are expensive to obtain, and relevant acoustic event audios can be limited, especially for rare events. In this paper we leverage an Internet-scale unlabeled dataset with potential domain shift to improve the detection of acoustic events. Based on the classic tri-training approach, our proposed method shows accuracy improvement over both the supervised training baseline, and semisupervised self-training set-up, in all pre-defined acoustic event detection tasks. As our approach relies on ensemble models, we further show the improvements can be distilled to a single model via knowledge distillation, with the resulting single student model maintaining high accuracy of teacher ensemble models. △ Less

Submitted 29 April, 2019; originally announced April 2019.

Comments: 5 pages

arXiv:1808.06676 [pdf, other]

A simple model for detection of rare sound events

Authors: Weiran Wang, Chieh-chi Kao, Chao Wang

Abstract: We propose a simple recurrent model for detecting rare sound events, when the time boundaries of events are available for training. Our model optimizes the combination of an utterance-level loss, which classifies whether an event occurs in an utterance, and a frame-level loss, which classifies whether each frame corresponds to the event when it does occur. The two losses make use of a shared vecto… ▽ More We propose a simple recurrent model for detecting rare sound events, when the time boundaries of events are available for training. Our model optimizes the combination of an utterance-level loss, which classifies whether an event occurs in an utterance, and a frame-level loss, which classifies whether each frame corresponds to the event when it does occur. The two losses make use of a shared vectorial representation the event, and are connected by an attention mechanism. We demonstrate our model on Task 2 of the DCASE 2017 challenge, and achieve competitive performance. △ Less

Submitted 20 August, 2018; originally announced August 2018.

Comments: Accepted by Interspeech 2018

arXiv:1808.06627 [pdf, other]

R-CRNN: Region-based Convolutional Recurrent Neural Network for Audio Event Detection

Authors: Chieh-Chi Kao, Weiran Wang, Ming Sun, Chao Wang

Abstract: This paper proposes a Region-based Convolutional Recurrent Neural Network (R-CRNN) for audio event detection (AED). The proposed network is inspired by Faster-RCNN, a well known region-based convolutional network framework for visual object detection. Different from the original Faster-RCNN, a recurrent layer is added on top of the convolutional network to capture the long-term temporal context fr… ▽ More This paper proposes a Region-based Convolutional Recurrent Neural Network (R-CRNN) for audio event detection (AED). The proposed network is inspired by Faster-RCNN, a well known region-based convolutional network framework for visual object detection. Different from the original Faster-RCNN, a recurrent layer is added on top of the convolutional network to capture the long-term temporal context from the extracted high level features. While most of the previous works on AED generate predictions at frame level first, and then use post-processing to predict the onset/offset timestamps of events from a probability sequence; the proposed method generates predictions at event level directly and can be trained end-to-end with a multitask loss, which optimizes the classification and localization of audio events simultaneously. The proposed method is tested on DCASE 2017 Challenge dataset. To the best of our knowledge, R-CRNN is the best performing single-model method among all methods without using ensembles both on development and evaluation sets. Compared to the other region-based network for AED (R-FCN) with an event-based error rate (ER) of 0.18 on the development set, our method reduced the ER to half. △ Less

Submitted 20 August, 2018; originally announced August 2018.

Comments: Accepted by Interspeech 2018

Showing 1–27 of 27 results for author: Kao, C