Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Authors: Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik

Abstract: Spotting user-defined/flexible keywords represented in text frequently uses an expensive text encoder for joint analysis with an audio encoder in an embedding space, which can suffer from heterogeneous modality representation (i.e., large mismatch) and increased complexity. In this work, we propose a novel architecture to efficiently detect arbitrary keywords based on an audio-compliant text encoder which inherently has homogeneous representation with audio embedding, and it is also much smaller than a compatible text encoder. Our text encoder converts the text to phonemes using a grapheme-to-phoneme (G2P) model, and then to an embedding using representative phoneme vectors, extracted from the paired audio encoder on rich speech datasets. We further augment our method with confusable keyword generation to develop an audio-text embedding verifier with strong discriminative power. Experimental results show that our scheme outperforms the state-of-the-art results on Libriphrase hard dataset, increasing Area Under the ROC Curve (AUC) metric from 84.21% to 92.7% and reducing Equal-Error-Rate (EER) metric from 23.36% to 14.4%.

Submitted 12 August, 2023; originally announced August 2023.

arXiv:2306.05245 [pdf, other]

Matching Latent Encoding for Audio-Text based Keyword Spotting

Authors: Kumari Nishu, Minsik Cho, Devang Naik

Abstract: Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS), which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence using the monotonic alignment of spoken content. Our proposed model consists of an encoder block to get audio and text embeddings, a projector block to project individual embeddings to a common latent space, and an audio-text aligner containing a novel DSP algorithm, which aligns the audio and text embeddings to determine if the spoken content is the same as the text. Experimental results show that our DSP is more effective than other partitioning schemes, and the proposed architecture outperformed the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal-Error-Rate (EER) by 14.4% and 28.9%, respectively.

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2303.08253 [pdf, other]

R2 Loss: Range Restriction Loss for Model Compression and Quantization

Authors: Arnav Kundu, Chungkuk Yoo, Srijan Mishra, Minsik Cho, Saurabh Adya

Abstract: Model quantization and compression are widely used techniques to reduce the usage of computing resources at inference time. While state-of-the-art works have achieved reasonable accuracy with higher bit widths such as 4bit or 8bit, it is still challenging to quantize/compress a model further, e.g., to 1bit or 2bit. To overcome the challenge, we focus on outliers in the weights of a pre-trained model, which disrupt effective lower bit quantization and compression. In this work, we propose Range Restriction Loss (R2-Loss) for building lower bit quantization and compression friendly models by removing outliers from weights during pre-training. By effectively restricting the range of weights, we mold the overall distribution into a tight shape to ensure high quantization bit resolution, thereby allowing model compression and quantization techniques to better utilize their limited numeric representation power. We introduce three different losses: L-inf R2-Loss, its extension Margin R2-Loss, and a new Soft-Min-Max R2-Loss, to be used as an auxiliary loss during full-precision model training. These R2-Loss variants can be used in different cases: L-inf and Margin R2-Loss would be effective for symmetric quantization, while Soft-Min-Max R2-Loss shows better performance for model compression. In our experiments, R2-Loss improves lower bit quantization accuracy with state-of-the-art post-training quantization (PTQ), quantization-aware training (QAT), and model compression techniques. With R2-Loss, MobileNet-V2 2bit weight and 8bit activation PTQ, MobileNet-V1 2bit weight and activation QAT, and ResNet18 1bit weight compression are improved to 59.49% from 50.66%, 59.05% from 55.96%, and 52.58% from 45.54%, respectively.

Submitted 11 February, 2024; v1 submitted 14 March, 2023; originally announced March 2023.
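
Note: the abstract above does not give the formulas for its losses. As a rough illustration of the general idea of range restriction only, the sketch below assumes an L-inf style penalty that adds up the largest absolute weight of each layer and scales it before adding it to the task loss. The function names, the weighting factor `lam`, and the toy numbers are hypothetical and are not taken from the paper.

```python
def linf_range_penalty(layers):
    """Sum, over layers, of the largest absolute weight in that layer.

    `layers` is a list of flat weight lists (a hypothetical stand-in for
    real model tensors). A single outlier weight dominates its layer's term,
    so minimizing this value pushes outliers toward the bulk of the weights.
    """
    return sum(max(abs(w) for w in weights) for weights in layers)


def total_loss(task_loss, layers, lam=0.01):
    """Task loss plus the scaled auxiliary range penalty (lam is assumed)."""
    return task_loss + lam * linf_range_penalty(layers)


# Toy example: the first layer contains an outlier weight of 3.0.
layers = [[0.1, -0.2, 3.0], [0.05, -0.4]]
penalty = linf_range_penalty(layers)
loss = total_loss(1.0, layers, lam=0.1)
```

In this toy case the penalty is driven almost entirely by the outlier 3.0, which is the behavior an auxiliary range term is meant to have.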

arXiv:2301.13729 [pdf, other]

Low-rank LQR Optimal Control Design over Wireless Communication Networks

Authors: Myung Cho, Abdallah Abdallah, Mohammad Rasouli

Abstract: This paper considers an LQR optimal control design problem for distributed control systems with multiple agents. To control large-scale distributed systems such as smart-grid and multi-agent robotic systems over wireless communication networks, it is desired to design a feedback controller by considering various constraints on communication such as limited power, limited energy, or limited communication bandwidth. In this paper, we focus on the reduction of communication energy in an LQR optimal control design problem on wireless communication networks. By considering the characteristic of wireless communication, i.e., that a Radio Frequency (RF) signal can spread in all directions in a broadcast way, we formulate a low-rank LQR optimal control model to reduce the communication energy in the distributed feedback control system. To solve the problem, we propose an Alternating Direction Method of Multipliers (ADMM) based algorithm. Through various numerical experiments, we demonstrate that a feedback controller designed using low-rank structure can outperform the previous work on sparse LQR optimal control design, which focuses on reducing the number of communication links in a network, in terms of energy consumption and system stability margin against noise and error in communication.

Submitted 31 January, 2023; originally announced January 2023.

Comments: 10 pages

arXiv:2212.04396 [pdf, other]

On Attack Detection and Identification for the Cyber-Physical System using Lifted System Model

Authors: Dawei Sun, Minhyun Cho, Inseok Hwang

Abstract: Motivated by the safety and security issues related to cyber-physical systems with potentially multi-rate, delayed, and nonuniformly sampled measurements, we investigate the attack detection and identification using the lifted system model in this paper. Attack detectability and identifiability based on the lifted system model are formally defined and rigorously characterized in a novel approach. The method of checking detectability is discussed, and a residual design problem for attack detection is formulated in a general way. For attack identification, we define and characterize it by generalizing the concept of mode discernibility for switched systems, and a method for identifying the attack is discussed based on the theoretical analysis. An illustrative example of an unmanned aircraft system (UAS) is provided to validate the main results.

Submitted 8 December, 2022; originally announced December 2022.

Comments: It is the preprint of a paper submitted to Automatica

arXiv:2212.02929 [pdf, ps, other]

Deep Neural Networks Based on Iterative Thresholding and Projection Algorithms for Sparse LQR Control Design

Authors: Myung Cho

Abstract: In this paper, we consider an LQR design problem for distributed control systems. For large-scale distributed systems, finding a solution might be computationally demanding due to communications among agents. To this end, we deal with an LQR minimization problem with a regularization for a sparse feedback matrix, which can reduce the number of communication links in the distributed control systems. For this work, we introduce two simple but efficient iterative algorithms: the Iterative Shrinkage Thresholding Algorithm (ISTA) and the Iterative Sparse Projection Algorithm (ISPA). They can give us a trade-off solution between LQR cost and sparsity level on the feedback matrix. Moreover, in order to improve the speed of the proposed algorithms, we design deep neural network models based on the proposed iterative algorithms. Numerical experiments demonstrate that our algorithms can outperform the previous methods using the Alternating Direction Method of Multipliers (ADMM) [1] and the Gradient Support Pursuit (GraSP) [2], and their deep neural network models can improve the performance of the proposed algorithms in convergence speed.

Submitted 6 December, 2022; originally announced December 2022.

Comments: 14 pages
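
Note: the abstract above names ISTA but does not spell out its update. In the usual form of ISTA with an l1 regularizer (an assumption here, since the abstract only says "a regularization for sparse feedback matrix"), each iteration takes a gradient step and then applies the elementwise soft-thresholding (shrinkage) operator. The sketch below shows only that shrinkage step on a small matrix; the names `soft_threshold` and `shrink` and the sample values are illustrative, and the LQR gradient step is omitted.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def shrink(matrix, t):
    """Apply soft-thresholding to every entry of a list-of-lists matrix."""
    return [[soft_threshold(v, t) for v in row] for row in matrix]


# Toy feedback matrix: small entries are zeroed, large ones are reduced.
K = [[1.5, -0.2], [0.05, -2.0]]
K_sparse = shrink(K, 0.5)
```

Entries that end up exactly zero correspond to feedback links that would no longer be needed, which is how a larger threshold trades control cost for sparsity.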

arXiv:2211.03885 [pdf, other]

Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Shuai Liu, Chaoyu Feng, Furui Bai, Xiaotao Wang, Lei Lei, Ziyao Yi, Yan Xiang, Zibin Liu, Shaoqing Li, Keming Shi, Dehui Kong, Ke Xu, Minsu Kwon, Yaqi Wu, Jiesi Zheng, Zhihao Fan, Xun Wu, Feng Zhang, Albert No, Minhyeok Cho, Zewen Chen, Xiaze Zhang, Ran Li, et al. (13 additional authors not shown)

Abstract: The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format Fujifilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20-50 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2210.15425 [pdf, other]

HEiMDaL: Highly Efficient Method for Detection and Localization of wake-words

Authors: Arnav Kundu, Mohammad Samragh Razlighi, Minsik Cho, Priyanka Padmanabhan, Devang Naik

Abstract: Streaming keyword spotting is a widely used solution for activating voice assistants. Deep Neural Networks with Hidden Markov Model (DNN-HMM) based methods have proven to be efficient and widely adopted in this space, primarily because of the ability to detect and identify the start and end of the wake-up word at low compute cost. However, such hybrid systems suffer from loss metric mismatch when the DNN and HMM are trained independently. Sequence discriminative training cannot fully mitigate the loss-metric mismatch due to the inherent Markovian style of the operation. We propose a low-footprint CNN model, called HEiMDaL, to detect and localize keywords in streaming conditions. We introduce an alignment-based classification loss to detect the occurrence of the keyword along with an offset loss to predict the start of the keyword. HEiMDaL shows 73% reduction in detection metrics along with equivalent localization accuracy and with the same memory footprint as existing DNN-HMM style models for a given wake-word.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.13567 [pdf, ps, other]

I see what you hear: a vision-inspired method to localize words

Authors: Mohammad Samragh, Arnav Kundu, Ting-Yao Hu, Minsik Cho, Aman Chadha, Ashish Shrivastava, Oncel Tuzel, Devang Naik

Abstract: This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that audio can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94%, and improves the F1 score by 6.5%.

Submitted 24 October, 2022; originally announced October 2022.

arXiv:2204.09578 [pdf, other]

doi 10.1109/IEDM19574.2021.9720616

Restructuring TCAD System: Teaching Traditional TCAD New Tricks

Authors: Sanghoon Myung, Wonik Jang, Seonghoon **, Jae Myung Choe, Changwook Jeong, Dae Sin Kim

Abstract: Traditional TCAD simulation has succeeded in predicting and optimizing the device performance; however, it still faces a massive challenge - a high computational cost. There have been many attempts to replace TCAD with deep learning, but it has not yet been completely replaced. This paper presents a novel algorithm restructuring the traditional TCAD system. The proposed algorithm predicts three-dimensional (3-D) TCAD simulation in real-time while capturing a variance, enables deep learning and TCAD to complement each other, and fully resolves convergence errors.

Submitted 19 April, 2022; originally announced April 2022.

Comments: In Proceedings of 2021 IEEE International Electron Devices Meeting (IEDM)

Journal ref: Proc. of IEDM 2021, 18.2.1-18.2.4 (2021)

arXiv:2204.02455 [pdf, other]

Improving Voice Trigger Detection with Metric Learning

Authors: Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Abstract: Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to the target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.

Submitted 13 September, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted at InterSpeech 2022

arXiv:2203.04314 [pdf, other]

doi 10.1109/ACCESS.2023.3272665

PyNET-QxQ: An Efficient PyNET Variant for QxQ Bayer Pattern Demosaicing in CMOS Image Sensors

Authors: Minhyeok Cho, Haechang Lee, Hyunwoo Je, Kijeong Kim, Dongil Ryu, Albert No

Abstract: Deep learning-based image signal processor (ISP) models for mobile cameras can generate high-quality images that rival those of professional DSLR cameras. However, their computational demands often make them unsuitable for mobile settings. Additionally, modern mobile cameras employ non-Bayer color filter arrays (CFA) such as Quad Bayer, Nona Bayer, and QxQ Bayer to enhance image quality, yet most existing deep learning-based ISP (or demosaicing) models focus primarily on standard Bayer CFAs. In this study, we present PyNET-QxQ, a lightweight demosaicing model specifically designed for QxQ Bayer CFA patterns, which is derived from the original PyNET. We also propose a knowledge distillation method called progressive distillation to train the reduced network more effectively. Consequently, PyNET-QxQ contains less than 2.5% of the parameters of the original PyNET while preserving its performance. Experiments using QxQ images captured by a prototype QxQ camera sensor show that PyNET-QxQ outperforms existing conventional algorithms in terms of texture and edge reconstruction, despite its significantly reduced parameter count.

Submitted 5 May, 2023; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted by IEEE Access

arXiv:2108.02716 [pdf, other]

Link Quality-Guaranteed Minimum-Cost Millimeter-Wave Base Station Deployment

Authors: Miaomiao Dong, Taejoon Kim, Minsung Cho, Kangeun Lee, Sungrok Yoon

Abstract: Today's growth in the volume of wireless devices coupled with the promise of supporting data-intensive 5G-&-beyond use cases is driving the industry to deploy more millimeter-wave (mmWave) base stations (BSs). Although mmWave cellular systems can carry a larger volume of traffic, dense deployment, in turn, increases the BS installation and maintenance cost, which has been largely ignored in their utilization. In this paper, we present an approach to the problem of mmWave BS deployment in urban environments by minimizing BS deployment cost subject to BS association and user equipment (UE) outage constraints. By exploiting the macro diversity, which enables each UE to be associated with multiple BSs, we derive an expression for UE outage that integrates physical blockage, UE access-limited blockage, and signal-to-interference-plus-noise-ratio (SINR) outage. The minimum-cost BS deployment problem is then formulated as integer non-linear programming (INP). The combinatorial nature of the problem motivates the pursuit of the optimal solution by decomposing the original problem into two separable subproblems, i.e., cell coverage optimization and minimum subset selection subproblems. We provide the optimal solution and theoretical justifications for each subproblem. Simulation results demonstrating the UE outage guarantees of the proposed method are presented. Interestingly, the proposed method produces a unique distribution of the macro-diversity orders over the network that is distinct from other benchmarks.

Submitted 5 August, 2021; originally announced August 2021.

Comments: 16 pages, submitted to IEEE Transactions on Wireless Communications

arXiv:2008.01944 [pdf, ps, other]

Optimal Pooling Matrix Design for Group Testing with Dilution (Row Degree) Constraints

Authors: Jirong Yi, Myung Cho, Xiaodong Wu, Raghu Mudumbai, Weiyu Xu

Abstract: In this paper, we consider the problem of designing an optimal pooling matrix for group testing (for example, for COVID-19 virus testing) with the constraint that no more than $r>0$ samples can be pooled together, which we call the "dilution constraint". This problem translates to designing a matrix with elements being either 0 or 1 that has no more than $r$ '1's in each row and has a certain performance guarantee of identifying anomalous elements. We explicitly give pooling matrix designs that satisfy the dilution constraint and have performance guarantees of identifying anomalous elements, and prove their optimality in saving the largest number of tests, namely showing that the designed matrices have the largest width-to-height ratio among all constraint-satisfying 0-1 matrices.

Submitted 5 August, 2020; originally announced August 2020.

Comments: group testing design, COVID-19

arXiv:1306.2665 [pdf, ps, other]

Precisely Verifying the Null Space Conditions in Compressed Sensing: A Sandwiching Algorithm

Authors: Myung Cho, Weiyu Xu

Abstract: In this paper, we propose new efficient algorithms to verify the null space condition in compressed sensing (CS). Given an $(n-m) \times n$ ($m>0$) CS matrix $A$ and a positive $k$, we are interested in computing $\displaystyle \alpha_k = \max_{\{z: Az=0,z\neq 0\}}\max_{\{K: |K|\leq k\}} \frac{\|z_K\|_{1}}{\|z\|_{1}}$, where $K$ represents subsets of $\{1,2,...,n\}$, and $|K|$ is the cardinality of $K$. In particular, we are interested in finding the maximum $k$ such that $\alpha_k < \frac{1}{2}$. However, computing $\alpha_k$ is known to be extremely challenging. In this paper, we first propose a series of new polynomial-time algorithms to compute upper bounds on $\alpha_k$. Based on these new polynomial-time algorithms, we further design a new sandwiching algorithm, to compute the \emph{exact} $\alpha_k$ with greatly reduced complexity. When needed, this new sandwiching algorithm also achieves a smooth tradeoff between computational complexity and result accuracy. Empirical results show the performance improvements of our algorithm over existing known methods; and our algorithm outputs precise values of $\alpha_k$, with much lower complexity than exhaustive search.

Submitted 9 August, 2013; v1 submitted 11 June, 2013; originally announced June 2013.

Comments: 30 pages
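
Note: the quantity in the abstract above has two nested maximizations. For one fixed nonzero vector z, the inner maximum over index sets K with at most k elements is simply the sum of the k largest absolute entries of z divided by the l1 norm of z. The sketch below computes only that inner step; it is not the paper's sandwiching algorithm, and the hard part (the outer maximum over the null space of A) is not attempted. The function name and sample vector are illustrative.

```python
def top_k_l1_ratio(z, k):
    """Largest fraction of the l1 norm of z carried by any k of its entries."""
    magnitudes = sorted((abs(v) for v in z), reverse=True)
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("z must be nonzero")
    # The best index set K of size at most k picks the k largest magnitudes.
    return sum(magnitudes[:k]) / total


# Example vector (imagined to lie in the null space of some matrix A).
z = [3.0, -1.0, 0.5, -0.5]
ratio_k1 = top_k_l1_ratio(z, 1)
ratio_k2 = top_k_l1_ratio(z, 2)
```

If any null-space vector gives a ratio of at least 1/2 for a given k, the condition from the abstract fails for that k; here the single entry 3.0 already carries more than half of the l1 norm.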

Showing 1–15 of 15 results for author: Cho, M