Search | arXiv e-print repository

arXiv:2405.19380 [pdf, other]

Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret

Authors: Yeoneung Kim, Gihun Kim, Insoon Yang

Abstract: We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a meticulously designed preconditioner as well as a simple excitation mechanism. We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic restrictive assumptions on parameter sets that are often used in the literature.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 61 pages, 6 figures

arXiv:2405.13413 [pdf, other]

Boosted Neural Decoders: Achieving Extreme Reliability of LDPC Codes for 6G Networks

Authors: Hee-Youl Kwak, Dae-Young Yun, Yongjune Kim, Sang-Hyo Kim, Jong-Seon No

Abstract: Ensuring extremely high reliability is essential for channel coding in 6G networks. The next generation of ultra-reliable and low-latency communications (xURLLC) scenario within 6G networks requires a frame error rate (FER) below $10^{-9}$. However, low-density parity-check (LDPC) codes, the standard in 5G new radio (NR), encounter a challenge known as the error floor phenomenon, which hinders achieving such low rates. To tackle this problem, we introduce an innovative solution: the boosted neural min-sum (NMS) decoder. This decoder operates identically to conventional NMS decoders, but is trained by novel training methods including: i) boosting learning with uncorrected vectors, ii) a block-wise training schedule to address the vanishing gradient issue, iii) dynamic weight sharing to minimize the number of trainable weights, iv) transfer learning to reduce the required sample count, and v) data augmentation to expedite the sampling process. Leveraging these training strategies, the boosted NMS decoder achieves state-of-the-art performance in reducing the error floor as well as superior waterfall performance. Remarkably, we fulfill the 6G xURLLC requirement for 5G LDPC codes without the severe error floor. Additionally, the boosted NMS decoder, once its weights are trained, can perform decoding without additional modules, making it highly practical for immediate application.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 12 pages, 11 figures

arXiv:2405.09193 [pdf, other]

Autonomous Cooperative Levels of Multiple-Heterogeneous Unmanned Vehicle Systems

Authors: Yoo-Bin Bae, Yeong-Ung Kim, Jun-Oh Park, Hyo-Sung Ahn

Abstract: As multiple and heterogeneous unmanned vehicle systems continue to play an increasingly important role in addressing complex missions in the real world, the need for effective cooperation among unmanned vehicles becomes paramount. The concept of autonomous cooperation, wherein unmanned vehicles cooperate without human intervention or human control, offers promising avenues for enhancing the efficiency and adaptability of intelligence of multiple-heterogeneous unmanned vehicle systems. Despite the growing interest in this domain, to the best of the authors' knowledge, there exists a notable lack of comprehensive literature on defining an explicit concept and classifying levels of autonomous cooperation of multiple-heterogeneous unmanned vehicle systems. In this respect, this article aims to define the explicit concept of autonomous cooperation of multiple-heterogeneous unmanned vehicle systems. Furthermore, we provide a novel criterion to assess the technical maturity of the developed unmanned vehicle systems by classifying the autonomous cooperative levels of multiple-heterogeneous unmanned vehicle systems.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2404.15333 [pdf, other]

EB-GAME: A Game-Changer in ECG Heartbeat Anomaly Detection

Authors: JuneYoung Park, Da Young Kim, Yunsoo Kim, Jisu Yoo, Tae Joon Kim

Abstract: Cardiologists use electrocardiograms (ECG) for the detection of arrhythmias. However, continuous monitoring of ECG signals to detect cardiac abnormalities requires significant time and human resources. As a result, several deep learning studies have been conducted in advance for the automatic detection of arrhythmia. These models show relatively high performance in supervised learning, but are not applicable in cases with few training examples. This is because abnormal ECG data is scarce compared to normal data in most real-world clinical settings. Therefore, in this study, GAN-based anomaly detection, i.e., unsupervised learning, was employed to address the issue of data imbalance. This paper focuses on detecting abnormal signals in electrocardiograms (ECGs) using only labels from normal signals as training data. Inspired by self-supervised vision transformers, which learn by dividing images into patches, and masked auto-encoders, known for their effectiveness in patch reconstruction and solving information redundancy, we introduce the ECG Heartbeat Anomaly Detection model, EB-GAME. EB-GAME was trained and validated on the MIT-BIH Arrhythmia Dataset, where it achieved state-of-the-art performance on this benchmark.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.07217 [pdf, other]

Attention-aware Semantic Communications for Collaborative Inference

Authors: Jiwoong Im, Nayoung Kwon, Taewoo Park, Jiheon Woo, Jaeho Lee, Yongjune Kim

Abstract: We propose a communication-efficient collaborative inference framework in the domain of edge inference, focusing on the efficient use of vision transformer (ViT) models. The partitioning strategy of conventional collaborative inference fails to reduce communication cost because of the inherent architecture of ViTs maintaining consistent layer dimensions across the entire transformer encoder. Therefore, instead of employing the partitioning strategy, our framework utilizes a lightweight ViT model on the edge device, with the server deploying a complicated ViT model. To enhance communication efficiency and achieve the classification accuracy of the server model, we propose two strategies: 1) attention-aware patch selection and 2) entropy-aware image transmission. Attention-aware patch selection leverages the attention scores generated by the edge device's transformer encoder to identify and select the image patches critical for classification. This strategy enables the edge device to transmit only the essential patches to the server, significantly improving communication efficiency. Entropy-aware image transmission uses min-entropy as a metric to accurately determine whether to depend on the lightweight model on the edge device or to request the inference from the server model. In our framework, the lightweight ViT model on the edge device acts as a semantic encoder, efficiently identifying and selecting the crucial image information required for the classification task. Our experiments demonstrate that the proposed collaborative inference framework can reduce communication overhead by 68% with only a minimal loss in accuracy compared to the server model on the ImageNet dataset.
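
The two strategies named in this abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function names, the top-k selection rule, and the 0.5-bit threshold are assumptions made only for illustration. The one fixed ingredient is the standard definition of min-entropy, the negative base-2 logarithm of the largest class probability.

```python
import math


def min_entropy(probs):
    """Min-entropy in bits: -log2 of the largest class probability."""
    return -math.log2(max(probs))


def needs_server(probs, threshold_bits=0.5):
    """Hypothetical decision rule: defer to the server model when the
    edge model's prediction is too uncertain (min-entropy above threshold).
    The threshold value is an assumption, not taken from the paper."""
    return min_entropy(probs) > threshold_bits


def select_patches(attention_scores, k):
    """Hypothetical top-k rule: keep the k patches with the highest
    attention scores and return their indices in original order."""
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    edge_probs = [0.4, 0.3, 0.3]          # uncertain edge prediction
    scores = [0.10, 0.70, 0.05, 0.15]     # one attention score per patch
    if needs_server(edge_probs):
        print("send patches:", select_patches(scores, k=2))
    else:
        print("use edge prediction")
```

In this sketch only the selected patch indices would be transmitted, and only when the edge prediction is uncertain; how the paper actually aggregates attention scores or sets the threshold is not stated in the abstract.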

Submitted 31 May, 2024; v1 submitted 23 February, 2024; originally announced April 2024.

arXiv:2404.02592 [pdf]

Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation

Authors: Ye** Jeon, Yunsu Kim, Gary Geunbae Lee

Abstract: Contemporary neural speech synthesis models have indeed demonstrated remarkable proficiency in synthetic speech generation as they have attained a level of quality comparable to that of human-produced speech. Nevertheless, it is important to note that these achievements have predominantly been verified within the context of high-resource languages such as English. Furthermore, the Tacotron and FastSpeech variants show substantial pausing errors when applied to the Korean language, which affects speech perception and naturalness. In order to address the aforementioned issues, we propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns. Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences, despite its training on short audio clips. Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, thus confirming model performance.

Submitted 3 April, 2024; originally announced April 2024.

Comments: Accepted to LREC-COLING 2024

arXiv:2404.02477 [pdf, ps, other]

Enhancing Sum-Rate Performance in Constrained Multicell Networks: A Low-Information Exchange Approach

Authors: You** Kim, Jonggyu Jang, Hyun Jong Yang

Abstract: Despite the extensive research on massive MIMO systems for 5G telecommunications and beyond, the reality is that many deployed base stations are equipped with a limited number of antennas rather than supporting massive MIMO configurations. Furthermore, while the cell-less network concept, which eliminates cell boundaries, is under investigation, practical deployments often grapple with significantly limited backhaul connection capacities between base stations. This letter explores techniques to maximize the sum-rate performance within the constraints of these more realistically equipped multicell networks. We propose an innovative approach that dramatically reduces the need for information exchange between base stations to a mere few bits, in stark contrast to conventional methods that require the exchange of hundreds of bits. Our proposed method not only addresses the limitations imposed by current network infrastructure but also showcases significantly improved performance under these constrained conditions.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 5 pages, 12 figures

arXiv:2404.00559 [pdf, other]

Hierarchical Climate Control Strategy for Electric Vehicles with Door-Opening Consideration

Authors: Sanghyeon Nam, Hye** Lee, Youngki Kim, Kyoung hyun Kwak, Kyoungseok Han

Abstract: This study proposes a novel climate control strategy for electric vehicles (EVs) by addressing door-opening interruptions, an overlooked aspect in EV thermal management. We create and validate an EV simulation model that incorporates door-opening scenarios. Three controllers are compared using the simulation model: (i) a hierarchical non-linear model predictive control (NMPC) with a unique coolant dividing layer and a component for cabin air inflow regulation based on door-opening signals; (ii) a single MPC controller; and (iii) a rule-based controller. The hierarchical controller outperforms the other two, reducing door-opening temperature drops in the relevant section by 46.96% and 51.33% compared to the single-layer MPC and rule-based methods, respectively. Additionally, our strategy reduces the maximum temperature gaps between the sections during recovery by 86.4% and 78.7% relative to the single-layer MPC and rule-based approaches, respectively. We believe that this result opens up future possibilities for incorporating the thermal comfort of passengers across all sections within the vehicle.

Submitted 31 March, 2024; originally announced April 2024.

Comments: This paper, intended for presentation at the IEEE Intelligent Vehicles Symposium (IV) 2024, comprises six pages and includes eight figures

arXiv:2403.05136 [pdf, other]

DeRO: Dead Reckoning Based on Radar Odometry With Accelerometers Aided for Robot Localization

Authors: Hoang Viet Do, Yong Hun Kim, Joo Han Lee, Min Ho Lee, ** Woo Song

Abstract: In this paper, we propose a radar odometry structure that directly utilizes radar velocity measurements for dead reckoning while maintaining its ability to update estimations within the Kalman filter framework. Specifically, we employ the Doppler velocity obtained by a 4D Frequency Modulated Continuous Wave (FMCW) radar in conjunction with gyroscope data to calculate poses. This approach helps mitigate high drift resulting from accelerometer biases and double integration. Instead, tilt angles measured by gravitational force are utilized alongside relative distance measurements from radar scan matching for the filter's measurement update. Additionally, to further enhance the system's accuracy, we estimate and compensate for the radar velocity scale factor. The performance of the proposed method is verified through five real-world open-source datasets. The results demonstrate that our approach reduces position error by 47% and rotation error by 52% on average compared to the state-of-the-art radar-inertial fusion method in terms of absolute trajectory error.

Submitted 8 March, 2024; originally announced March 2024.

Comments: 9 pages, 5 figures, 1 table, conference

ACM Class: I.2.9

arXiv:2403.01256 [pdf]

Resilient Microgrid Formation Considering Communication Interruptions

Authors: Jian Zhong, Chen Chen, Young-** Kim, Yuxiong Huang, Mengjie Teng, Yiheng Bian, Zhaohong Bie

Abstract: Distribution system (DS) communication failures following extreme events often degrade monitoring and control functions, thus preventing the acquisition of complete global DS component state information, on which existing post-disaster DS restoration methods are based. This letter proposes methods of inferring the states of DS components in the case of incomplete component state information. By using the known DS information, the operating states of unobservable DS branches and buses can be inferred, providing complete information for DS performance restoration before full communication recovery.

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2402.16998 [pdf, other]

What Do Language Models Hear? Probing for Auditory Representations in Language Models

Authors: Jerry Ngo, Yoon Kim

Abstract: This work explores whether language models encode meaningfully grounded representations of sounds of objects. We learn a linear probe that retrieves the correct text representation of an object given a snippet of audio related to that object, where the sound representation is given by a pretrained audio model. This probe is trained via a contrastive loss that pushes the language representations and sound representations of an object to be close to one another. After training, the probe is tested on its ability to generalize to objects that were not seen during training. Across different language models and audio models, we find that the probe generalization is above chance in many cases, indicating that despite being trained only on raw text, language models encode grounded knowledge of sounds for some objects.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.06463 [pdf, other]

Cardiac ultrasound simulation for autonomous ultrasound navigation

Authors: Abdoul Aziz Amadou, Laura Peralta, Paul Dryburgh, Paul Klein, Kaloian Petkov, Richard James Housden, Vivek Singh, Rui Liao, Young-Ho Kim, Florin Christian Ghesu, Tommaso Mansi, Ronak Rajani, Alistair Young, Kawal Rhode

Abstract: Ultrasound is well-established as an imaging modality for diagnostic and interventional purposes. However, the image quality varies with operator skills as acquiring and interpreting ultrasound images requires extensive training due to the imaging artefacts, the range of acquisition parameters and the variability of patient anatomies. Automating the image acquisition task could improve acquisition reproducibility and quality but training such an algorithm requires large amounts of navigation data, not saved in routine examinations. Thus, we propose a method to generate large amounts of ultrasound images from other modalities and from arbitrary positions, such that this pipeline can later be used by learning algorithms for navigation. We present a novel simulation pipeline which uses segmentations from other modalities, an optimized volumetric data representation and GPU-accelerated Monte Carlo path tracing to generate view-dependent and patient-specific ultrasound images. We extensively validate the correctness of our pipeline with a phantom experiment, where structures' sizes, contrast and speckle noise properties are assessed. Furthermore, we demonstrate its usability to train neural networks for navigation in an echocardiography view classification experiment by generating synthetic images from more than 1000 patients. Networks pre-trained with our simulations achieve significantly superior performance in settings where large real datasets are not available, especially for under-represented classes. The proposed approach allows for fast and accurate patient-specific ultrasound image generation, and its usability for training networks for navigation-related tasks is demonstrated.

Submitted 9 February, 2024; originally announced February 2024.

Comments: 24 pages, 10 figures, 5 tables

ACM Class: I.6.0; I.5.4; J.3

arXiv:2402.05402 [pdf, other]

A State-of-the-art Survey on Full-duplex Network Design

Authors: Yonghwi Kim, Hyung-Joo Moon, Hanju Yoo, Byoungnam Kim, Kai-Kit Wong, Chan-Byoung Chae

Abstract: Full-duplex (FD) technology is gaining popularity for integration into a wide range of wireless networks due to its demonstrated potential in recent studies. In contrast to half-duplex (HD) technology, the implementation of FD in networks necessitates considering inter-node interference (INI) from various network perspectives. When deploying FD technology in networks, several critical factors must be taken into account. These include self-interference (SI) and the requisite SI cancellation (SIC) processes, as well as the selection of multiple user equipment (UE) per time slot. Additionally, INI, including cross-link interference (CLI) and inter-cell interference (ICI), becomes a crucial issue during concurrent uplink (UL) and downlink (DL) transmission and reception, similar to SI. Since most INI is challenging to eliminate, a comprehensive investigation that covers radio resource control (RRC), medium access control (MAC), and the physical layer (PHY) is essential in the context of FD network design, rather than focusing on individual network layers and types. This paper covers state-of-the-art studies, including protocols and documents from 3GPP for FD, MAC protocol, user scheduling, and CLI handling. The methods are also compared through a network-level system simulation based on 3D ray-tracing.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 23 pages, 10 figures, To appear in Proceedings of the IEEE

arXiv:2402.05350 [pdf, other]

Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model

Authors: Junghun Cha, Ali Haider, Seoyun Yang, Hoeyeong **, Subin Yang, A. F. M. Shahab Uddin, Jaehyoung Kim, Soo Ye Kim, Sung-Ho Bae

Abstract: A significant volume of analog information, i.e., documents and images, has been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such content is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies has become an indispensable task for many products, it has not been systematically explored, and to the best of our knowledge, no public datasets are available. In this paper, we define this problem as Descanning and introduce a new high-quality and large-scale dataset named DESCAN-18K. It contains 18K pairs of original and scanned images collected in the wild containing multiple complex degradations. In order to eliminate such complex degradations, we propose a new image restoration model called DescanDiffusion consisting of a color encoder that corrects the global color degradation and a conditional denoising diffusion probabilistic model (DDPM) that removes local degradations. To further improve the generalization ability of DescanDiffusion, we also design a synthetic data generation scheme by reproducing prominent degradations in scanned images. We demonstrate that our DescanDiffusion outperforms other baselines including commercial restoration products, objectively and subjectively, via comprehensive experiments and analyses.

Submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted to AAAI 2024

arXiv:2401.15313 [pdf, other]

Multi-Robot Relative Pose Estimation in SE(2) with Observability Analysis: A Comparison of Extended Kalman Filtering and Robust Pose Graph Optimization

Authors: Kihoon Shin, Hyunjae Sim, Seungwon Nam, Yonghee Kim, Jae Hu, Kwang-Ki K. Kim

Abstract: In this study, we address multi-robot localization issues, with a specific focus on cooperative localization and observability analysis of relative pose estimation. Cooperative localization involves enhancing each robot's information through a communication network and message passing. If odometry data from a target robot can be transmitted to the ego robot, observability of their relative pose estimation can be achieved through range-only or bearing-only measurements, provided both robots have non-zero linear velocities. In cases where odometry data from a target robot are not directly transmitted but estimated by the ego robot, both range and bearing measurements are necessary to ensure observability of relative pose estimation. For ROS/Gazebo simulations, we explore four sensing and communication structures. We compare extended Kalman filtering (EKF) and pose graph optimization (PGO) estimation using different robust loss functions (filtering and smoothing with varying batch sizes of sliding windows) in terms of estimation accuracy. In hardware experiments, two Turtlebot3 equipped with UWB modules are used for real-world inter-robot relative pose estimation, applying both EKF and PGO and comparing their performance.

Submitted 4 February, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

Comments: 20 pages, 21 figures

MSC Class: 93C85; 93E11; 93E24; 90C26; 93E10; 62M20

arXiv:2401.11268 [pdf, other]

Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric

Authors: Golara Javadi, Kamer Ali Yuksel, Yunsu Kim, Thiago Castro Ferreira, Mohamed Al-Badrashiny

Abstract: In the realm of automatic speech recognition (ASR), the quest for models that not only perform with high accuracy but also offer transparency in their decision-making processes is crucial. The potential of quality estimation (QE) metrics is introduced and evaluated as a novel tool to enhance explainable artificial intelligence (XAI) in ASR systems. Through experiments and analyses, the capabilities of the NoRefER (No Reference Error Rate) metric are explored in identifying word-level errors to aid post-editors in refining ASR hypotheses. The investigation also extends to the utility of NoRefER in the corpus-building process, demonstrating its effectiveness in augmenting datasets with insightful annotations. The diagnostic aspects of NoRefER are examined, revealing its ability to provide valuable insights into model behaviors and decision patterns. This has proven beneficial for prioritizing hypotheses in post-editing workflows and fine-tuning ASR models. The findings suggest that NoRefER is not merely a tool for error detection but also a comprehensive framework for enhancing ASR systems' transparency, efficiency, and effectiveness. To ensure the reproducibility of the results, all source code of this study is made publicly available.

Submitted 2 February, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Journal ref: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Seoul, Korea

arXiv:2401.02014 [pdf, other]

Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Authors: Ye** Jeon, Yunsu Kim, Gary Geunbae Lee

Abstract: Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations in adapting to new speakers in out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.

Submitted 5 March, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: Accepted to AAAI 2024

arXiv:2312.03312 [pdf, other]

Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation

Authors: Wonjun Lee, Gary Geunbae Lee, Yunsu Kim

Abstract: This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. Our approach optimizes these two stages to improve speech recognition across languages. We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics, thus improving recognition accuracy. Additionally, we introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation. Experiments on the CommonVoice 12.0 dataset show significant reductions in Word Error Rate (WER) for low-resource languages, highlighting the effectiveness of our approach. This research contributes to the advancements of two-pass ASR systems in low-resource languages, offering the potential for improved cross-lingual transfer learning.

Submitted 6 December, 2023; originally announced December 2023.

Comments: 8 pages, ASRU 2023 Accepted

arXiv:2312.01842 [pdf, other]

Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking

Authors: Jihyun Lee, Ye** Jeon, Wonjun Lee, Yunsu Kim, Gary Geunbae Lee

Abstract: Dialogue state tracking plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research is limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to audio modalities, we introduce a novel PhonemeF1 to capture pronunciation similarity. Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.

Submitted 4 December, 2023; originally announced December 2023.

Comments: Accepted in ASRU 2023

arXiv:2312.01285 [pdf, other]

A Literature Review on the Smart Wheelchair Systems

Authors: Yane Kim, Bharath Velamala, Youngseo Choi, Yu** Kim, Hyunkin Kim, Nishad Kulkarni, Eung-Joo Lee

Abstract: This study offers an in-depth analysis of smart wheelchair (SW) systems, charting their progression from early developments to future innovations. It delves into various Brain-Computer Interface (BCI) systems, including mu rhythm, event-related potential, and steady-state visual evoked potential. The paper addresses challenges in signal categorization, proposing the sparse Bayesian extreme learning machine as an innovative solution. Additionally, it explores the integration of emotional states in BCI systems, the application of alternative control methods such as EMG-based systems, and the deployment of intelligent adaptive interfaces utilizing recurrent quantum neural networks. The study also covers advancements in autonomous navigation, assistance, and mapping, emphasizing their importance in SW systems. The human aspect of SW interaction receives considerable attention, specifically in terms of privacy, physiological factors, and the refinement of control mechanisms. The paper acknowledges the commercial challenges faced, like the limitations of indoor usage and the necessity for user training. For future applications, the research explores the potential of autonomous systems adept at adapting to changing environments and user needs. This exploration includes reinforcement learning and various control methods, such as eye and voice control, to improve adaptability and interaction. The potential integration with smart home technologies, including advanced features such as robotic arms, is also considered, aiming to further enhance user accessibility and independence. Ultimately, this study seeks to provide a thorough overview of SW systems, presenting extensive research to detail their historical evolution, current state, and future prospects.

Submitted 3 December, 2023; originally announced December 2023.

arXiv:2312.00919 [pdf, other]

Rethinking Skip Connections in Spiking Neural Networks with Time-To-First-Spike Coding

Authors: Youngeun Kim, Adar Kahana, Ruokai Yin, Yuhang Li, Panos Stinis, George Em Karniadakis, Priyadarshini Panda

Abstract: Time-To-First-Spike (TTFS) coding in Spiking Neural Networks (SNNs) offers significant advantages in terms of energy efficiency, closely mimicking the behavior of biological neurons. In this work, we delve into the role of skip connections, a widely used concept in Artificial Neural Networks (ANNs), within the domain of SNNs with TTFS coding. Our focus is on two distinct types of skip connection architectures: (1) addition-based skip connections, and (2) concatenation-based skip connections. We find that addition-based skip connections introduce an additional delay in terms of spike timing. On the other hand, concatenation-based skip connections circumvent this delay but produce time gaps between after-convolution and skip connection paths, thereby restricting the effective mixing of information from these two paths. To mitigate these issues, we propose a novel approach involving a learnable delay for skip connections in the concatenation-based skip connection architecture. This approach successfully bridges the time gap between the convolutional and skip branches, facilitating improved information mixing. We conduct experiments on public datasets including MNIST and Fashion-MNIST, illustrating the advantage of the skip connection in TTFS coding architectures. Additionally, we demonstrate the applicability of TTFS coding beyond image recognition tasks and extend it to scientific machine-learning tasks, broadening the potential uses of SNNs.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.17396 [pdf, other]

Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset

Authors: Yu** Jeon, Eunsue Choi, Youngchan Kim, Yunseong Moon, Khalid Omer, Felix Heide, Seung-Hwan Baek

Abstract: Image datasets are essential not only in validating existing methods in computer vision but also in developing new methods. Most existing image datasets focus on trichromatic intensity images to mimic human vision. However, polarization and spectrum, the wave properties of light that animals in harsh environments and with limited brain capacity often rely on, remain underrepresented in existing datasets. Although spectro-polarimetric datasets exist, these datasets have insufficient object diversity, limited illumination conditions, linear-only polarization data, and inadequate image count. Here, we introduce two spectro-polarimetric datasets: trichromatic Stokes images and hyperspectral Stokes images. These novel datasets encompass both linear and circular polarization; they introduce multiple spectral channels; and they feature a broad selection of real-world scenes. With our dataset in hand, we analyze the spectro-polarimetric image statistics, develop efficient representations of such high-dimensional data, and evaluate spectral dependency of shape-from-polarization methods. As such, the proposed dataset promises a foundation for data-driven spectro-polarimetric imaging and vision research. Dataset and code will be publicly available.

Submitted 30 November, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.07227 [pdf, other]

CARTOS: A Charging-Aware Real-Time Operating System for Intermittent Batteryless Devices

Authors: Mohsen Karimi, Yidi Wang, Youngbin Kim, Yoo** Lim, Hyoseung Kim

Abstract: This paper presents CARTOS, a charging-aware real-time operating system designed to enhance the functionality of intermittently-powered batteryless devices (IPDs) for various Internet of Things (IoT) applications. While IPDs offer significant advantages such as extended lifespan and operability in extreme environments, they pose unique challenges, including the need to ensure forward progress of program execution amidst variable energy availability and maintaining reliable real-time behavior during power disruptions. To address these challenges, CARTOS introduces a mixed-preemption scheduling model that classifies tasks into computational and peripheral tasks, and ensures their efficient and timely execution by adopting just-in-time checkpointing for divisible computation tasks and uninterrupted execution for indivisible peripheral tasks. CARTOS also supports processing chains of tasks with precedence constraints and adapts its scheduling in response to environmental changes to offer continuous execution under diverse conditions. CARTOS is implemented with new APIs and components added to FreeRTOS but is designed for portability to other embedded RTOSs. Through real hardware experiments and simulations, CARTOS exhibits superior performance over state-of-the-art methods, demonstrating that it can serve as a practical platform for developing resilient, real-time sensing applications on IPDs.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.04753 [pdf, other]

1SPU: 1-step Speech Processing Unit

Authors: Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Antonio Moreno Daniel, Srinivas Bangalore, Andrej Ljolje, Ben Stern

Abstract: Recent studies have made some progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. We propose 1SPU, a 1-step Speech Processing Unit which can recognize speech events (e.g., speaker change) or an NL event (Intent, Emotion) while also transcribing vocal content. It extends the E2E automatic speech recognition (ASR) system's vocabulary by adding a set of unused placeholder symbols, conceptually akin to the <pad> tokens used in sequence modeling. These placeholders are then assigned to represent semantic events (in form of tags) and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements on the SLUE benchmark and yield results that are on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system's proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.

Submitted 10 December, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: Accepted at International Conference on Natural Language Processing 2023

arXiv:2310.14506 [pdf, other]

Label Space Partition Selection for Multi-Object Tracking Using Two-Layer Partitioning

Authors: Ji Youn Lee, Changbeom Shim, Hoa Van Nguyen, Tran Thien Dat Nguyen, Hyun** Choi, Youngho Kim

Abstract: Estimating the trajectories of multi-objects poses a significant challenge due to data association ambiguity, which leads to a substantial increase in computational requirements. To address such problems, a divide-and-conquer manner has been employed with parallel computation. In this strategy, distinguished objects that have unique labels are grouped based on their statistical dependencies, the intersection of predicted measurements. Several geometry approaches have been used for label grouping since finding all intersected label pairs is clearly infeasible for large-scale tracking problems. This paper proposes an efficient implementation of label grouping for label-partitioned generalized labeled multi-Bernoulli filter framework using a secondary partitioning technique. This allows for parallel computation in the label graph indexing step, avoiding generating and eliminating duplicate comparisons. Additionally, we compare the performance of the proposed technique with several efficient spatial searching algorithms. The results demonstrate the superior performance of the proposed approach on large-scale data sets, enabling scalable trajectory estimation.

Submitted 22 October, 2023; originally announced October 2023.

Comments: 6 pages, 4 figures

arXiv:2310.07654 [pdf, other]

Audio-Visual Neural Syntax Acquisition

Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.06546 [pdf, other]

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion

Authors: Haeyun Choi, Jio Gim, Yuho Lee, Youngin Kim, Young-Joo Suh

Abstract: This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality due to their reliance on a carefully designed bottleneck structure. Moreover, models relying solely on self-reconstruction loss struggled with reproducing different speakers' voices. To address these issues, we suggested a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are utilized during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to a zero-shot conversion. Our model outperforms existing state-of-the-art results in both subjective and objective evaluations. Furthermore, it facilitates cross-lingual voice conversions and enhances the quality of synthesized speech.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2309.14741 [pdf, other]

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

Authors: Hee-Soo Heo, KiHyun Nam, Bong-** Lee, Youngki Kwon, Minjae Lee, You ** Kim, Joon Son Chung

Abstract: In the field of speaker verification, session or channel variability poses a significant challenge. While many contemporary methods aim to disentangle session information from speaker embeddings, we introduce a novel approach using an additional embedding to represent the session information. This is achieved by training an auxiliary network appended to the speaker embedding extractor which remains fixed in this training process. This results in two similarity scores: one for the speaker information and one for the session information. The latter score acts as a compensator for the former that might be skewed due to session variations. Our extensive experiments demonstrate that session information can be effectively compensated without retraining of the embedding extractor.

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2309.14668 [pdf]

Depolarized Holography with Polarization-multiplexing Metasurface

Authors: Seung-Woo Nam, Young** Kim, Dongyeon Kim, Yoonchan Jeong

Abstract: The evolution of computer-generated holography (CGH) algorithms has prompted significant improvements in the performances of holographic displays. Nonetheless, they start to encounter a limited degree of freedom in CGH optimization and physical constraints stemming from the coherent nature of holograms. To surpass the physical limitations, we consider polarization as a new degree of freedom by utilizing a novel optical platform called metasurface. Polarization-multiplexing metasurfaces enable incoherent-like behavior in holographic displays due to the mutual incoherence of orthogonal polarization states. We leverage this unique characteristic of a metasurface by integrating it into a holographic display and exploiting polarization diversity to bring an additional degree of freedom for CGH algorithms. To minimize the speckle noise while maximizing the image quality, we devise a fully differentiable optimization pipeline by taking into account the metasurface proxy model, thereby jointly optimizing spatial light modulator phase patterns and geometric parameters of metasurface nanostructures. We evaluate the metasurface-enabled depolarized holography through simulations and experiments, demonstrating its ability to reduce speckle noise and enhance image quality.

Submitted 26 September, 2023; originally announced September 2023.

Comments: 15 pages, 13 figures, to be published in SIGGRAPH Asia 2023

arXiv:2309.12306 [pdf, other]

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

Authors: Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You ** Kim, Youngjoon Jang, Joon Son Chung

Abstract: The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2309.00372 [pdf, other]

On the Localization of Ultrasound Image Slices within Point Distribution Models

Authors: Lennart Bastian, Vincent Bürgin, Ha Young Kim, Alexander Baumann, Benjamin Busam, Mahdi Saleh, Nassir Navab

Abstract: Thyroid disorders are most commonly diagnosed using high-resolution Ultrasound (US). Longitudinal nodule tracking is a pivotal diagnostic protocol for monitoring changes in pathological thyroid morphology. This task, however, imposes a substantial cognitive load on clinicians due to the inherent challenge of maintaining a mental 3D reconstruction of the organ. We thus present a framework for automated US image slice localization within a 3D shape representation to ease how such sonographic diagnoses are carried out. Our proposed method learns a common latent embedding space between US image patches and the 3D surface of an individual's thyroid shape, or a statistical aggregation in the form of a statistical shape model (SSM), via contrastive metric learning. Using cross-modality registration and Procrustes analysis, we leverage features from our model to register US slices to a 3D mesh representation of the thyroid shape. We demonstrate that our multi-modal registration framework can localize images on the 3D surface topology of a patient-specific organ and the mean shape of an SSM. Experimental results indicate slice positions can be predicted within an average of 1.2 mm of the ground-truth slice location on the patient-specific 3D anatomy and 4.6 mm on the SSM, exemplifying its usefulness for slice localization during sonographic acquisitions. Code is publicly available: https://github.com/vuenc/slice-to-shape

Submitted 1 September, 2023; originally announced September 2023.

Comments: ShapeMI Workshop @ MICCAI 2023; 12 pages 2 figures

arXiv:2308.15791 [pdf, other]

Neural Video Compression with Temporal Layer-Adaptive Hierarchical B-frame Coding

Authors: Yeongwoong Kim, Suyong Bahk, Seungeon Kim, Won Hee Lee, Dokwan Oh, Hui Yong Kim

Abstract: Neural video compression (NVC) is a rapidly evolving video coding research area, with some models achieving superior coding efficiency compared to the latest video coding standard Versatile Video Coding (VVC). In conventional video coding standards, the hierarchical B-frame coding, which utilizes a bidirectional prediction structure for higher compression, had been well-studied and exploited. In NVC, however, limited research has investigated the hierarchical B scheme. In this paper, we propose an NVC model exploiting hierarchical B-frame coding with temporal layer-adaptive optimization. We first extend an existing unidirectional NVC model to a bidirectional model, which achieves -21.13% BD-rate gain over the unidirectional baseline model. However, this model faces challenges when applied to sequences with complex or large motions, leading to performance degradation. To address this, we introduce temporal layer-adaptive optimization, incorporating methods such as temporal layer-adaptive quality scaling (TAQS) and temporal layer-adaptive latent scaling (TALS). The final model with the proposed methods achieves an impressive BD-rate gain of -39.86% against the baseline. It also resolves the challenges in sequences with large or complex motions with up to -49.13% more BD-rate gains than the simple bidirectional extension. This improvement is attributed to the allocation of more bits to lower temporal layers, thereby enhancing overall reconstruction quality with smaller bits. Since our method has little dependency on a specific NVC model architecture, it can serve as a general tool for extending unidirectional NVC models to the ones with hierarchical B-frame coding.

Submitted 5 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.04009 [pdf, other]

Safe Control Synthesis for Multicopter via Control Barrier Function Backstepping

Authors: **rae Kim, Youdan Kim

Abstract: A safe controller for multicopter is proposed using control barrier function. Multicopter dynamics are reformulated to deal with mixed-relative-degree and non-strict-feedback-form dynamics, and a time-varying safe backstepping controller is designed. Despite the time-varying variation, it is proven that the control input can be obtained by solving quadratic programming with affine inequality constraints. The proposed controller does not utilize a cascade control system design, unlike existing studies on the safe control of multicopter. Various safety constraints on angular velocity, total thrust direction, velocity, and position can be considered. Numerical simulation results support that the proposed safe controller does not violate any safety constraints, including those on low- and high-level dynamics.

Submitted 7 August, 2023; originally announced August 2023.

Comments: 6 pages, 2 figures, accepted for IEEE Conference on Decision and Control (CDC) 2023

arXiv:2308.02416 [pdf, other]

Local-Global Temporal Fusion Network with an Attention Mechanism for Multiple and Multiclass Arrhythmia Classification

Authors: Yun Kwan Kim, Minji Lee, Kunwook Jo, Hee Seok Song, Seong-Whan Lee

Abstract: Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms (ECGs). However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local temporal information extraction, (ii) global pattern extraction, and (iii) local-global information fusion with attention to perform arrhythmia detection and classification with a constrained input length. The 10-class and 4-class performances of our approach were assessed by detecting the onset and offset of arrhythmia as an episode and the duration of arrhythmia based on the MIT-BIH arrhythmia database (MITDB) and MIT-BIH atrial fibrillation database (AFDB), respectively. The results were statistically superior to those achieved by the comparison models. To check the generalization ability of the proposed method, an AFDB-trained model was tested on the MITDB, and superior performance was attained compared with that of a state-of-the-art model. The proposed method can capture local-global information and dynamics without incurring information losses. Therefore, arrhythmias can be recognized more accurately, and their occurrence times can be calculated; thus, the clinical field can create more accurate treatment plans by using the proposed method.

Submitted 13 October, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: 14 pages, 6 figures

MSC Class: 68T07; 92C55

arXiv:2308.01025 [pdf]

Error Analysis of CORDIC Processor with FPGA Implementation

Authors: Young-Man Kim

Abstract: The coordinate rotation digital computer (CORDIC) is a shift-add based fast computing algorithm which has been found in many digital signal processing (DSP) applications. In this paper, a detailed error analysis based on mean square error criteria and its implementation on FPGA is presented. Two considered error sources are an angle approximation error and a quantization error due to finite word length in fixed-point number system. The error bound and variance are discussed in theory. The CORDIC algorithm is implemented on FPGA using the Xilinx Zynq-7000 development board called ZedBoard. Those results of theoretical error analysis are practically investigated by implementing it on actual FPGA board. In addition, Matlab is used to provide theoretical value as a baseline model by being set up in double-precision floating-point to compare it with the practical value of errors on FPGA implementation.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 5 pages, 7 Figures

arXiv:2307.13665 [pdf]

FPGA Implementation of Robust Residual Generator

Authors: Y. M. Kim

Abstract: In this paper, one can explicitly see the process of implementing the robust residual generator on digital domain, especially on FPGA. Firstly, the baseline model is developed in double precision floating point format. To develop the baseline model, key parameters such as SNR and detection window length are selected in the identification stage. (Please refer to the uploaded paper because this box doesn't accept more typing beyond this point)

Submitted 25 July, 2023; originally announced July 2023.

Comments: 6 pages, 3 figures

arXiv:2306.16670 [pdf, other]

doi 10.1109/TCSVT.2023.3302858

End-to-End Learnable Multi-Scale Feature Compression for VCM

Authors: Yeongwoong Kim, Hyewon Jeong, Janghyun Yu, Younhee Kim, Jooyoung Lee, Se Yoon Jeong, Hui Yong Kim

Abstract: The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so called video coding for machine (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance instead of human visual quality. In the feature compression track of MPEG-VCM, multi-scale features extracted from images are subject to compression. Recent feature compression works have demonstrated that the versatile video coding (VVC) standard-based approach can achieve a BD-rate reduction of up to 96% against MPEG-VCM feature anchor. However, it is still sub-optimal as VVC was not designed for extracted features but for natural images. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both the end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we integrate the fusion and encoding processes in an interleaved way. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses the latent with a smaller-scale feature. This process is successively performed until the smallest-scale feature is fused and then the encoded latent at the final stage is entropy-coded for transmission. The results show that our model outperforms previous approaches by at least 52% BD-rate reduction and has $\times5$ to $\times27$ times less encoding time for object detection...

Submitted 8 August, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

Comments: 13 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology

arXiv:2306.13020 [pdf]

Toward Automated Detection of Microbleeds with Anatomical Scale Localization: A Complete Clinical Diagnosis Support Using Deep Learning

Authors: Jun-Ho Kim, Young Noh, Haejoon Lee, Seul Lee, Woo-Ram Kim, Koung Mi Kang, Eung Yeop Kim, Mohammed A. Al-masni, Dong-Hyun Kim

Abstract: Cerebral Microbleeds (CMBs) are chronic deposits of small blood products in the brain tissues, which have explicit relation to various cerebrovascular diseases depending on their anatomical location, including cognitive decline, intracerebral hemorrhage, and cerebral infarction. However, manual detection of CMBs is a time-consuming and error-prone process because of their sparse and tiny structural properties. The detection of CMBs is commonly affected by the presence of many CMB mimics that cause a high false-positive rate (FPR), such as calcification and pial vessels. This paper proposes a novel 3D deep learning framework that does not only detect CMBs but also inform their anatomical location in the brain (i.e., lobar, deep, and infratentorial regions). For the CMB detection task, we propose a single end-to-end model by leveraging the U-Net as a backbone with Region Proposal Network (RPN). To significantly reduce the FPs within the same single model, we develop a new scheme, containing Feature Fusion Module (FFM) that detects small candidates utilizing contextual information and Hard Sample Prototype Learning (HSPL) that mines CMB mimics and generates additional loss term called concentration loss using Convolutional Prototype Learning (CPL). The anatomical localization task does not only tell to which region the CMBs belong but also eliminate some FPs from the detection task by utilizing anatomical information. The results show that the proposed RPN that utilizes the FFM and HSPL outperforms the vanilla RPN and achieves a sensitivity of 94.66% vs. 93.33% and an average number of false positives per subject (FPavg) of 0.86 vs. 14.73. Also, the anatomical localization task further improves the detection performance by reducing the FPavg to 0.56 while maintaining the sensitivity of 94.66%.

Submitted 22 June, 2023; originally announced June 2023.

Comments: 16 pages, 10 figures, 3 tables

arXiv:2306.12562 [pdf, other]

Neural Spectro-polarimetric Fields

Authors: Youngchan Kim, Wonjoon Jin, Sunghyun Cho, Seung-Hwan Baek

Abstract: Modeling the spatial radiance distribution of light rays in a scene has been extensively explored for applications, including view synthesis. Spectrum and polarization, the wave properties of light, are often neglected due to their integration into three RGB spectral bands and their non-perceptibility to human vision. However, these properties are known to encompass substantial material and geometric information about a scene. Here, we propose to model spectro-polarimetric fields, the spatial Stokes-vector distribution of any light ray at an arbitrary wavelength. We present Neural Spectro-polarimetric Fields (NeSpoF), a neural representation that models the physically-valid Stokes vector at given continuous variables of position, direction, and wavelength. NeSpoF manages inherently noisy raw measurements, showcases memory efficiency, and preserves physically vital signals - factors that are crucial for representing the high-dimensional signal of a spectro-polarimetric field. To validate NeSpoF, we introduce the first multi-view hyperspectral-polarimetric image dataset, comprised of both synthetic and real-world scenes. These were captured using our compact hyperspectral-polarimetric imaging system, which has been calibrated for robustness against system imperfections. We demonstrate the capabilities of NeSpoF on diverse scenes.

Submitted 10 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

arXiv:2306.05291 [pdf]

One shot learning based drivers head movement identification using a millimetre wave radar sensor

Authors: Hong Nhung Nguyen, Seongwook Lee, Tien Tung Nguyen, Yong Hwa Kim

Abstract: Concentration of drivers on traffic is a vital safety issue; thus, monitoring a driver being on road becomes an essential requirement. The key purpose of supervision is to detect abnormal behaviours of the driver and promptly send warnings to him or her for avoiding incidents related to traffic accidents. In this paper, to meet the requirement, based on radar sensors applications, the authors first use a small sized millimetre wave radar installed at the steering wheel of the vehicle to collect signals from different head movements of the driver. The received signals consist of the reflection patterns that change in response to the head movements of the driver. Then, in order to distinguish these different movements, a classifier based on the measured signal of the radar sensor is designed. However, since the collected data set is not large, in this paper, the authors propose One shot learning to classify four cases of driver's head movements. The experimental results indicate that the proposed method can classify the four types of cases according to the various head movements of the driver with a high accuracy reaching up to 100. In addition, the classification performance of the proposed method is significantly better than that of the convolutional neural network model.

Submitted 31 May, 2023; originally announced June 2023.

arXiv:2306.00680 [pdf, other]

Encoder-decoder multimodal speaker change detection

Authors: Jee-weon Jung, Soonshin Seo, Hee-Soo Heo, Geonmin Kim, You Jin Kim, Young-ki Kwon, Minjae Lee, Bong-Jin Lee

Abstract: The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model is built upon two main proposals, a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture. Different to previous MMSCD works that extract speaker embeddings from extremely short audio segments, aligned to a single word, we use a speaker embedding extracted from 1.5s. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription.

Submitted 1 June, 2023; originally announced June 2023.

Comments: 5 pages, accepted for presentation at INTERSPEECH 2023

arXiv:2305.16664 [pdf, other]

doi 10.21437/Interspeech.2023-1679

Score-balanced Loss for Multi-aspect Pronunciation Assessment

Authors: Heejin Do, Yunsu Kim, Gary Geunbae Lee

Abstract: With rapid technological growth, automatic pronunciation assessment has transitioned toward systems that evaluate pronunciation in various aspects, such as fluency and stress. However, despite the highly imbalanced score labels within each aspect, existing studies have rarely tackled the data imbalance problem. In this paper, we suggest a novel loss function, score-balanced loss, to address the problem caused by uneven data, such as bias toward the majority scores. As a re-weighting approach, we assign higher costs when the predicted score is of the minority class, thus, guiding the model to gain positive feedback for sparse score prediction. Specifically, we design two weighting factors by leveraging the concept of an effective number of samples and using the ranks of scores. We evaluate our method on the speechocean762 dataset, which has noticeably imbalanced scores for several aspects. Improved results particularly on such uneven aspects prove the effectiveness of our method.

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted at Interspeech 2023

arXiv:2305.14799 [pdf, other]

doi 10.1109/TPWRS.2023.3334080

Sample-Efficient Learning for a Surrogate Model of Three-Phase Distribution System

Authors: Hoang Tien Nguyen, Young-Jin Kim, Dae-Hyun Choi

Abstract: A surrogate model that accurately predicts distribution system voltages is crucial for reliable smart grid planning and operation. This letter proposes a fixed-point data-driven surrogate modeling method that employs a limited dataset to learn the power-voltage relationship of an unbalanced three-phase distribution system. The proposed surrogate model is designed using a fixed-point load-flow equation, and the stochastic gradient descent method with an automatic differentiation technique is employed to update the parameters of the surrogate model using complex power and voltage samples. Numerical examples in IEEE 13-bus, 37-bus, and 123-bus systems demonstrate that the proposed surrogate model can outperform surrogate models based on the deep neural network and Gaussian process regarding prediction accuracy and sample efficiency.

Submitted 18 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Journal ref: IEEE Transactions on Power Systems, vol. 39, no. 1, pp. 2361-2364, Jan. 2024

arXiv:2305.14732 [pdf, other]

Increasing Electric Vehicles Utilization in Transit Fleets using Learning, Predictions, Optimization, and Automation

Authors: Jacopo Guanetti, Yeojun Kim, Xu Shen, Joel Donham, Santosh Alexander, Bruce Wootton, Francesco Borrelli

Abstract: This work presents a novel hierarchical approach to increase Battery Electric Buses (BEBs) utilization in transit fleets. The proposed approach relies on three key components. A learning-based BEB digital twin cloud platform is used to accurately predict BEB charge consumption on a per vehicle, per driver, and per route basis, and accurately predict the time-to-charge BEB batteries to any level. These predictions are then used by a Predictive Block Assignment module to maximize the BEB fleet utilization. This module computes the optimal BEB daily assignment and charge management strategy. A Depot Parking and Charging Queue Management module is used to autonomously park and charge the vehicles based on their charging demands. The paper discusses the technical approach and benefits of each level in the architecture and concludes with a realistic simulation study. The study shows that if our approach is employed, BEB fleet utilization can increase by 50% compared to state-of-the-art methods.

Submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted at the 35th IEEE Intelligent Vehicles Symposium (IV 2023)

arXiv:2305.08878 [pdf, other]

Learning to Learn Unlearned Feature for Brain Tumor Segmentation

Authors: Seungyub Han, Yeongmo Kim, Seokhyeon Ha, Jungwoo Lee, Seunghong Choi

Abstract: We propose a fine-tuning algorithm for brain tumor segmentation that needs only a few data samples and helps networks not to forget the original tasks. Our approach is based on active learning and meta-learning. One of the difficulties in medical image segmentation is the lack of datasets with proper annotations, because it requires doctors to tag reliable annotation and there are many variants of a disease, such as glioma and brain metastasis, which are the different types of brain tumor and have different structural features in MR images. Therefore, it is impossible to produce the large-scale medical image datasets for all types of diseases. In this paper, we show a transfer learning method from high grade glioma to brain metastasis, and demonstrate that the proposed algorithm achieves balanced parameters for both glioma and brain metastasis domains within a few steps.

Submitted 13 May, 2023; originally announced May 2023.

Comments: Medical Imaging Meets NeurIPS 2018

arXiv:2303.16511 [pdf, other]

Joint unsupervised and supervised learning for context-aware language identification

Authors: Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim

Abstract: Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task only. However, we need additional text labels to train the model to recognize speech, and acquiring the text labels is costly. In order to overcome this problem, we propose context-aware language identification using a combination of unsupervised and supervised learning without any text labels. The proposed method learns the context of speech through masked language modeling (MLM) loss and simultaneously trains to determine the language of the utterance with supervised learning loss. The proposed joint learning was found to reduce the error rate by 15.6% compared to the same structure model trained by supervised-only learning on a subset of the VoxLingua107 dataset consisting of sub-three-second utterances in 11 languages.

Submitted 14 April, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.16205 [pdf]

doi 10.1093/pnasnexus/pgad111

mHealth hyperspectral learning for instantaneous spatiospectral imaging of hemodynamics

Authors: Yuhyun Ji, Sang Mok Park, Semin Kwon, Jung Woo Leem, Vidhya Vijayakrishnan Nair, Yunjie Tong, Young L. Kim

Abstract: Hyperspectral imaging acquires data in both the spatial and frequency domains to offer abundant physical or biological information. However, conventional hyperspectral imaging has intrinsic limitations of bulky instruments, slow data acquisition rate, and spatiospectral tradeoff. Here we introduce hyperspectral learning for snapshot hyperspectral imaging in which sampled hyperspectral data in a small subarea are incorporated into a learning algorithm to recover the hypercube. Hyperspectral learning exploits the idea that a photograph is more than merely a picture and contains detailed spectral information. A small sampling of hyperspectral data enables spectrally informed learning to recover a hypercube from an RGB image. Hyperspectral learning is capable of recovering full spectroscopic resolution in the hypercube, comparable to high spectral resolutions of scientific spectrometers. Hyperspectral learning also enables ultrafast dynamic imaging, leveraging ultraslow video recording in an off-the-shelf smartphone, given that a video comprises a time series of multiple RGB images. To demonstrate its versatility, an experimental model of vascular development is used to extract hemodynamic parameters via statistical and deep-learning approaches. Subsequently, the hemodynamics of peripheral microcirculation is assessed at an ultrafast temporal resolution up to a millisecond, using a conventional smartphone camera. This spectrally informed learning method is analogous to compressed sensing; however, it further allows for reliable hypercube recovery and key feature extractions with a transparent learning algorithm. This learning-powered snapshot hyperspectral imaging method yields high spectral and temporal resolutions and eliminates the spatiospectral tradeoff, offering simple hardware requirements and potential applications of various machine-learning techniques.

Submitted 5 April, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

Journal ref: PNAS Nexus, pgad111, 2023

arXiv:2303.07592 [pdf, other]

Lightweight feature encoder for wake-up word detection based on self-supervised speech representation

Authors: Hyungjun Lim, Younggwan Kim, Kiho Yeom, Eunjoo Seo, Hoodong Lee, Stanley Jungkyu Choi, Honglak Lee

Abstract: Self-supervised learning method that provides generalized speech representations has recently received increasing attention. Wav2vec 2.0 is the most famous example, showing remarkable performance in numerous downstream speech processing tasks. Despite its success, it is challenging to use it directly for wake-up word detection on mobile devices due to its expensive computational cost. In this work, we propose LiteFEW, a lightweight feature encoder for wake-up word detection that preserves the inherent ability of wav2vec 2.0 with a minimum scale. In the method, the knowledge of the pre-trained wav2vec 2.0 is compressed by introducing an auto-encoder-based dimensionality reduction technique and distilled to LiteFEW. Experimental results on the open-source "Hey Snips" dataset show that the proposed method applied to various model structures significantly improves the performance, achieving over 20% of relative improvements with only 64k parameters.

Submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.10186 [pdf, other]

E2E Spoken Entity Extraction for Virtual Agents

Authors: Karan Singla, Yeon-Jun Kim, Srinivas Bangalore

Abstract: In human-computer conversations, extracting entities such as names, street addresses and email addresses from speech is a challenging task. In this paper, we study the impact of fine-tuning pre-trained speech encoders on extracting spoken entities in human-readable form directly from speech without the need for text transcription. We illustrate that such a direct approach optimizes the encoder to transcribe only the entity relevant portions of speech ignoring the superfluous portions such as carrier phrases, or spell name entities. In the context of dialog from an enterprise virtual agent, we demonstrate that the 1-step approach outperforms the typical 2-step approach which first generates lexical transcriptions followed by text-based entity extraction for identifying spoken entities.

Submitted 9 November, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Accepted at EMNLP 2023 Industry Track

arXiv:2301.09058 [pdf, other]

Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification

Authors: Kwangje Baeg, Yeong-Gwan Kim, Young-Sub Han, Byoung-Ki Jeon

Abstract: Recently, researchers have utilized neural network-based speaker embedding techniques in speaker-recognition tasks to identify speakers accurately. However, speaker-discriminative embeddings do not always represent speech features such as age group well. In an embedding model that has been highly trained to capture speaker traits, the task of age group classification is closer to speech information leakage. Hence, to improve age group classification performance, we consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups. In addition, we investigated different types of speaker embeddings to learn and generalize the domain-invariant representations for age groups. Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios and leveraging speaker embeddings for the domain adaptation task.

Submitted 22 January, 2023; originally announced January 2023.

Showing 1–50 of 163 results for author: Kim, Y