Search | arXiv e-print repository

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Authors: Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

Abstract: Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.06650 [pdf, other]

Predicting the risk of early-stage breast cancer recurrence using H&E-stained tissue images

Authors: Geongyu Lee, Joonho Lee, Tae-Yeong Kwak, Sun Woo Kim, Youngmee Kwon, Chungyeul Kim, Hyeyoon Chang

Abstract: Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients' risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin stained breast cancer whole slide images labeled with the risk prediction via genomics assays were used, and we obtained sensitivity of 0.857, 0.746, and 0.529 for predicting low, intermediate, and high risk, and specificity of 0.816, 0.803, and 0.972. When compared to the expert pathologist's regional histology grade information, a Pearson's correlation coefficient of 0.61 was obtained. When we inspected the trained model using class activation maps, we found that it actually considered tubule formation and mitotic rate when predicting different risk groups.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 12 pages, 7 figures
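
The per-class sensitivity and specificity figures quoted in the abstract above are the kind of numbers obtained by treating each risk group as a one-vs-rest problem. A minimal sketch of that calculation follows, assuming a three-class confusion matrix; the function name `one_vs_rest_rates` and the counts in `example` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def one_vs_rest_rates(
    confusion: List[List[int]], labels: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (sensitivity, specificity)} from a square confusion matrix.

    confusion[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    rates: Dict[str, Tuple[float, float]] = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                          # class k predicted as k
        fn = sum(confusion[k]) - tp                   # class k predicted as something else
        fp = sum(row[k] for row in confusion) - tp    # other classes predicted as k
        tn = total - tp - fn - fp                     # everything else
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        rates[label] = (sensitivity, specificity)
    return rates


# Hypothetical counts for illustration only (rows: true class, columns: predicted class).
example = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 4],
]
rates = one_vs_rest_rates(example, ["low", "intermediate", "high"])
for name, (sens, spec) in rates.items():
    print(f"{name}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With imbalanced classes, a class can show low sensitivity alongside high specificity, which is the pattern the abstract reports for the high-risk group.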

arXiv:2405.02066 [pdf, other]

WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights

Authors: Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, Sangpil Kim

Abstract: The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods.

Submitted 27 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2403.08187 [pdf, other]

Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children

Authors: Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam

Abstract: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children with SSDs is impractical. We fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as pronounced rather than as existing words. The model was fine-tuned with a speech dataset from 137 children with inadequate speech production pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the pronunciations of the words matched the human annotations with about 90% accuracy. While the model still requires improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation error diagnostic procedures in clinical fields.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 12 pages, 2 figures

ACM Class: I.2.7

arXiv:2402.00977 [pdf, other]

Enhanced fringe-to-phase framework using deep learning

Authors: Won-Hoe Kim, Bongjoong Kim, Hyung-Gun Chi, Jae-Sang Hyun

Abstract: In Fringe Projection Profilometry (FPP), achieving robust and accurate 3D reconstruction with a limited number of fringe patterns remains a challenge in structured light 3D imaging. Conventional methods require a set of fringe images, but using only one or two patterns complicates phase recovery and unwrapping. In this study, we introduce SFNet, a symmetric fusion network that transforms two fringe images into an absolute phase. To enhance output reliability, our framework predicts refined phases by incorporating information from fringe images of a different frequency than those used as input. This allows us to achieve high accuracy with just two images. Comparative experiments and ablation studies validate the effectiveness of our proposed method. The dataset and code are publicly accessible on our project page https://wonhoe-kim.github.io/SFNet.

Submitted 1 February, 2024; originally announced February 2024.

Comments: 35 pages, 13 figures, 6 tables

arXiv:2401.18006 [pdf, other]

EEG-GPT: Exploring Capabilities of Large Language Models for EEG Classification and Interpretation

Authors: Jonathan W. Kim, Ahmed Alaa, Danilo Bernardo

Abstract: In conventional machine learning (ML) approaches applied to electroencephalography (EEG), there is often a limited focus, isolating specific brain activities occurring across disparate temporal scales (from transient spikes in milliseconds to seizures lasting minutes) and spatial scales (from localized high-frequency oscillations to global sleep activity). This siloed approach limits the development of EEG ML models that exhibit multi-scale electrophysiological understanding and classification capabilities. Moreover, typical ML EEG approaches utilize black-box approaches, limiting their interpretability and trustworthiness in clinical contexts. Thus, we propose EEG-GPT, a unifying approach to EEG classification that leverages advances in large language models (LLM). EEG-GPT achieves excellent performance comparable to current state-of-the-art deep learning methods in classifying normal from abnormal EEG in a few-shot learning paradigm utilizing only 2% of training data. Furthermore, it offers the distinct advantages of providing intermediate reasoning steps and coordinating specialist EEG tools across multiple scales in its operation, offering transparent and interpretable step-by-step verification, thereby promoting trustworthiness in clinical contexts.

Submitted 3 February, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

arXiv:2401.13836 [pdf, other]

doi 10.1016/j.conengprac.2024.105841

Machine learning for industrial sensing and control: A survey and practical perspective

Authors: Nathan P. Lawrence, Seshu Kumar Damarla, Jong Woo Kim, Aditya Tulsyan, Faraz Amjad, Kai Wang, Benoit Chachuat, Jong Min Lee, Biao Huang, R. Bhushan Gopaluni

Abstract: With the rise of deep learning, there has been renewed interest within the process industries to utilize data on large-scale nonlinear sensing and control problems. We identify key statistical and machine learning techniques that have seen practical success in the process industries. To do so, we start with hybrid modeling to provide a methodological framework underlying core application areas: soft sensing, process optimization, and control. Soft sensing contains a wealth of industrial applications of statistical and machine learning methods. We quantitatively identify research trends, allowing insight into the most successful techniques in practice. We consider two distinct flavors for data-driven optimization and control: hybrid modeling in conjunction with mathematical programming techniques and reinforcement learning. Throughout these application areas, we discuss their respective industrial requirements and challenges. A common challenge is the interpretability and efficiency of purely data-driven methods. This suggests a need to carefully balance deep learning techniques with domain knowledge. As a result, we highlight ways prior knowledge may be integrated into industrial machine learning applications. The treatment of methods, problems, and applications presented here is poised to inform and inspire practitioners and researchers to develop impactful data-driven sensing, optimization, and control solutions in the process industries.

Submitted 24 January, 2024; originally announced January 2024.

Comments: 48 pages

Journal ref: Control Engineering Practice 2024

arXiv:2312.13313 [pdf, other]

ParamISP: Learned Forward and Inverse ISPs using Camera Parameters

Authors: Woohyeok Kim, Geonu Kim, Junyong Lee, Seungyong Lee, Seung-Hwan Baek, Sunghyun Cho

Abstract: RAW images are rarely shared mainly due to their excessive data size compared to their sRGB counterparts obtained by camera ISPs. Learning the forward and inverse processes of camera ISPs has been recently demonstrated, enabling physically-meaningful RAW-level image processing on input sRGB images. However, existing learning-based ISP methods fail to handle the large variations in the ISP processes with respect to camera parameters such as ISO and exposure time, and have limitations when used for various applications. In this paper, we propose ParamISP, a learning-based method for forward and inverse conversion between sRGB and RAW images, that adopts a novel neural-network module to utilize camera parameters, which is dubbed ParamNet. Given the camera parameters provided in the EXIF data, ParamNet converts them into a feature vector to control the ISP networks. Extensive experiments demonstrate that ParamISP achieves superior RAW and sRGB reconstruction results compared to previous methods and it can be effectively used for a variety of applications such as deblurring dataset synthesis, raw deblurring, HDR reconstruction, and camera-to-camera transfer.

Submitted 14 April, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

arXiv:2311.04468 [pdf]

A human brain atlas of chi-separation for normative iron and myelin distributions

Authors: Kyeongseon Min, Beomseok Sohn, Woo Jung Kim, Chae Jung Park, Soohwa Song, Dong Hoon Shin, Kyung Won Chang, Na-Young Shin, Minjun Kim, Hyeong-Geol Shin, Phil Hyu Lee, Jongho Lee

Abstract: Iron and myelin are primary susceptibility sources in the human brain. These substances are essential for a healthy brain, and their abnormalities are often related to various neurological disorders. Recently, an advanced susceptibility mapping technique, which is referred to as chi-separation, has been proposed, successfully disentangling paramagnetic iron from diamagnetic myelin. This method opened a potential for generating high resolution iron and myelin maps in the brain. Utilizing this technique, this study constructs a normative chi-separation atlas from 106 healthy human brains. The resulting atlas provides detailed anatomical structures associated with the distributions of iron and myelin, clearly delineating subcortical nuclei, thalamic nuclei, and white matter fiber bundles. Additionally, susceptibility values in a number of regions of interest are reported along with age-dependent changes. This atlas may have direct applications such as localization of subcortical structures for deep brain stimulation or high-intensity focused ultrasound and also serve as a valuable resource for future research.

Submitted 2 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 19 pages, 9 figures

arXiv:2310.07663 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747073

Deep Video Inpainting Guided by Audio-Visual Self-Supervision

Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon

Abstract: Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train the audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-visual network is employed as a guider that conveys the prior knowledge of audio-visual correspondence to the video inpainting network. This prior knowledge is transferred through our proposed two novel losses: audio-visual attention loss and audio-visual pseudo-class consistency loss. These two losses further improve the performance of the video inpainting by encouraging the inpainting result to have a high correspondence to its synchronized audio. Experimental results demonstrate that our proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially blinded.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Accepted at ICASSP 2022

arXiv:2309.11988 [pdf, ps, other]

Relaxed Conditions for Parameterized Linear Matrix Inequality in the Form of Nested Fuzzy Summations

Authors: Do Wan Kim, Donghwan Lee

Abstract: The aim of this study is to investigate less conservative conditions for parameterized linear matrix inequalities (PLMIs) that are formulated as nested fuzzy summations. Such PLMIs are commonly encountered in stability analysis and control design problems for Takagi-Sugeno (T-S) fuzzy systems. Utilizing the weighted inequality of arithmetic and geometric means (AM-GM inequality), we develop new, less conservative linear matrix inequalities for the PLMIs. This methodology enables us to efficiently handle the product of membership functions that have intersecting indices. Through empirical case studies, we demonstrate that our proposed conditions produce less conservative results compared to existing approaches in the literature.

Submitted 18 December, 2023; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: This work has been submitted to IEEE Transactions on Systems, Man and Cybernetics: Systems for possible publications
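
For reference, the weighted AM-GM inequality named in the abstract above can be written in its standard textbook form as follows. The abstract does not say how the paper applies it to products of membership functions, so the symbols $x_i$, $w_i$, and $n$ are illustrative only and are not the paper's notation.

```latex
\[
  \sum_{i=1}^{n} w_i x_i \;\ge\; \prod_{i=1}^{n} x_i^{w_i},
  \qquad x_i \ge 0,\quad w_i \ge 0,\quad \sum_{i=1}^{n} w_i = 1 .
\]
```

Equality holds exactly when all $x_i$ with $w_i > 0$ are equal. For $n = 2$ and $w_1 = w_2 = 1/2$ this reduces to the familiar bound $x_1 x_2 \le \tfrac{1}{4}(x_1 + x_2)^2$, which replaces a product by an expression in the sum.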

arXiv:2309.08208 [pdf, other]

HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods

Authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu

Abstract: Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) Hierarchical pooling method progressively reducing the sequence length to eliminate duplicated information (2) Multi-level classification token aggregation method utilizing classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experimental results on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.

Submitted 15 September, 2023; originally announced September 2023.

Comments: Submitted to 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

arXiv:2309.06841 [pdf, ps, other]

On the Local Quadratic Stability of T-S Fuzzy Systems in the Vicinity of the Origin

Authors: Donghwan Lee, Do Wan Kim

Abstract: The main goal of this paper is to introduce new local stability conditions for continuous-time Takagi-Sugeno (T-S) fuzzy systems. These stability conditions are based on linear matrix inequalities (LMIs) in combination with quadratic Lyapunov functions. Moreover, they integrate information on the membership functions at the origin and effectively leverage the linear structure of the underlying nonlinear system in the vicinity of the origin. As a result, the proposed conditions are proved to be less conservative compared to existing methods using fuzzy Lyapunov functions in the literature. Moreover, we establish that the proposed methods offer necessary and sufficient conditions for the local exponential stability of T-S fuzzy systems. The paper also includes discussions on the inherent limitations associated with fuzzy Lyapunov approaches. To demonstrate the theoretical results, we provide comprehensive examples that elucidate the core concepts and validate the efficacy of the proposed conditions.

Submitted 13 September, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2308.07788 [pdf, ps, other]

GIST-AiTeR Speaker Diarization System for VoxCeleb Speaker Recognition Challenge (VoxSRC) 2023

Authors: Dongkeon Park, Ji Won Kim, Kang Ryeol Kim, Do Hyun Lee, Hong Kook Kim

Abstract: This report describes the submission system by the GIST-AiTeR team for the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23) Track 4. Our submission system focuses on implementing diverse speaker diarization (SD) techniques, including ResNet293 and MFA-Conformer with different combinations of segment and hop length. Then, those models are combined into an ensemble model. The ResNet293 and MFA-Conformer models exhibited the diarization error rates (DERs) of 3.65% and 3.83% on VAL46, respectively. The submitted ensemble model provided a DER of 3.50% on VAL46, and consequently, it achieved a DER of 4.88% on the VoxSRC-23 test set.

Submitted 25 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: VoxSRC 2023 Track4

arXiv:2307.16706 [pdf, ps, other]

Continuous-Time Distributed Dynamic Programming for Networked Multi-Agent Markov Decision Processes

Authors: Donghwan Lee, Han-Dong Lim, Do Wan Kim

Abstract: The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where individual agents have access only to their own rewards, lacking insights into the rewards of other agents. Moreover, each agent has the ability to share its parameters with neighboring agents through a communication network, represented by a graph. We first introduce a novel distributed DP, inspired by the distributed optimization method of Wang and Elia. Next, a new distributed DP is introduced through a decoupling process. The convergence of the DP algorithms is proved through systems and control perspectives. The study in this paper sets the stage for new distributed temporal difference learning algorithms.

Submitted 13 June, 2024; v1 submitted 31 July, 2023; originally announced July 2023.

arXiv:2307.10628 [pdf, other]

PAS: Partial Additive Speech Data Augmentation Method for Noise Robust Speaker Verification

Authors: Wonbin Kim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Chan-yeong Lim, Ha-Jin Yu

Abstract: Background noise reduces speech intelligibility and quality, making speaker verification (SV) in noisy environments a challenging task. To improve the noise robustness of SV systems, additive noise data augmentation has been commonly used. In this paper, we propose a new additive noise method, partial additive speech (PAS), which aims to train SV systems to be less affected by noisy environments. The experimental results demonstrate that PAS outperforms traditional additive noise in terms of equal error rates (EER), with relative improvements of 4.64% and 5.01% observed in SE-ResNet34 and ECAPA-TDNN. We also show the effectiveness of the proposed method by analyzing attention modules and visualizing speaker embeddings.

Submitted 20 July, 2023; originally announced July 2023.

Comments: 5 pages, 2 figures, 1 table, accepted to CKAIA2023 as a conference paper

arXiv:2306.14384 [pdf, other]

Multitask Learning for Multiple Recognition Tasks: A Framework for Lower-limb Exoskeleton Robot Applications

Authors: Joonhyun Kim, Seongmin Ha, Dongbin Shin, Seoyeon Ham, Jaepil Jang, Wansoo Kim

Abstract: To control the lower-limb exoskeleton robot effectively, it is essential to accurately recognize user status and environmental conditions. Previous studies have typically addressed these recognition challenges through independent models for each task, resulting in an inefficient model development process. In this study, we propose a Multitask learning approach that can address multiple recognition challenges simultaneously. This approach can enhance data efficiency by enabling knowledge sharing between each recognition model. We demonstrate the effectiveness of this approach using Gait phase recognition (GPR) and Terrain classification (TC) as examples, the most conventional recognition tasks in lower-limb exoskeleton robots. We first created a high-performing GPR model that achieved a Root mean square error (RMSE) value of 2.345 $\pm$ 0.08 and then utilized its knowledge-sharing backbone feature network to learn a TC model with an extremely limited dataset. Using a limited dataset for the TC model allows us to validate the data efficiency of our proposed Multitask learning approach. We compared the accuracy of the proposed TC model against other TC baseline models. The proposed model achieved 99.5 $\pm$ 0.044% accuracy with a limited dataset, outperforming other baseline models, demonstrating its effectiveness in terms of data efficiency. Future research will focus on extending the Multitask learning framework to encompass additional recognition tasks.

Submitted 25 June, 2023; originally announced June 2023.

Comments: Accepted for publication in the Proceedings of the 2023 IEEE International Conference on RO-MAN 2023 BUSAN, 7 pages

arXiv:2306.13020 [pdf]

Toward Automated Detection of Microbleeds with Anatomical Scale Localization: A Complete Clinical Diagnosis Support Using Deep Learning

Authors: Jun-Ho Kim, Young Noh, Haejoon Lee, Seul Lee, Woo-Ram Kim, Koung Mi Kang, Eung Yeop Kim, Mohammed A. Al-masni, Dong-Hyun Kim

Abstract: Cerebral Microbleeds (CMBs) are chronic deposits of small blood products in the brain tissues, which have explicit relation to various cerebrovascular diseases depending on their anatomical location, including cognitive decline, intracerebral hemorrhage, and cerebral infarction. However, manual detection of CMBs is a time-consuming and error-prone process because of their sparse and tiny structural properties. The detection of CMBs is commonly affected by the presence of many CMB mimics that cause a high false-positive rate (FPR), such as calcification and pial vessels. This paper proposes a novel 3D deep learning framework that not only detects CMBs but also reports their anatomical location in the brain (i.e., lobar, deep, and infratentorial regions). For the CMB detection task, we propose a single end-to-end model by leveraging the U-Net as a backbone with Region Proposal Network (RPN). To significantly reduce the FPs within the same single model, we develop a new scheme, containing Feature Fusion Module (FFM) that detects small candidates utilizing contextual information and Hard Sample Prototype Learning (HSPL) that mines CMB mimics and generates an additional loss term called concentration loss using Convolutional Prototype Learning (CPL). The anatomical localization task not only determines which region the CMBs belong to but also eliminates some FPs from the detection task by utilizing anatomical information. The results show that the proposed RPN that utilizes the FFM and HSPL outperforms the vanilla RPN and achieves a sensitivity of 94.66% vs. 93.33% and an average number of false positives per subject (FPavg) of 0.86 vs. 14.73. Also, the anatomical localization task further improves the detection performance by reducing the FPavg to 0.56 while maintaining the sensitivity of 94.66%.

Submitted 22 June, 2023; originally announced June 2023.

Comments: 16 pages, 10 figures, 3 tables

arXiv:2306.06461 [pdf]

Semi-supervsied Learning-based Sound Event Detection using Freuqency Dynamic Convolution with Large Kernel Attention for DCASE Challenge 2023 Task 4

Authors: Ji Won Kim, Sang Won Son, Yoonah Song, Hong Kook Kim, Il Hoon Song, Jeong Eun Lim

Abstract: This report proposes a frequency dynamic convolution (FDY) with a large kernel attention (LKA)-convolutional recurrent neural network (CRNN) with a pre-trained bidirectional encoder representation from audio transformers (BEATs) embedding-based sound event detection (SED) model that employs a mean-teacher and pseudo-label approach to address the challenge of limited labeled data for DCASE 2023 Task 4. The proposed FDY with LKA integrates the FDY and LKA module to effectively capture time-frequency patterns, long-term dependencies, and high-level semantic information in audio signals. The proposed FDY with LKA-CRNN with a BEATs embedding network is initially trained on the entire DCASE 2023 Task 4 dataset using the mean-teacher approach, generating pseudo-labels for weakly labeled, unlabeled, and the AudioSet. Subsequently, the proposed SED model is retrained using the same pseudo-label approach. A subset of these models is selected for submission, demonstrating superior F1-scores and polyphonic SED score performance on the DCASE 2023 Challenge Task 4 validation dataset.

Submitted 10 June, 2023; originally announced June 2023.

Comments: DCASE 2023 Challenge Task 4A, 5 pages

arXiv:2306.00633 [pdf, other]

doi 10.1109/ACCESS.2023.3282786

Low-Cost GNSS Simulators with Wireless Clock Synchronization for Indoor Positioning

Authors: Woohyun Kim, Jiwon Seo

Abstract: In regions where global navigation satellite systems (GNSS) signals are unavailable, such as underground areas and tunnels, GNSS simulators can be deployed for transmitting simulated GNSS signals. Then, a GNSS receiver in the simulator coverage outputs the position based on the received GNSS signals (e.g., Global Positioning System (GPS) L1 signals in this study) transmitted by the corresponding simulator. This approach provides periodic position updates to GNSS users while deploying a small number of simulators without modifying the hardware and software of user receivers. However, the simulator clock should be synchronized to the GNSS satellite clock to generate almost identical signals to the live-sky GNSS signals, which is necessary for seamless indoor and outdoor positioning handover. The conventional clock synchronization method based on the wired connection between each simulator and an outdoor GNSS antenna causes practical difficulty and increases the cost of deploying the simulators. This study proposes a wireless clock synchronization method based on a private time server and time delay calibration. Additionally, we derived the constraints for determining the optimal simulator coverage and separation between adjacent simulators. The positioning performance of the proposed GPS simulator-based indoor positioning system was demonstrated in the underground testbed for a driving vehicle with a GPS receiver and a pedestrian with a smartphone. The average position errors were 3.7 m for the vehicle and 9.6 m for the pedestrian during the field tests with successful indoor and outdoor positioning handovers. Since those errors are within the coverage of each deployed simulator, it is confirmed that the proposed system with wireless clock synchronization can effectively provide periodic position updates to users where live-sky GNSS signals are unavailable.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Submitted to IEEE Access

arXiv:2304.06237 [pdf, other]

Deep learning based ECG segmentation for delineation of diverse arrhythmias

Authors: Chankyu Joung, Mijin Kim, Taejin Paik, Seong-Ho Kong, Seung-Young Oh, Won Kyeong Jeon, Jae-hu Jeon, Joong-Sik Hong, Wan-Joong Kim, Woong Kook, Myung-Jin Cha, Otto van Koert

Abstract: Accurate delineation of key waveforms in an ECG is a critical initial step in extracting relevant features to support the diagnosis and treatment of heart conditions. Although deep learning based methods using a segmentation model to locate the P, QRS, and T waves have shown promising results, their ability to handle signals exhibiting arrhythmia remains unclear. This study builds on existing research by introducing a U-Net-like segmentation model for ECG delineation, with a particular focus on diverse arrhythmias. For this purpose, we curate an internal dataset containing waveform boundary annotations for various arrhythmia types to train and validate our model. Our key contributions include identifying segmentation model failures in different arrhythmia types, developing a robust model using a diverse training set, achieving comparable performance on benchmark datasets, and introducing a classification guided strategy to reduce false P wave predictions for specific arrhythmias. This study advances deep learning based ECG delineation in the context of arrhythmias and highlights its challenges.

Submitted 6 September, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

arXiv:2303.08670 [pdf, other]

Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

Authors: Minsu Kim, Chae Won Kim, Yong Man Ro

Abstract: Forced alignment refers to a technology that time-aligns a given transcription with a corresponding speech. However, as forced alignment technologies have been developed using speech audio, they might fail in alignment when the input speech audio is noise-corrupted or is not accessible. We focus on the fact that there is another component from which the speech can be inferred: the speech video (i.e., talking face video). Since the drawbacks of audio-based forced alignment can be complemented using the visual information when the audio signal is under poor condition, we try to develop a novel video-based forced alignment method. However, different from audio forced alignment, it is challenging to develop a reliable visual forced alignment technology for the following two reasons: 1) Visual Speech Recognition (VSR) has a much lower performance compared to audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the method typically used for building audio forced alignment cannot be utilized in developing visual forced alignment. In order to alleviate these challenges, in this paper, we propose a new method that is appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio. Moreover, by augmenting the alignment task with anomaly case detection, DVFA can detect mismatches between the input transcription and the input video while performing the alignment. Therefore, we can robustly align the text with the talking face video even if there exist error words in the text. Through extensive experiments, we show the effectiveness of the proposed DVFA not only in the alignment task but also in interpreting the outputs of VSR models.

Submitted 26 February, 2023; originally announced March 2023.

Comments: Accepted in AAAI2023

arXiv:2302.12172 [pdf, other]

Vision-Language Generative Model for View-Specific Chest X-ray Generation

Authors: Hyungyung Lee, Da Young Lee, Wonjae Kim, Jin-Hwa Kim, Tackeun Kim, Jihang Kim, Leonard Sunwoo, Edward Choi

Abstract: Synthetic medical data generation has opened up new possibilities in the healthcare domain, offering a powerful tool for simulating clinical scenarios, enhancing diagnostic and treatment quality, gaining granular medical knowledge, and accelerating the development of unbiased algorithms. In this context, we present a novel approach called ViewXGen, designed to overcome the limitations of existing methods that rely on general domain pipelines using only radiology reports to generate frontal-view chest X-rays. Our approach takes into consideration the diverse view positions found in the dataset, enabling the generation of chest X-rays with specific views, which marks a significant advancement in the field. To achieve this, we introduce a set of specially designed tokens for each view position, tailoring the generation process to the user's preferences. Furthermore, we leverage multi-view chest X-rays as input, incorporating valuable information from different views within the same study. This integration rectifies potential errors and contributes to faithfully capturing abnormal findings in chest X-ray generation. To validate the effectiveness of our approach, we conducted statistical analyses, evaluating its performance in a clinical efficacy metric on the MIMIC-CXR dataset. Also, human evaluation demonstrates the remarkable capabilities of ViewXGen, particularly in producing realistic view-specific X-rays that closely resemble the original images.

Submitted 29 April, 2024; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: Accepted at CHIL 2024

arXiv:2212.04356 [pdf, other]

Robust Speech Recognition via Large-Scale Weak Supervision

Authors: Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

Abstract: We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2210.05142 [pdf, ps, other]

doi 10.1016/j.automatica.2023.111371

A Design Method of Distributed Algorithms via Discrete-time Blended Dynamics Theorem

Authors: Jeong Woo Kim, Jin Gyu Lee, Donggil Lee, Hyungbo Shim

Abstract: We develop a discrete-time version of the blended dynamics theorem for use in designing distributed computation algorithms. The blended dynamics theorem makes it possible to predict the behavior of heterogeneous multi-agent systems. Therefore, once we obtain the blended dynamics for a particular computational task, a design idea for the node dynamics of individual heterogeneous agents follows easily. In the continuous-time case, prediction by blended dynamics was enabled by high coupling gain among neighboring agents. In the discrete-time case, we propose an equivalent action, which we call multi-step coupling in this paper. Compared to the continuous-time case, the blended dynamics can have more variety depending on the coupling matrix. This benefit is demonstrated with three applications: distributed estimation of network size, distributed computation of the PageRank, and distributed computation of the degree sequence of a graph, which correspond to the coupling by doubly-stochastic, column-stochastic, and row-stochastic matrices, respectively.

Submitted 11 October, 2022; originally announced October 2022.

Journal ref: Automatica, vol. 159, pp. 111371, Jan 2024

arXiv:2209.11123 [pdf, other]

doi 10.1016/j.ifacol.2020.12.126

Modern Machine Learning Tools for Monitoring and Control of Industrial Processes: A Survey

Authors: R. Bhushan Gopaluni, Aditya Tulsyan, Benoit Chachuat, Biao Huang, Jong Min Lee, Faraz Amjad, Seshu Kumar Damarla, Jong Woo Kim, Nathan P. Lawrence

Abstract: Over the last ten years, we have seen a significant increase in industrial data, tremendous improvement in computational power, and major theoretical advances in machine learning. This opens up an opportunity to use modern machine learning tools on large-scale nonlinear monitoring and control problems. This article provides a survey of recent results with applications in the process industry.

Submitted 22 September, 2022; originally announced September 2022.

Comments: IFAC World Congress 2020

arXiv:2209.10357 [pdf, other]

GIST-AiTeR System for the Diarization Task of the 2022 VoxCeleb Speaker Recognition Challenge

Authors: Dongkeon Park, Yechan Yu, Kyeong Wan Park, Ji Won Kim, Hong Kook Kim

Abstract: This report describes the submission system of the GIST-AiTeR team at the 2022 VoxCeleb Speaker Recognition Challenge (VoxSRC) Track 4. Our system mainly includes speech enhancement, voice activity detection, multi-scaled speaker embedding, probabilistic linear discriminant analysis-based speaker clustering, and overlapped speech detection models. We first construct four different diarization systems according to different model combinations with the best experimental efforts. Our final submission is an ensemble system of all four systems and achieves a diarization error rate of 5.12% on the challenge evaluation set, ranking third in the diarization track of the challenge.

Submitted 6 October, 2022; v1 submitted 21 September, 2022; originally announced September 2022.

Comments: 2022 VoxSRC Track4

arXiv:2209.01724 [pdf, ps, other]

Towards Deep Learning-aided Wireless Channel Estimation and Channel State Information Feedback for 6G

Authors: Wonjun Kim, Yongjun Ahn, Jinhong Kim, Byonghyo Shim

Abstract: Deep learning (DL), a branch of artificial intelligence (AI) techniques, has shown great promise in various disciplines such as image classification and segmentation, speech recognition, and language translation, among others. This remarkable success of DL has stimulated increasing interest in applying this paradigm to wireless channel estimation in recent years. Since DL principles are inductive in nature and distinct from the conventional rule-based algorithms, when one tries to apply DL techniques to channel estimation, one might easily get stuck and confused by the many knobs to control and small details to be aware of. The primary purpose of this paper is to discuss key issues and possible solutions in DL-based wireless channel estimation and channel state information (CSI) feedback including the DL model selection, training data acquisition, and neural network design for 6G. Specifically, we present several case studies together with the numerical experiments to demonstrate the effectiveness of the DL-based wireless channel estimation framework.

Submitted 4 September, 2022; originally announced September 2022.

arXiv:2208.12544 [pdf]

doi 10.1016/j.combustflame.2022.112583

Deep learning-based denoising for fast time-resolved flame emission spectroscopy in high-pressure combustion environment

Authors: Taekeun Yoon, Seon Woong Kim, Hosung Byun, Younsik Kim, Campbell D. Carter, Hyungrok Do

Abstract: A deep learning strategy is developed for fast and accurate gas property measurements using flame emission spectroscopy (FES). Particularly, the short-gated fast FES is essential to resolve fast-evolving combustion behaviors. However, as the exposure time for capturing the flame emission spectrum gets shorter, the signal-to-noise ratio (SNR) decreases, and characteristic spectral features indicating the gas properties become relatively weaker. Then, the property estimation based on the short-gated spectrum is difficult and inaccurate. Denoising convolutional neural networks (CNN) can enhance the SNR of the short-gated spectrum. A new CNN architecture including a reversible down- and up-sampling (DU) operator and a loss function based on proper orthogonal decomposition (POD) coefficients is proposed. For training and testing the CNN, flame chemiluminescence spectra were captured from a stable methane-air flat flame using a portable spectrometer (spectral range: 250 - 850 nm, resolution: 0.5 nm) with varied equivalence ratio (0.8 - 1.2), pressure (1 - 10 bar), and exposure time (0.05, 0.2, 0.4, and 2 s). The long exposure (2 s) spectra were used as the ground truth when training the denoising CNN. A kriging model with POD is trained on the long-gated spectra for calibration and then predicts the gas properties, taking the denoised short-gated spectrum as the input. The property prediction errors of pressure and equivalence ratio were remarkably lowered in spite of the low SNR attendant with reduced exposure.

Submitted 26 December, 2022; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: 25 pages, 12 figures, accepted to Combustion and Flame

Report number: Combustion and Flame 248 (2023) 112583

arXiv:2207.06330 [pdf, other]

Left Ventricle Contouring of Apical Three-Chamber Views on 2D Echocardiography

Authors: Alberto Gomez, Mihaela Porumb, Angela Mumith, Thierry Judge, Shan Gao, Woo-Jin Cho Kim, Jorge Oliveira, Agis Chartsias

Abstract: We propose a new method to automatically contour the left ventricle on 2D echocardiographic images. Unlike most existing segmentation methods, which are based on predicting segmentation masks, we focus on predicting the endocardial contour and the key landmark points within this contour (basal points and apex). This provides a representation that is closer to how experts perform manual annotations and hence produces results that are physiologically more plausible. Our proposed method uses a two-headed network based on the U-Net architecture. One head predicts the 7 contour points, and the other head predicts a distance map to the contour. This approach was compared to the U-Net and to a point based approach, achieving performance gains of up to 30% in terms of landmark localisation (<4.5mm) and distance to the ground truth contour (<3.5mm).

Submitted 13 July, 2022; originally announced July 2022.

Comments: Submitted to MICCAI-ASMUS 2022

arXiv:2206.06541 [pdf, other]

Pixel-by-pixel Mean Opinion Score (pMOS) for No-Reference Image Quality Assessment

Authors: Wook-Hyung Kim, Cheul-hee Hahm, Anant Baijal, Namuk Kim, Ilhyun Cho, Jayoon Koo

Abstract: Deep-learning based techniques have contributed to the remarkable progress in the field of automatic image quality assessment (IQA). Existing IQA methods are designed to measure the quality of an image in terms of Mean Opinion Score (MOS) at the image-level (i.e. the whole image) or at the patch-level (dividing the image into multiple units and measuring quality of each patch). Some applications may require assessing the quality at the pixel-level (i.e. MOS value for each pixel); however, this is not possible with existing techniques, as the spatial information is lost owing to their network structures. This paper proposes an IQA algorithm that can measure the MOS at the pixel-level, in addition to the image-level MOS. The proposed algorithm consists of three core parts, namely: i) Local IQA; ii) Region of Interest (ROI) prediction; iii) High-level feature embedding. The Local IQA part outputs the MOS at the pixel-level, or pixel-by-pixel MOS - we term it 'pMOS'. The ROI prediction part outputs weights that characterize the relative importance of region when calculating the image-level IQA. The high-level feature embedding part extracts high-level image features which are then embedded into the Local IQA part. In other words, the proposed algorithm yields three outputs: the pMOS which represents MOS for each pixel, the weights from the ROI indicating the relative importance of region, and finally the image-level MOS that is obtained by the weighted sum of pMOS and ROI values. The image-level MOS thus obtained by utilizing pMOS and ROI weights shows superior performance compared to the existing popular IQA techniques. In addition, visualization results indicate that predicted pMOS and ROI outputs are reasonably aligned with the general principles of the human visual system (HVS).

Submitted 13 June, 2022; originally announced June 2022.

arXiv:2206.02222 [pdf, other]

How does a Rational Agent Act in an Epidemic?

Authors: S. Yagiz Olmez, Shubham Aggarwal, Jin Won Kim, Erik Miehling, Tamer Başar, Matthew West, Prashant G. Mehta

Abstract: Evolution of disease in a large population is a function of the top-down policy measures from a centralized planner, as well as the self-interested decisions (to be socially active) of individual agents in a large heterogeneous population. This paper is concerned with understanding the latter based on a mean-field type optimal control model. Specifically, the model is used to investigate the role of partial information on an agent's decision-making, and study the impact of such decisions by a large number of agents on the spread of the virus in the population. The motivation comes from the presymptomatic and asymptomatic spread of the COVID-19 virus where an agent unwittingly spreads the virus. We show that even in a setting with fully rational agents, limited information on the viral state can result in an epidemic growth.

Submitted 5 June, 2022; originally announced June 2022.

Comments: arXiv admin note: text overlap with arXiv:2111.10422

arXiv:2204.10479 [pdf, ps, other]

Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

Authors: Donghwan Lee, Do Wan Kim

Abstract: TD-learning is a fundamental algorithm in the field of reinforcement learning (RL) that is employed to evaluate a given policy by estimating the corresponding value function for a Markov decision process. While significant progress has been made in the theoretical analysis of TD-learning, recent research has uncovered guarantees concerning its statistical efficiency by developing finite-time error bounds. This paper aims to contribute to the existing body of knowledge by presenting a novel finite-time analysis of tabular temporal difference (TD) learning, which makes direct and effective use of discrete-time stochastic linear system models and leverages Schur matrix properties. The proposed analysis can cover both on-policy and off-policy settings in a unified manner. By adopting this approach, we hope to offer new and straightforward templates that not only shed further light on the analysis of TD-learning and related RL algorithms but also provide valuable insights for future research in this domain.

Submitted 2 June, 2023; v1 submitted 21 April, 2022; originally announced April 2022.

Comments: arXiv admin note: text overlap with arXiv:2112.14417

arXiv:2203.12053 [pdf, other]

Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content

Authors: Haici Yang, Sanna Wager, Spencer Russell, Mike Luo, Minje Kim, Wontak Kim

Abstract: In the stereo-to-multichannel upmixing problem for music, one of the main tasks is to set the directionality of the instrument sources in the multichannel rendering results. In this paper, we propose a modified variational autoencoder model that learns a latent space to describe the spatial images in multichannel music. We seek to disentangle the spatial images and music content, so the learned latent variables are invariant to the music. At test time, we use the latent variables to control the panning of sources. We propose two upmixing use cases: transferring the spatial images from one song to another and blind panning based on the generative model. We report objective and subjective evaluation results to empirically show that our model captures spatial images separately from music content and achieves transfer-based interactive panning.

Submitted 22 March, 2022; originally announced March 2022.

arXiv:2203.07211 [pdf, other]

Model predictive control and moving horizon estimation for adaptive optimal bolus feeding in high-throughput cultivation of E. coli

Authors: Jong Woo Kim, Niels Krausch, Judit Aizpuru, Tilman Barz, Sergio Lucia, Peter Neubauer, Mariano Nicolas Cruz Bournazou

Abstract: We discuss the application of a nonlinear model predictive control (MPC) and a moving horizon estimation (MHE) to achieve an optimal operation of E. coli fed-batch cultivations with intermittent bolus feeding. 24 parallel experiments were considered in a high-throughput microbioreactor platform at a 10 mL scale. The robotic island in question can run up to 48 fed-batch processes in parallel with automated liquid handling and online and at-line analytics. The implementation of the model-based monitoring and control framework reveals that there are mainly three challenges that need to be addressed: first, the inputs are given in an instantaneous pulsed form by bolus injections; second, online and at-line measurement frequencies are severely imbalanced; and third, optimization for the distinctive multiple reactors can be either parallelized or integrated. We address these challenges by incorporating the concept of impulsive control systems, formulating multi-rate MHE with identifiability analysis, and suggesting criteria for deciding the reactor configuration. In this study, we present the key elements and background theory of the implementation with in silico simulations for bacterial fed-batch cultivation.

Submitted 6 February, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

arXiv:2112.14417

Control Theoretic Analysis of Temporal Difference Learning

Authors: Donghwan Lee, Do Wan Kim

Abstract: The goal of this manuscript is to conduct a control-theoretic analysis of Temporal Difference (TD) learning algorithms. TD-learning serves as a cornerstone in the realm of reinforcement learning, offering a methodology for approximating the value function associated with a given policy in a Markov Decision Process. Despite several existing works that have contributed to the theoretical understanding of TD-learning, it is only in recent years that researchers have been able to establish concrete guarantees on its statistical efficiency. In this paper, we introduce a finite-time, control-theoretic framework for analyzing TD-learning, leveraging established concepts from the field of linear systems control. Consequently, this paper provides additional insights into the mechanics of TD-learning and the broader landscape of reinforcement learning, all while employing straightforward analytical tools derived from control theory.

Submitted 8 September, 2023; v1 submitted 29 December, 2021; originally announced December 2021.

Comments: The contents of this paper have some overlaps with some other arxiv paper we have submitted. Therefore, this paper is redundant in my opinion

arXiv:2112.13283 [pdf, other]

Fitting nonlinear models to continuous oxygen data with oscillatory signal variations via a loss based on Dynamic Time Warping

Authors: Judit Aizpuru, Annina Karolin Kemmer, Jong Woo Kim, Stefan Born, Peter Neubauer, Mariano N. Cruz Bournazou, Tilman Barz

Abstract: High throughput experimental systems play an important role in bioprocess development, as they provide an efficient way of analysing different experimental conditions and performing strain discrimination in phases prior to industrial-scale production. In the millilitre scale, these systems are combinations of parallel mini-bioreactors, liquid handling robots and automated workflows for data handling and model-based operation. For successfully monitoring cultivation conditions and improving the overall process quality by model-based approaches, a proper model identification is crucial. However, the quality and amount of measurements make this task challenging considering the complexity of the bioprocesses. The Dissolved Oxygen Tension is often the only measurement which is available online, and therefore, a good understanding of the errors in this signal is important for performing a robust estimation. Some of the expected errors will provoke uncertainties in the time-domain of the measurement, and in those cases, the common Weighted Least Squares estimation procedure can fail to provide good results. Moreover, these errors will have an even larger effect in the fed-batch phase where bolus feeding is applied, as this generates fast dynamic responses in the signal. In the present work, the performance of the Weighted Least Squares estimator is analysed in an in silico study in which the expected time-uncertainties are present in the oxygen signal. As an alternative, a loss based on the Dynamic Time Warping measure is proposed. The results show how this latter procedure outperforms the former in reconstructing the oxygen signal, and in addition, returns less biased parameter estimates.

Submitted 25 December, 2021; originally announced December 2021.

arXiv:2110.14513 [pdf, other]

Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Authors: Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Hwan Lee, Hoon Heo, Kyogu Lee

Abstract: We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with speech data such as text and speaker information, but rather uses a new set of analysis features, i.e., wav2vec feature and newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can be easily extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY can achieve significant improvement in performance in several applications such as zero-shot voice conversion, pitch shift, and time-scale modification.

Submitted 28 October, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: Neural Information Processing Systems (NeurIPS) 2021

arXiv:2109.09088 [pdf, ps, other]

Relaxed Conditions for Parameterized Linear Matrix Inequality in the Form of Double Sum

Authors: Do Wan Kim, Dong Hwan Lee

Abstract: The aim of this study is to investigate less conservative conditions for a parameterized linear matrix inequality (PLMI) expressed in the form of a double convex sum. This type of PLMI frequently appears in T-S fuzzy control system analysis and design problems. In this letter, we derive new, less conservative linear matrix inequalities (LMIs) for the PLMI by employing the proposed sum relaxation method based on Young's inequality. The derived LMIs are proven to be less conservative than the existing conditions related to this topic in the literature. The proposed technique is applicable to various stability analysis and control design problems for T-S fuzzy systems, which are formulated as solving the PLMIs in the form of a double convex sum. Furthermore, examples are provided to illustrate the reduced conservatism of the derived LMIs.

Submitted 13 July, 2023; v1 submitted 19 September, 2021; originally announced September 2021.

arXiv:2109.08990 [pdf, other]

doi 10.1109/TAES.2021.3114272

First Demonstration of the Korean eLoran Accuracy in a Narrow Waterway Using Improved ASF Maps

Authors: Woohyun Kim, Pyo-Woong Son, Sul Gee Park, Sang Hyun Park, Jiwon Seo

Abstract: The vulnerabilities of global navigation satellite systems (GNSSs) to radio frequency jamming and spoofing have attracted significant research attention. In particular, the large-scale jamming incidents that occurred in South Korea substantiate the practical importance of implementing a complementary navigation system. This letter briefly summarizes the efforts of South Korea to deploy an enhanced long-range navigation (eLoran) system, which is a terrestrial low-frequency radio navigation system that can complement GNSSs. After four years of research and development, the Korean eLoran testbed system was recently deployed and has been operational since June 1, 2021. Although its initial performance at sea is satisfactory, navigation through a narrow waterway is still challenging because a complete survey of the additional secondary factor (ASF), which is the largest source of error for eLoran, is practically difficult in a narrow waterway. This letter proposes an alternative way to survey the ASF in a narrow waterway and improve the ASF map generation methods. Moreover, the performance of the proposed approach was validated experimentally.

Submitted 28 September, 2021; v1 submitted 18 September, 2021; originally announced September 2021.

Comments: Submitted to IEEE Transactions on Aerospace and Electronic Systems

arXiv:2107.05009 [pdf, other]

PocketVAE: A Two-step Model for Groove Generation and Control

Authors: Kyungyun Lee, Wonil Kim, Juhan Nam

Abstract: Creating a good drum track to imitate a skilled performer in digital audio workstations (DAWs) can be a time-consuming process, especially for those unfamiliar with drums. In this work, we introduce PocketVAE, a groove generation system that applies grooves to users' rudimentary MIDI tracks, i.e., templates. Grooves can be either transferred from a reference track, generated randomly or with conditions, such as genres. Our system, consisting of different modules for each groove component, takes a two-step approach that is analogous to a music creation process. First, the note module updates the user template through addition and deletion of notes; second, the velocity and microtiming modules add details to this generated note score. In order to model the drum notes, we apply a discrete latent representation method via Vector Quantized Variational Autoencoder (VQ-VAE), as drum notes have a discrete property, unlike velocity and microtiming values. We show that our two-step approach and the usage of a discrete encoding space improve the learning of the original data distribution. Additionally, we discuss the benefit of incorporating control elements - genre, velocity and microtiming patterns - into the model.

Submitted 11 July, 2021; originally announced July 2021.

arXiv:2106.02391 [pdf, ps, other]

Data-Driven Control Design with LMIs and Dynamic Programming

Authors: Donghwan Lee, Do Wan Kim

Abstract: The goal of this paper is to develop data-driven control design and evaluation strategies based on linear matrix inequalities (LMIs) and dynamic programming. We consider deterministic discrete-time LTI systems, where the system model is unknown. We propose efficient data collection schemes from the state-input trajectories together with data-driven LMIs to design state-feedback controllers for stabilization and linear quadratic regulation (LQR) problems. In addition, we investigate theoretically guaranteed exploration schemes to acquire valid data from the trajectories under different scenarios. In particular, we prove that as more and more data is accumulated, the collected data becomes valid for the proposed algorithms with higher probability. Finally, data-driven dynamic programming algorithms with convergence guarantees are discussed.

Submitted 16 June, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

arXiv:2105.14760 [pdf, ps, other]

Multi-Objective LQG Design with Primal-Dual Method

Authors: Donghwan Lee, Do Wan Kim

Abstract: The goal of this paper is to study a multi-objective linear quadratic Gaussian (LQG) control problem. In particular, we consider an optimal control problem minimizing a quadratic cost over a finite time horizon for linear stochastic systems subject to control energy constraints. To solve the problem, we suggest a bisection line search algorithm which is computationally efficient compared to other approaches such as semidefinite programming. The main idea is to use the Lagrangian function and Karush-Kuhn-Tucker (KKT) optimality conditions to solve the constrained optimization problem. The Lagrange multiplier is searched using the bisection line search. Numerical examples are given to demonstrate the effectiveness of the proposed methods.

Submitted 31 May, 2021; originally announced May 2021.

arXiv:2012.02753 [pdf, other]

Model-plant mismatch learning offset-free model predictive control

Authors: Sang Hwan Son, Jong Woo Kim, Tae Hoon Oh, Jong Min Lee

Abstract: We propose model-plant mismatch learning offset-free model predictive control (MPC), which learns and applies the intrinsic model-plant mismatch, to effectively exploit the advantages of model-based and data-driven control strategies and overcome the limitations of each approach. In this study, the model-plant mismatch map on the steady-state manifold in the controlled variable space is approximated via a general regression neural network from the steady-state data for each setpoint. Though the learned model-plant mismatch map can provide the information at the equilibrium point (i.e., setpoint), it cannot provide model-plant mismatch information during the transient state. Moreover, the intrinsic model-plant mismatch can vary due to system characteristics changes during operation. Therefore, we additionally apply a supplementary disturbance variable which is updated from the disturbance estimator based on the nominal offset-free MPC scheme. Then, the combined disturbance signal is applied to the target problem and finite-horizon optimal control problem of offset-free MPC to improve the prediction accuracy and closed-loop performance of the controller. In this way, we can exploit both the learned model-plant mismatch information and the stabilizing property of the nominal disturbance estimator approach. The closed-loop simulation results demonstrate that the developed scheme can properly learn the intrinsic model-plant mismatch and efficiently improve the model-plant mismatch compensating performance in offset-free MPC. Moreover, we examine the robust asymptotic stability of the developed offset-free MPC scheme, which is known to be difficult to analyze in nominal offset-free MPC, by exploiting the learned model-plant mismatch information.

Submitted 13 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

arXiv:2009.11812 [pdf]

doi 10.23919/ICCAS50221.2020.9268214

Effect of Outlier Removal from Temporal ASF Corrections on Multichain Loran Positioning Accuracy

Authors: Jongmin Park, Pyo-Woong Son, Woohyun Kim, Joon Hyo Rhee, Jiwon Seo

Abstract: The widely used global navigation satellite systems (GNSSs) are vulnerable to radio frequency interference (RFI). Long-range navigation (Loran), a terrestrial navigation system, can compensate for this weakness; however, it suffers from low positioning accuracy, and studies are under way to improve its positioning performance. One such study has proposed the multichain Loran positioning method that uses the signals of transmitting stations belonging to different chains. Although the multichain Loran positioning performance is superior to the performance of conventional methods, the additional secondary factor (ASF) can still degrade its positioning accuracy. To mitigate the effects of temporal ASF, which is one of the ASF components, it is necessary to obtain temporal correction data from a nearby reference station at a known location. In this study, an experiment is performed to verify the effect of removing the outliers in the temporal correction data on the multichain Loran positioning accuracy.

Submitted 24 September, 2020; originally announced September 2020.

Comments: Submitted to ICCAS 2020

Journal ref: 2020 20th International Conference on Control, Automation and Systems (ICCAS)

arXiv:2009.11807 [pdf]

doi 10.23919/ICCAS50221.2020.9268364

Effects of Initial Attitude Estimation Errors on Loosely Coupled Smartphone GPS/IMU Integration System

Authors: Kwansik Park, Woohyun Kim, Jiwon Seo

Abstract: Global Positioning System (GPS) and inertial measurement unit (IMU) sensors are commonly integrated using the extended Kalman filter (EKF), for achieving better navigation performance. However, because of nonlinearity, the performance of the EKF is affected by the initial state estimation errors, and the navigation solutions, including the attitude, diverge rapidly as the initial errors increase. This paper analyzes the data obtained from an outdoor experiment, and investigates the effect of the initial errors on the attitude estimation performance using EKF, which is used in loosely coupled low-cost smartphone GPS/IMU sensors.

Submitted 24 September, 2020; originally announced September 2020.

Comments: Submitted to ICCAS 2020

Journal ref: 2020 20th International Conference on Control, Automation and Systems (ICCAS)

arXiv:2009.11803 [pdf]

doi 10.23919/ICCAS50221.2020.9268359

Development of Record and Management Software for GPS/Loran Measurements

Authors: Woohyun Kim, Pyo-Woong Son, Joon Hyo Rhee, Jiwon Seo

Abstract: In this paper, a software implementation that records Global Positioning System (GPS) and long-range navigation (Loran) measurement data output from an integrated GPS/Loran receiver and organizes them based on time is proposed. The purpose of the developed software is to collect measurements from multiple Loran transmitter chains for performance analysis of navigation methods using Loran, and to organize the data based on time to make it easy to use them. In addition, GPS measurements are also collected and managed as ground truth data for performance analysis. The implemented software consists of three modules: recording, classification, and conversion. The recording module records raw text data streamed from the receiver, and the classification module classifies the recorded text data according to the message format. The conversion module parses the classified text data, sorts GPS and Loran measurements based on timestamp, and outputs them according to the software platform of the user to analyze the measurements. Each module of the software runs automatically without user intervention. The functionality of the implemented software was verified using GPS and Loran measurements collected over 24 h from an actual integrated GPS/Loran receiver.

Submitted 24 September, 2020; originally announced September 2020.

Comments: Submitted to ICCAS 2020

Journal ref: 2020 20th International Conference on Control, Automation and Systems (ICCAS)

arXiv:2009.11587 [pdf, other]

Transfer Learning by Cascaded Network to identify and classify lung nodules for cancer detection

Authors: Shah B. Shrey, Lukman Hakim, Muthusubash Kavitha, Hae Won Kim, Takio Kurita

Abstract: Lung cancer is one of the most deadly diseases in the world. Detecting such tumors at an early stage can be a tedious task. Existing deep learning architectures for lung nodule identification use complex designs with a large number of parameters. This study developed a cascaded architecture which can accurately segment and classify benign or malignant lung nodules on computed tomography (CT) images. The main contribution of this study is to introduce a segmentation network in which the first stage, trained on a public data set, can help to recognize the images that include a nodule from any data set by means of transfer learning. The segmentation of a nodule in turn helps the second stage to classify the nodules into benign and malignant. The proposed architecture outperformed the conventional methods with an area under curve value of 95.67%. The experimental results showed that, with a classification accuracy of 97.96%, our proposed architecture outperformed other simple and complex architectures in classifying lung nodules for lung cancer detection.

Submitted 24 September, 2020; originally announced September 2020.

arXiv:2008.05060 [pdf, other]

doi 10.1109/CVPR.2017.533

Online Graph Completion: Multivariate Signal Recovery in Computer Vision

Authors: Won Hwa Kim, Mona Jalal, Seongjae Hwang, Sterling C. Johnson, Vikas Singh

Abstract: The adoption of "human-in-the-loop" paradigms in computer vision and machine learning is leading to various applications where the actual data acquisition (e.g., human supervision) and the underlying inference algorithms are closely intertwined. While classical work in active learning provides effective solutions when the learning module involves classification and regression tasks, many practical issues such as partially observed measurements, financial constraints and even additional distributional or structural aspects of the data typically fall outside the scope of this treatment. For instance, with sequential acquisition of partial measurements of data that manifest as a matrix (or tensor), novel strategies for completion (or collaborative filtering) of the remaining entries have only been studied recently. Motivated by vision problems where we seek to annotate a large dataset of images via a crowdsourced platform or alternatively, complement results from a state-of-the-art object detector using human feedback, we study the "completion" problem defined on graphs, where requests for additional measurements must be made sequentially. We design the optimization model in the Fourier domain of the graph describing how ideas based on adaptive submodularity provide algorithms that work well in practice. On a large set of images collected from Imgur, we see promising results on images that are otherwise difficult to categorize. We also show applications to an experimental design problem in neuroimaging.

Submitted 11 August, 2020; originally announced August 2020.

Comments: 9 pages, 7 figures, CVPR 2017 Conference

arXiv:2005.04117 [pdf, other]

NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results

Authors: Abdelrahman Abdelhamed, Mahmoud Afifi, Radu Timofte, Michael S. Brown, Yue Cao, Zhilu Zhang, Wangmeng Zuo, Xiaoling Zhang, Jiye Liu, Wendong Chen, Changyuan Wen, Meng Liu, Shuailin Lv, Yunchao Zhang, Zhihong Pan, Baopu Li, Teng Xi, Yanwen Fan, Xiyu Yu, Gang Zhang, Jingtuo Liu, Junyu Han, Errui Ding, Songhyun Yu, Bumjun Park, et al. (65 additional authors not shown)

Abstract: This paper reviews the NTIRE 2020 challenge on real image denoising with focus on the newly introduced dataset, the proposed methods and their results. The challenge is a new version of the previous NTIRE 2019 challenge on real image denoising that was based on the SIDD benchmark. This challenge is based on newly collected validation and testing image datasets, and hence, named SIDD+. This challenge has two tracks for quantitatively evaluating image denoising performance in (1) the Bayer-pattern rawRGB and (2) the standard RGB (sRGB) color spaces. Each track had ~250 registered participants. A total of 22 teams, proposing 24 methods, competed in the final phase of the challenge. The proposed methods by the participating teams represent the current state-of-the-art performance in image denoising targeting real noisy images. The newly collected SIDD+ datasets are publicly available at: https://bit.ly/siddplus_data.

Submitted 8 May, 2020; originally announced May 2020.

Showing 1–50 of 63 results for author: Kim, W