-
Predicting the risk of early-stage breast cancer recurrence using H\&E-stained tissue images
Authors:
Geongyu Lee,
Joonho Lee,
Tae-Yeong Kwak,
Sun Woo Kim,
Youngmee Kwon,
Chungyeul Kim,
Hyeyoon Chang
Abstract:
Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients' risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin stained breast cancer whole slide images la…
▽ More
Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients' risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin stained breast cancer whole slide images labeled with the risk prediction via genomics assays were used, and we obtained sensitivity of 0.857, 0.746, and 0.529 for predicting low, intermediate, and high risk, and specificity of 0.816, 0.803, and 0.972. When compared to the expert pathologist's regional histology grade information, a Pearson's correlation coefficient of 0.61 was obtained. When we checked the model learned through these studies through the class activation map, we found that it actually considered tubule formation and mitotic rate when predicting different risk groups.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition
Authors:
Muhammad Adi Nugroho,
Sangmin Woo,
Sumin Lee,
**young Park,
Yooseung Wang,
Donguk Kim,
Changick Kim
Abstract:
Weakly-Supervised Group Activity Recognition (WSGAR) aims to understand the activity performed together by a group of individuals with the video-level label and without actor-level labels. We propose Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which consists of the motion-aware actor encoder to extract actor features and the two-pathways relation module to infer the interaction…
▽ More
Weakly-Supervised Group Activity Recognition (WSGAR) aims to understand the activity performed together by a group of individuals with the video-level label and without actor-level labels. We propose Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which consists of the motion-aware actor encoder to extract actor features and the two-pathways relation module to infer the interaction among actors and their activity. Flaming-Net leverages an additional optical flow modality in the training stage to enhance its motion awareness when finding locally active actors. The first pathway of the relation module, the actor-centric path, initially captures the temporal dynamics of individual actors and then constructs inter-actor relationships. In parallel, the group-centric path starts by building spatial connections between actors within the same timeframe and then captures simultaneous spatio-temporal dynamics among them. We demonstrate that Flaming-Net achieves new state-of-the-art WSGAR results on two benchmarks, including a 2.8%p higher MPCA score on the NBA dataset. Importantly, we use the optical flow modality only for training and not for inference.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
Authors:
Seonhee Cho,
Choonghan Kim,
Jiho Lee,
Chetan Chilkunda,
Su** Choi,
Joo Heung Yoon
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises…
▽ More
Recent advancements in Large Multimodal Models (LMMs) have attracted interest in their generalization capability with only a few samples in the prompt. This progress is particularly relevant to the medical domain, where the quality and sensitivity of data pose unique challenges for model training and application. However, the dependency on high-quality data for effective in-context learning raises questions about the feasibility of these models when encountering with the inevitable variations and errors inherent in real-world medical data. In this paper, we introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions. MID-M achieves a comparable or superior performance to task-specific fine-tuned LMMs and other general-domain ones, without the extensive domain-specific training or pre-training on multimodal data, with significantly fewer parameters. This highlights the potential of leveraging general-domain LLMs for domain-specific tasks and offers a sustainable and cost-effective alternative to traditional LMM developments. Moreover, the robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
△ Less
Submitted 29 April, 2024;
originally announced May 2024.
-
Deep RAW Image Super-Resolution. A NTIRE 2024 Challenge Survey
Authors:
Marcos V. Conde,
Florin-Alexandru Vasluianu,
Radu Timofte,
Jianxing Zhang,
Jia Li,
Fan Wang,
Xiaopeng Li,
Zikun Liu,
Hyunhee Park,
Sejun Song,
Changho Kim,
Zhijuan Huang,
Hongyuan Yu,
Cheng Wan,
Wending Xiang,
Jiamin Lin,
Hang Zhong,
Qiaosong Zhang,
Yue Sun,
Xuanwu Yin,
Kunlong Zuo,
Senyan Xu,
Siyuan Jiang,
Zhi**g Sun,
Jiaying Zhu
, et al. (10 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as nois…
▽ More
This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines, however, this problem is not as explored as in the RGB domain. Th goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during thee challenge period. The performance of the top-5 submissions is reviewed and provided here as a gauge for the current state-of-the-art in RAW Image Super-Resolution.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
A Robust Ensemble Algorithm for Ischemic Stroke Lesion Segmentation: Generalizability and Clinical Utility Beyond the ISLES Challenge
Authors:
Ezequiel de la Rosa,
Mauricio Reyes,
Sook-Lei Liew,
Alexandre Hutton,
Roland Wiest,
Johannes Kaesmacher,
Uta Hanning,
Arsany Hakim,
Richard Zubal,
Waldo Valenzuela,
David Robben,
Diana M. Sima,
Vincenzo Anania,
Arne Brys,
James A. Meakin,
Anne Mickan,
Gabriel Broocks,
Christian Heitkamp,
Shengbo Gao,
Kongming Liang,
Ziji Zhang,
Md Mahfuzur Rahman Siddiquee,
Andriy Myronenko,
Pooya Ashtari,
Sabine Van Huffel
, et al. (33 additional authors not shown)
Abstract:
Diffusion-weighted MRI (DWI) is essential for stroke diagnosis, treatment decisions, and prognosis. However, image and disease variability hinder the development of generalizable AI algorithms with clinical value. We address this gap by presenting a novel ensemble algorithm derived from the 2022 Ischemic Stroke Lesion Segmentation (ISLES) challenge. ISLES'22 provided 400 patient scans with ischemi…
▽ More
Diffusion-weighted MRI (DWI) is essential for stroke diagnosis, treatment decisions, and prognosis. However, image and disease variability hinder the development of generalizable AI algorithms with clinical value. We address this gap by presenting a novel ensemble algorithm derived from the 2022 Ischemic Stroke Lesion Segmentation (ISLES) challenge. ISLES'22 provided 400 patient scans with ischemic stroke from various medical centers, facilitating the development of a wide range of cutting-edge segmentation algorithms by the research community. Through collaboration with leading teams, we combined top-performing algorithms into an ensemble model that overcomes the limitations of individual solutions. Our ensemble model achieved superior ischemic lesion detection and segmentation accuracy on our internal test set compared to individual algorithms. This accuracy generalized well across diverse image and disease variables. Furthermore, the model excelled in extracting clinical biomarkers. Notably, in a Turing-like test, neuroradiologists consistently preferred the algorithm's segmentations over manual expert efforts, highlighting increased comprehensiveness and precision. Validation using a real-world external dataset (N=1686) confirmed the model's generalizability. The algorithm's outputs also demonstrated strong correlations with clinical scores (admission NIHSS and 90-day mRS) on par with or exceeding expert-derived results, underlining its clinical relevance. This study offers two key findings. First, we present an ensemble algorithm (https://github.com/Tabrisrei/ISLES22_Ensemble) that detects and segments ischemic stroke lesions on DWI across diverse scenarios on par with expert (neuro)radiologists. Second, we show the potential for biomedical challenge outputs to extend beyond the challenge's initial objectives, demonstrating their real-world clinical applicability.
△ Less
Submitted 3 April, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition
Authors:
SooHwan Eom,
Eunseop Yoon,
Hee Suk Yoon,
Chanwoo Kim,
Mark Hasegawa-Johnson,
Chang D. Yoo
Abstract:
In Automatic Speech Recognition (ASR) systems, a recurring obstacle is the generation of narrowly focused output distributions. This phenomenon emerges as a side effect of Connectionist Temporal Classification (CTC), a robust sequence learning tool that utilizes dynamic programming for sequence map**. While earlier efforts have tried to combine the CTC loss with an entropy maximization regulariz…
▽ More
In Automatic Speech Recognition (ASR) systems, a recurring obstacle is the generation of narrowly focused output distributions. This phenomenon emerges as a side effect of Connectionist Temporal Classification (CTC), a robust sequence learning tool that utilizes dynamic programming for sequence map**. While earlier efforts have tried to combine the CTC loss with an entropy maximization regularization term to mitigate this issue, they employed a constant weighting term on the regularization during the training, which we find may not be optimal. In this work, we introduce Adaptive Maximum Entropy Regularization (AdaMER), a technique that can modulate the impact of entropy regularization throughout the training process. This approach not only refines ASR model training but ensures that as training proceeds, predictions display the desired model confidence.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech
Authors:
Abhinav Garg,
Jiyeon Kim,
Sushil Khyalia,
Chanwoo Kim,
Dhananjaya Gowda
Abstract:
Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languag…
▽ More
Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as good or even marginally better than the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
On the compression of shallow non-causal ASR models using knowledge distillation and tied-and-reduced decoder for low-latency on-device speech recognition
Authors:
Nagaraj Adiga,
**hwan Park,
Chintigari Shiva Kumar,
Shatrughan Singh,
Kyungmin Lee,
Chanwoo Kim,
Dhananjaya Gowda
Abstract:
Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation,…
▽ More
Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation, shared decoder, and tied-and-reduced transducer network in order to reduce the model footprint. The shared decoder is changed into a tied-and-reduced network. The cascaded two-pass model is further compressed using knowledge distillation using a Kullback-Leibler divergence loss on the model posteriors. We demonstrate a 50% reduction in the size of a 41 M parameter cascaded teacher model with no noticeable degradation in ASR accuracy and a 30% reduction in latency
△ Less
Submitted 15 December, 2023;
originally announced December 2023.
-
Central Angle Optimization for 360-degree Holographic 3D Content
Authors:
Hakdong Kim,
Minsung Yoon,
Cheongwon Kim
Abstract:
In this study, we propose a method to find an optimal central angle in deep learning-based depth map estimation used to produce realistic holographic content. The acquisition of RGB-depth map images as detailed as possible must be performed to generate holograms of high quality, despite the high computational cost. Therefore, we introduce a novel pipeline designed to analyze various values of cent…
▽ More
In this study, we propose a method to find an optimal central angle in deep learning-based depth map estimation used to produce realistic holographic content. The acquisition of RGB-depth map images as detailed as possible must be performed to generate holograms of high quality, despite the high computational cost. Therefore, we introduce a novel pipeline designed to analyze various values of central angles between adjacent camera viewpoints equidistant from the origin of an object-centered environment. Then we propose the optimal central angle to generate high-quality holographic content. The proposed pipeline comprises key steps such as comparing estimated depth maps and comparing reconstructed CGHs (Computer-Generated Holograms) from RGB images and estimated depth maps. We experimentally demonstrate and discuss the relationship between the central angle and the quality of digital holographic content.
△ Less
Submitted 10 November, 2023;
originally announced November 2023.
-
Reinforcement Learning of Display Transfer Robots in Glass Flow Control Systems: A Physical Simulation-Based Approach
Authors:
Hwajong Lee,
Chan Kim,
Seong-Woo Kim
Abstract:
A flow control system is a critical concept for increasing the production capacity of manufacturing systems. To solve the scheduling optimization problem related to the flow control with the aim of improving productivity, existing methods depend on a heuristic design by domain human experts. Therefore, the methods require correction, monitoring, and verification by using real equipment. As system…
▽ More
A flow control system is a critical concept for increasing the production capacity of manufacturing systems. To solve the scheduling optimization problem related to the flow control with the aim of improving productivity, existing methods depend on a heuristic design by domain human experts. Therefore, the methods require correction, monitoring, and verification by using real equipment. As system designs increase in complexity, the monitoring time increases, which decreases the probability of arriving at the optimal design. As an alternative approach to the heuristic design of flow control systems, the use of deep reinforcement learning to solve the scheduling optimization problem has been considered. Although the existing research on reinforcement learning has yielded excellent performance in some areas, the applicability of the results to actual FAB such as display and semiconductor manufacturing processes is not evident so far. To this end, we propose a method to implement a physical simulation environment and devise a feasible flow control system design using a transfer robot in display manufacturing through reinforcement learning. We present a model and parameter setting to build a virtual environment for different display transfer robots, and training methods of reinforcement learning on the environment to obtain an optimal scheduling of glass flow control systems. Its feasibility was verified by using different types of robots used in the actual process.
△ Less
Submitted 11 October, 2023;
originally announced October 2023.
-
Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis
Authors:
Jae-Sung Bae,
Joun Yeop Lee,
Ji-Hyun Lee,
Seongkyu Mun,
Taehwa Kang,
Hoon-Young Cho,
Chanwoo Kim
Abstract:
Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts s…
▽ More
Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts simple but effective latent space data augmentation in the speaker embedding space of the ZS-TTS system. By incorporating a consistency loss, LF can be seamlessly integrated into existing ZS-TTS systems without the need for additional training stages. Experimental results show that LF significantly improves speaker similarity while preserving speech quality.
△ Less
Submitted 22 January, 2024; v1 submitted 5 October, 2023;
originally announced October 2023.
-
A multi-institutional pediatric dataset of clinical radiology MRIs by the Children's Brain Tumor Network
Authors:
Ariana M. Familiar,
Anahita Fathi Kazerooni,
Hannah Anderson,
Aliaksandr Lubneuski,
Karthik Viswanathan,
Rocky Breslow,
Nastaran Khalili,
Sina Bagheri,
Debanjan Haldar,
Meen Chul Kim,
Sherjeel Arif,
Rachel Madhogarhia,
Thinh Q. Nguyen,
Elizabeth A. Frenkel,
Zeinab Helili,
Jessica Harrison,
Keyvan Farahani,
Marius George Linguraru,
Ulas Bagci,
Yury Velichko,
Jeffrey Stevens,
Sarah Leary,
Robert M. Lober,
Stephani Campion,
Amy A. Smith
, et al. (15 additional authors not shown)
Abstract:
Pediatric brain and spinal cancers remain the leading cause of cancer-related death in children. Advancements in clinical decision-support in pediatric neuro-oncology utilizing the wealth of radiology imaging data collected through standard care, however, has significantly lagged other domains. Such data is ripe for use with predictive analytics such as artificial intelligence (AI) methods, which…
▽ More
Pediatric brain and spinal cancers remain the leading cause of cancer-related death in children. Advancements in clinical decision-support in pediatric neuro-oncology utilizing the wealth of radiology imaging data collected through standard care, however, has significantly lagged other domains. Such data is ripe for use with predictive analytics such as artificial intelligence (AI) methods, which require large datasets. To address this unmet need, we provide a multi-institutional, large-scale pediatric dataset of 23,101 multi-parametric MRI exams acquired through routine care for 1,526 brain tumor patients, as part of the Children's Brain Tumor Network. This includes longitudinal MRIs across various cancer diagnoses, with associated patient-level clinical information, digital pathology slides, as well as tissue genotype and omics data. To facilitate downstream analysis, treatment-naïve images for 370 subjects were processed and released through the NCI Childhood Cancer Data Initiative via the Cancer Data Service. Through ongoing efforts to continuously build these imaging repositories, our aim is to accelerate discovery and translational AI models with real-world data, to ultimately empower precision medicine for children.
△ Less
Submitted 2 October, 2023;
originally announced October 2023.
-
A novel approach for holographic 3D content generation without depth map
Authors:
Hakdong Kim,
Minkyu Jee,
Yurim Lee,
Kyudam Choi,
MinSung Yoon,
Cheongwon Kim
Abstract:
In preparation for observing holographic 3D content, acquiring a set of RGB color and depth map images per scene is necessary to generate computer-generated holograms (CGHs) when using the fast Fourier transform (FFT) algorithm. However, in real-world situations, these paired formats of RGB color and depth map images are not always fully available. We propose a deep learning-based method to synthe…
▽ More
In preparation for observing holographic 3D content, acquiring a set of RGB color and depth map images per scene is necessary to generate computer-generated holograms (CGHs) when using the fast Fourier transform (FFT) algorithm. However, in real-world situations, these paired formats of RGB color and depth map images are not always fully available. We propose a deep learning-based method to synthesize the volumetric digital holograms using only the given RGB image, so that we can overcome environments where RGB color and depth map images are partially provided. The proposed method uses only the input of RGB image to estimate its depth map and then generate its CGH sequentially. Through experiments, we demonstrate that the volumetric hologram generated through our proposed model is more accurate than that of competitive models, under the situation that only RGB color data can be provided.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Novel Smart N95 Filtering Facepiece Respirator with Real-time Adaptive Fit Functionality and Wireless Humidity Monitoring for Enhanced Wearable Comfort
Authors:
Kangkyu Kwon,
Yoon Jae Lee,
Yeongju Jung,
Ira Soltis,
Chanyeong Choi,
Yewon Na,
Lissette Romero,
Myung Chul Kim,
Nathan Rodeheaver,
Hodam Kim,
Michael S. Lloyd,
Ziqing Zhuang,
William King,
Susan Xu,
Seung-Hwan Ko,
**woo Lee,
Woon-Hong Yeo
Abstract:
The widespread emergence of the COVID-19 pandemic has transformed our lifestyle, and facial respirators have become an essential part of daily life. Nevertheless, the current respirators possess several limitations such as poor respirator fit because they are incapable of covering diverse human facial sizes and shapes, potentially diminishing the effect of wearing respirators. In addition, the cur…
▽ More
The widespread emergence of the COVID-19 pandemic has transformed our lifestyle, and facial respirators have become an essential part of daily life. Nevertheless, the current respirators possess several limitations such as poor respirator fit because they are incapable of covering diverse human facial sizes and shapes, potentially diminishing the effect of wearing respirators. In addition, the current facial respirators do not inform the user of the air quality within the smart facepiece respirator in case of continuous long-term use. Here, we demonstrate the novel smart N-95 filtering facepiece respirator that incorporates the humidity sensor and pressure sensory feedback-enabled self-fit adjusting functionality for the effective performance of the facial respirator to prevent the transmission of airborne pathogens. The laser-induced graphene (LIG) constitutes the humidity sensor, and the pressure sensor array based on the dielectric elastomeric sponge monitors the respirator contact on the face of the user, providing the sensory information for a closed-loop feedback mechanism. As a result of the self-fit adjusting mode along with elastomeric lining, the fit factor is increased by 3.20 and 5 times at average and maximum respectively. We expect that the experimental proof-of-concept of this work will offer viable solutions to the current commercial respirators to address the limitations.
△ Less
Submitted 8 September, 2023;
originally announced September 2023.
-
RADIO: Reference-Agnostic Dubbing Video Synthesis
Authors:
Dongyeun Lee,
Chaewon Kim,
Sangjoon Yu,
Jaejun Yoo,
Gyeong-Moon Park
Abstract:
One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework eng…
▽ More
One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images. The key is to modulate the decoder layers using latent space composed of audio and reference features. Additionally, we incorporate ViT blocks into the decoder to emphasize high-fidelity details, especially in the lip region. Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity. Especially in harsh scenarios where the reference frame deviates significantly from the ground truth, our method outperforms state-of-the-art methods, highlighting its robustness.
△ Less
Submitted 6 November, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
Authors:
Hyungchan Yoon,
Changhwan Kim,
Eunwoo Song,
Hyun-Wook Yoon,
Hong-Goo Kang
Abstract:
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we p…
▽ More
For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for a transformer known as sparse attention, to improve the TTS model's generalization abilities. In particular, we prune off redundant connections from self-attention layers whose attention weights are below the threshold. To flexibly determine the pruning strength for searching optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to automatically learn the thresholds. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction
Authors:
Eunseop Yoon,
Hee Suk Yoon,
Dhananjaya Gowda,
SooHwan Eom,
Daehyeok Kim,
John Harvill,
Heting Gao,
Mark Hasegawa-Johnson,
Chanwoo Kim,
Chang D. Yoo
Abstract:
Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or parag…
▽ More
Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or paragraph-level G2P can improve usability in real-world applications as it is better suited to perform on heteronyms and linking sounds between words, we find that using ByT5 for these scenarios is nontrivial. Since ByT5 operates on the character level, it requires longer decoding steps, which deteriorates the performance due to the exposure bias commonly observed in auto-regressive generation models. This paper shows that the performance of sentence-level and paragraph-level G2P can be improved by mitigating such exposure bias using our proposed loss-based sampling method.
△ Less
Submitted 16 August, 2023;
originally announced August 2023.
-
KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Authors:
Gangwoo Kim,
Hajung Kim,
Lei Ji,
Seongsu Bae,
Chanhwi Kim,
Mujeen Sung,
Hyunjae Kim,
Kun Yan,
Eric Chang,
Jaewoo Kang
Abstract:
In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain. Our model is initially pre-trained on various multimodal datasets within the general domain before being transferred to the chest X-ray domain. Following a prominent VLM, we unify various domain-specific tasks into a simple sequence-to-sequence schema. It enables the model to effectively…
▽ More
In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain. Our model is initially pre-trained on various multimodal datasets within the general domain before being transferred to the chest X-ray domain. Following a prominent VLM, we unify various domain-specific tasks into a simple sequence-to-sequence schema. It enables the model to effectively learn the required knowledge and skills from limited resources in the domain. Demonstrating superior performance on the benchmark datasets provided by the BioNLP shared task, our model benefits from its training across multiple tasks and domains. With subtle techniques including ensemble and factual calibration, our system achieves first place on the RadSum23 leaderboard for the hidden test set.
△ Less
Submitted 10 July, 2023;
originally announced July 2023.
-
Data Transmission and Communication via Electrolytic Flow Channel
Authors:
Khoi Ly,
Chong-Chan Kim,
Robert Shepherd
Abstract:
As an alternative approach to ionic data transmission with hydrogel as substrate, this work explores the possible applications of liquid electrolyte filling cavity of a stretchable, flexible elastomeric tubing, which is the primary ingredient used in redox flow battery systems. While hydrogel-based ionic impedance characterization and its data communication capability have been well studied, the m…
▽ More
As an alternative approach to ionic data transmission with hydrogel as substrate, this work explores the possible applications of liquid electrolyte filling cavity of a stretchable, flexible elastomeric tubing, which is the primary ingredient used in redox flow battery systems. While hydrogel-based ionic impedance characterization and its data communication capability have been well studied, the multifunctional use of redox flow battery electrolyte for data communication, in addition to powering, is novel, especially in the context of soft robotics. This work also describes a simple signal conditioning technique that addresses the signal decaying problem due to long-distance ionic data transmission. Finally, we describe the prototypical design of a decentralized data communication between two, or possibly more, systems of power and control. This work presents the concept of a decentralized control of a robot's hydraulic actuation system, that will allow for the system's functional robustness in the event of one of the two, or possibly more, modules lose power or become severed from the robot's body.
△ Less
Submitted 4 July, 2023;
originally announced July 2023.
-
SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization
Authors:
Changhun Kim,
Joonhyung Park,
Ha** Shim,
Eunho Yang
Abstract:
Automatic speech recognition (ASR) models are frequently exposed to data distribution shifts in many real-world scenarios, leading to erroneous predictions. To tackle this issue, an existing test-time adaptation (TTA) method has recently been proposed to adapt the pre-trained ASR model on unlabeled test instances without source data. Despite decent performance gain, this work relies solely on naiv…
▽ More
Automatic speech recognition (ASR) models are frequently exposed to data distribution shifts in many real-world scenarios, leading to erroneous predictions. To tackle this issue, an existing test-time adaptation (TTA) method has recently been proposed to adapt the pre-trained ASR model on unlabeled test instances without source data. Despite decent performance gain, this work relies solely on naive greedy decoding and performs adaptation across timesteps at a frame level, which may not be optimal given the sequential nature of the model output. Motivated by this, we propose a novel TTA framework, dubbed SGEM, for general ASR models. To treat the sequential output, SGEM first exploits beam search to explore candidate output logits and selects the most plausible one. Then, it utilizes generalized entropy minimization and negative sampling as unsupervised objectives to adapt the model. SGEM achieves state-of-the-art performance for three mainstream ASR models under various domain shifts.
△ Less
Submitted 21 June, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Megahertz X-ray Multi-projection imaging
Authors:
Pablo Villanueva-Perez,
Valerio Bellucci,
Yuhe Zhang,
Sarlota Birnsteinova,
Rita Graceffa,
Luigi Adriano,
Eleni Myrto Asimakopoulou,
Ilia Petrov,
Zisheng Yao,
Marco Romagnoni,
Andrea Mazzolari,
Romain Letrun,
Chan Kim,
Jayanath C. P. Koliyadu,
Carsten Deiter,
Richard Bean,
Gabriele Giovanetti,
Luca Gelisio,
Tobias Ritschel,
Adrian Mancuso,
Henry N. Chapman,
Alke Meents,
Tokushi Sato,
Patrik Vagovic
Abstract:
X-ray time-resolved tomography is one of the most popular X-ray techniques to probe dynamics in three dimensions (3D). Recent developments in time-resolved tomography opened the possibility of recording kilohertz-rate 3D movies. However, tomography requires rotating the sample with respect to the X-ray beam, which prevents characterization of faster structural dynamics. Here, we present megahertz…
▽ More
X-ray time-resolved tomography is one of the most popular X-ray techniques to probe dynamics in three dimensions (3D). Recent developments in time-resolved tomography opened the possibility of recording kilohertz-rate 3D movies. However, tomography requires rotating the sample with respect to the X-ray beam, which prevents characterization of faster structural dynamics. Here, we present megahertz (MHz) X-ray multi-projection imaging (MHz-XMPI), a technique capable of recording volumetric information at MHz rates and micrometer resolution without scanning the sample. We achieved this by harnessing the unique megahertz pulse structure and intensity of the European X-ray Free-electron Laser with a combination of novel detection and reconstruction approaches that do not require sample rotations. Our approach enables generating multiple X-ray probes that simultaneously record several angular projections for each pulse in the megahertz pulse burst. We provide a proof-of-concept demonstration of the MHz-XMPI technique's capability to probe 4D (3D+time) information on stochastic phenomena and non-reproducible processes three orders of magnitude faster than state-of-the-art time-resolved X-ray tomography, by generating 3D movies of binary droplet collisions. We anticipate that MHz-XMPI will enable in-situ and operando studies that were impossible before, either due to the lack of temporal resolution or because the systems were opaque (such as for MHz imaging based on optical microscopy).
△ Less
Submitted 19 May, 2023;
originally announced May 2023.
-
Restoring Original Signal From Pile-up Signal using Deep Learning
Authors:
C. H. Kim,
S. Ahn,
K. Y. Chae,
J. Hooker,
G. V. Rogachev
Abstract:
Pile-up signals are frequently produced in experimental physics. They create inaccurate physics data with high uncertainty and cause various problems. Therefore, the correction to pile-up signals is crucially required. In this study, we implemented a deep learning method to restore the original signals from the pile-up signals. We showed that a deep learning model could accurately reconstruct the…
▽ More
Pile-up signals are frequently produced in experimental physics. They create inaccurate physics data with high uncertainty and cause various problems. Therefore, the correction to pile-up signals is crucially required. In this study, we implemented a deep learning method to restore the original signals from the pile-up signals. We showed that a deep learning model could accurately reconstruct the original signal waveforms from the pile-up waveforms. By substituting the pile-up signals with the original signals predicted by the model, the energy and timing resolutions of the data are notably enhanced. The model implementation significantly improved the quality of the particle identification plot and particle tracks. This method is applicable to similar problems, such as separating multiple signals or correcting pile-up signals with other types of noises and backgrounds.
△ Less
Submitted 24 April, 2023;
originally announced April 2023.
-
Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video
Authors:
Minsu Kim,
Chae Won Kim,
Yong Man Ro
Abstract:
Forced alignment refers to a technology that time-aligns a given transcription with a corresponding speech. However, as the forced alignment technologies have developed using speech audio, they might fail in alignment when the input speech audio is noise-corrupted or is not accessible. We focus on that there is another component that the speech can be inferred from, the speech video (i.e., talking…
▽ More
Forced alignment refers to a technology that time-aligns a given transcription with a corresponding speech. However, as the forced alignment technologies have developed using speech audio, they might fail in alignment when the input speech audio is noise-corrupted or is not accessible. We focus on that there is another component that the speech can be inferred from, the speech video (i.e., talking face video). Since the drawbacks of audio-based forced alignment can be complemented using the visual information when the audio signal is under poor condition, we try to develop a novel video-based forced alignment method. However, different from audio forced alignment, it is challenging to develop a reliable visual forced alignment technology for the following two reasons: 1) Visual Speech Recognition (VSR) has a much lower performance compared to audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the method typically used for building audio forced alignment cannot be utilized in develo** visual forced alignment. In order to alleviate these challenges, in this paper, we propose a new method that is appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio. Moreover, by augmenting the alignment task with anomaly case detection, DVFA can detect mismatches between the input transcription and the input video while performing the alignment. Therefore, we can robustly align the text with the talking face video even if there exist error words in the text. Through extensive experiments, we show the effectiveness of the proposed DVFA not only in the alignment task but also in interpreting the outputs of VSR models.
△ Less
Submitted 26 February, 2023;
originally announced March 2023.
-
Context-Based Trit-Plane Coding for Progressive Image Compression
Authors:
Seungmin Jeon,
Kwang Pyo Choi,
Youngo Park,
Chang-Su Kim
Abstract:
Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models. In this paper, we propose the context-based trit-plane coding (CTC) algorithm to achieve progressive compression more compactly. First, we develop the context-based rate reduction module to estimate trit probabilities of latent elements accurately and thus encode the trit-planes compactly…
▽ More
Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models. In this paper, we propose the context-based trit-plane coding (CTC) algorithm to achieve progressive compression more compactly. First, we develop the context-based rate reduction module to estimate trit probabilities of latent elements accurately and thus encode the trit-planes compactly. Second, we develop the context-based distortion reduction module to refine partial latent tensors from the trit-planes and improve the reconstructed image quality. Third, we propose a retraining scheme for the decoder to attain better rate-distortion tradeoffs. Extensive experiments show that CTC outperforms the baseline trit-plane codec significantly in BD-rate on the Kodak lossless dataset, while increasing the time complexity only marginally. Our codes are available at https://github.com/seungminjeon-github/CTC.
△ Less
Submitted 13 March, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Stable Contact Guaranteeing Motion/Force Control for an Aerial Manipulator on an Arbitrarily Tilted Surface
Authors:
Jeonghyun Byun,
Byeongjun Kim,
Changhyeon Kim,
Donggeon David Oh,
H. ** Kim
Abstract:
This study aims to design a motion/force controller for an aerial manipulator which guarantees the tracking of time-varying motion/force trajectories as well as the stability during the transition between free and contact motions. To this end, we model the force exerted on the end-effector as the Kelvin-Voigt linear model and estimate its parameters by recursive least-squares estimator. Then, the…
▽ More
This study aims to design a motion/force controller for an aerial manipulator which guarantees the tracking of time-varying motion/force trajectories as well as the stability during the transition between free and contact motions. To this end, we model the force exerted on the end-effector as the Kelvin-Voigt linear model and estimate its parameters by recursive least-squares estimator. Then, the gains of the disturbance-observer (DOB)-based motion/force controller are calculated based on the stability conditions considering both the model uncertainties in the dynamic equation and switching between the free and contact motions. To validate the proposed controller, we conducted the time-varying motion/force tracking experiments with different approach speeds and orientations of the surface. The results show that our controller enables the aerial manipulator to track the time-varying motion/force trajectories.
△ Less
Submitted 19 January, 2023;
originally announced January 2023.
-
Macro-block dropout for improved regularization in training end-to-end speech recognition models
Authors:
Chanwoo Kim,
Sathish Indurti,
**hwan Park,
Wonyong Sung
Abstract:
This paper proposes a new regularization algorithm referred to as macro-block dropout. The overfitting issue has been a difficult problem in training large neural network models. The dropout technique has proven to be simple yet very effective for regularization by preventing complex co-adaptations during training. In our work, we define a macro-block that contains a large number of units from the…
▽ More
This paper proposes a new regularization algorithm referred to as macro-block dropout. The overfitting issue has been a difficult problem in training large neural network models. The dropout technique has proven to be simple yet very effective for regularization by preventing complex co-adaptations during training. In our work, we define a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN). Rather than applying dropout to each unit, we apply random dropout to each macro-block. This algorithm has the effect of applying different drop out rates for each layer even if we keep a constant average dropout rate, which has better regularization effects. In our experiments using Recurrent Neural Network-Transducer (RNN-T), this algorithm shows relatively 4.30 % and 6.13 % Word Error Rates (WERs) improvement over the conventional dropout on LibriSpeech test-clean and test-other. With an Attention-based Encoder-Decoder (AED) model, this algorithm shows relatively 4.36 % and 5.85 % WERs improvement over the conventional dropout on the same test sets.
△ Less
Submitted 28 December, 2022;
originally announced December 2022.
-
LISA: Localized Image Stylization with Audio via Implicit Neural Representation
Authors:
Seung Hyun Lee,
Chanyoung Kim,
Wonmin Byeon,
Sang Ho Yoon,
**kyu Kim,
Sangpil Kim
Abstract:
We present a novel framework, Localized Image Stylization with Audio (LISA) which performs audio-driven localized image stylization. Sound often provides information about the specific context of the scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a pa…
▽ More
We present a novel framework, Localized Image Stylization with Audio (LISA) which performs audio-driven localized image stylization. Sound often provides information about the specific context of the scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. In this work, we propose a framework that a user provides an audio input to localize the sound source in the input image and another for locally stylizing the target object or scene. LISA first produces a delicate localization map with an audio-visual localization network by leveraging CLIP embedding space. We then utilize implicit neural representation (INR) along with the predicted localization map to stylize the target object or scene based on sound information. The proposed INR can manipulate the localized pixel values to be semantically consistent with the provided audio input. Through a series of experiments, we show that the proposed framework outperforms the other audio-guided stylization methods. Moreover, LISA constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
△ Less
Submitted 21 November, 2022;
originally announced November 2022.
-
An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space
Authors:
Jihwan Lee,
Jae-Sung Bae,
Seongkyu Mun,
Hee** Choi,
Joun Yeop Lee,
Hoon-Young Cho,
Chanwoo Kim
Abstract:
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running a subjective evaluation for such cross-lingual TTS systems is troublesome. The vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a great alternative analysis tool. In this study, we apply th…
▽ More
With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running a subjective evaluation for such cross-lingual TTS systems is troublesome. The vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a great alternative analysis tool. In this study, we apply the vowel space analysis method to explore L2 accents of cross-lingual TTS systems. Through the vowel space analysis, we observe the three followings: a) a parallel architecture (Glow-TTS) is less L2-accented than an auto-regressive one (Tacotron); b) L2 accents are more dominant in non-shared vowels in a language pair; and c) L2 accents of cross-lingual TTS systems share some phenomena with those of human L2 learners. Our findings imply that it is necessary for TTS systems to handle each language pair differently, depending on their linguistic characteristics such as non-shared vowels. They also hint that we can further incorporate linguistics knowledge in develo** cross-lingual TTS systems.
△ Less
Submitted 6 November, 2022;
originally announced November 2022.
-
WaGI : Wavelet-based GAN Inversion for Preserving High-frequency Image Details
Authors:
Seung-Jun Moon,
Chaewon Kim,
Gyeong-Moon Park
Abstract:
Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While those are helpful for preserving details compared to a naiive low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that the existing GAN inversion models have inherent limitations in both str…
▽ More
Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While those are helpful for preserving details compared to a naiive low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that the existing GAN inversion models have inherent limitations in both structural and training aspects, which preclude the delicate reconstruction of high-frequency features. Especially, we prove that the widely-used loss term in GAN inversion, i.e., L2, is biased to reconstruct low-frequency features mainly. To overcome this problem, we propose a novel GAN inversion model, coined WaGI, which enables to handle high-frequency features explicitly, by using a novel wavelet-based loss term and a newly proposed wavelet fusion scheme. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. We demonstrate that WaGI shows outstanding results on both inversion and editing, compared to the existing state-of-the-art GAN inversion models. Especially, WaGI robustly preserves high-frequency features of images even in the editing scenario. We will release our code with the pre-trained model after the review.
△ Less
Submitted 18 October, 2022;
originally announced October 2022.
-
Left Ventricle Contouring of Apical Three-Chamber Views on 2D Echocardiography
Authors:
Alberto Gomez,
Mihaela Porumb,
Angela Mumith,
Thierry Judge,
Shan Gao,
Woo-** Cho Kim,
Jorge Oliveira,
Agis Chartsias
Abstract:
We propose a new method to automatically contour the left ventricle on 2D echocardiographic images. Unlike most existing segmentation methods, which are based on predicting segmentation masks, we focus at predicting the endocardial contour and the key landmark points within this contour (basal points and apex). This provides a representation that is closer to how experts perform manual annotations…
▽ More
We propose a new method to automatically contour the left ventricle on 2D echocardiographic images. Unlike most existing segmentation methods, which are based on predicting segmentation masks, we focus at predicting the endocardial contour and the key landmark points within this contour (basal points and apex). This provides a representation that is closer to how experts perform manual annotations and hence produce results that are physiologically more plausible.
Our proposed method uses a two-headed network based on the U-Net architecture. One head predicts the 7 contour points, and the other head predicts a distance map to the contour. This approach was compared to the U-Net and to a point based approach, achieving performance gains of up to 30\% in terms of landmark localisation (<4.5mm) and distance to the ground truth contour (<3.5mm).
△ Less
Submitted 13 July, 2022;
originally announced July 2022.
-
Zero-shot Blind Image Denoising via Implicit Neural Representations
Authors:
Chaewon Kim,
Jaeho Lee,
**woo Shin
Abstract:
Recent denoising algorithms based on the "blind-spot" strategy show impressive blind image denoising performances, without utilizing any external dataset. While the methods excel in recovering highly contaminated images, we observe that such algorithms are often less effective under a low-noise or real noise regime. To address this gap, we propose an alternative denoising strategy that leverages t…
▽ More
Recent denoising algorithms based on the "blind-spot" strategy show impressive blind image denoising performances, without utilizing any external dataset. While the methods excel in recovering highly contaminated images, we observe that such algorithms are often less effective under a low-noise or real noise regime. To address this gap, we propose an alternative denoising strategy that leverages the architectural inductive bias of implicit neural representations (INRs), based on our two findings: (1) INR tends to fit the low-frequency clean image signal faster than the high-frequency noise, and (2) INR layers that are closer to the output play more critical roles in fitting higher-frequency parts. Building on these observations, we propose a denoising algorithm that maximizes the innate denoising capability of INRs by penalizing the growth of deeper layer weights. We show that our method outperforms existing zero-shot denoising methods under an extensive set of low-noise or real-noise scenarios.
△ Less
Submitted 5 April, 2022;
originally announced April 2022.
-
Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech
Authors:
Hyungchan Yoon,
Seyun Um,
Changwhan Kim,
Hong-Goo Kang
Abstract:
To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations…
▽ More
To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms, and during inference, we estimate these embeddings from text using generative adversarial networks (GANs). Using GANs, we reliably estimate the prosody embeddings in a fast way, which have complex distributions due to the dynamic nature of speech. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. Our proposed model surpasses several publicly available models with less parameters and computational complexity in comparative experiments.
△ Less
Submitted 28 August, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Into-TTS : Intonation Template Based Prosody Control System
Authors:
Jihwan Lee,
Joun Yeop Lee,
Hee** Choi,
Seongkyu Mun,
Sangjun Park,
Jae-Sung Bae,
Chanwoo Kim
Abstract:
Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervi…
▽ More
Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervised manner. Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder. The intonation predictor recommends a suitable intonation template to the given text. The intonation encoder, attached to the text encoder output, synthesizes speech abiding the requested intonation template. Main contributions of our paper are: (a) an easy-to-use intonation control system covering a wide range of users; (b) better performance in wrap** speech in a requested intonation with improved objective and subjective evaluation; and (c) incorporating a pre-trained language model for intonation modelling. Audio samples are available at https://srtts.github.io/IntoTTS.
△ Less
Submitted 6 November, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
RD-Optimized Trit-Plane Coding of Deep Compressed Image Latent Tensors
Authors:
Seungmin Jeon,
Jae-Han Lee,
Chang-Su Kim
Abstract:
DPICT is the first learning-based image codec supporting fine granular scalability. In this paper, we describe how to implement two key components of DPICT efficiently: trit-plane slicing and rate-distortion-optimized (RD-optimized) coding. In DPICT, we transform an image into a latent tensor, represent the tensor in ternary digits (trits), and encode the trits in the decreasing order of significa…
▽ More
DPICT is the first learning-based image codec supporting fine granular scalability. In this paper, we describe how to implement two key components of DPICT efficiently: trit-plane slicing and rate-distortion-optimized (RD-optimized) coding. In DPICT, we transform an image into a latent tensor, represent the tensor in ternary digits (trits), and encode the trits in the decreasing order of significance. For entropy encoding, it is necessary to compute the probability of each trit, which demands high time complexity in both the encoder and the decoder. To reduce the complexity, we develop a parallel computing scheme for the probabilities, which is described in detail with pseudo-codes. Moreover, we compare the trit-plane slicing in DPICT with the alternative bit-plane slicing. Experimental results show that the time complexity is reduced significantly by the parallel computing and that the trit-plane slicing provides better RD performances than the bit-plane slicing.
△ Less
Submitted 8 May, 2022; v1 submitted 25 March, 2022;
originally announced March 2022.
-
Attributable-Watermarking of Speech Generative Models
Authors:
Yongbaek Cho,
Changhoon Kim,
Yezhou Yang,
Yi Ren
Abstract:
Generative models are now capable of synthesizing images, speeches, and videos that are hardly distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation and IP theft. This paper investigates a solution for model attribution, i.e., the classification of synthetic contents by their source models via watermarks embedded in the contents. Building on past…
▽ More
Generative models are now capable of synthesizing images, speeches, and videos that are hardly distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation and IP theft. This paper investigates a solution for model attribution, i.e., the classification of synthetic contents by their source models via watermarks embedded in the contents. Building on past success of model attribution in the image domain, we discuss algorithmic improvements for generating user-end speech models that empirically achieve high attribution accuracy, while maintaining high generation quality. We show the trade off between attributability and generation quality under a variety of attacks on generated speech signals attempting to remove the watermarks, and the feasibility of learning robust watermarks against these attacks.
△ Less
Submitted 15 March, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Two-Pass End-to-End ASR Model Compression
Authors:
Nauman Dawalatabad,
Tushar Vatsal,
Ashutosh Gupta,
Sungsoo Kim,
Shatrughan Singh,
Dhananjaya Gowda,
Chanwoo Kim
Abstract:
Speech recognition on smart devices is challenging owing to the small memory footprint. Hence small size ASR models are desirable. With the use of popular transducer-based models, it has become possible to practically deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2] combining RNN-T and LAS modules has shown exceptional performance for streaming on-d…
▽ More
Speech recognition on smart devices is challenging owing to the small memory footprint. Hence small size ASR models are desirable. With the use of popular transducer-based models, it has become possible to practically deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2] combining RNN-T and LAS modules has shown exceptional performance for streaming on-device speech recognition.
In this work, we propose a simple and effective approach to reduce the size of the two-pass model for memory-constrained devices. We employ a popular knowledge distillation approach in three stages using the Teacher-Student training technique. In the first stage, we use a trained RNN-T model as a teacher model and perform knowledge distillation to train the student RNN-T model. The second stage uses the shared encoder and trains a LAS rescorer for student model using the trained RNN-T+LAS teacher model. Finally, we perform deep-finetuning for the student model with a shared RNN-T encoder, RNN-T decoder, and LAS rescorer. Our experimental results on standard LibriSpeech dataset show that our system can achieve a high compression rate of 55% without significant degradation in the WER compared to the two-pass teacher model.
△ Less
Submitted 7 January, 2022;
originally announced January 2022.
-
DPICT: Deep Progressive Image Compression Using Trit-Planes
Authors:
Jae-Han Lee,
Seungmin Jeon,
Kwang Pyo Choi,
Youngo Park,
Chang-Su Kim
Abstract:
We propose the deep progressive image compression using trit-planes (DPICT) algorithm, which is the first learning-based codec supporting fine granular scalability (FGS). First, we transform an image into a latent tensor using an analysis network. Then, we represent the latent tensor in ternary digits (trits) and encode it into a compressed bitstream trit-plane by trit-plane in the decreasing orde…
▽ More
We propose the deep progressive image compression using trit-planes (DPICT) algorithm, which is the first learning-based codec supporting fine granular scalability (FGS). First, we transform an image into a latent tensor using an analysis network. Then, we represent the latent tensor in ternary digits (trits) and encode it into a compressed bitstream trit-plane by trit-plane in the decreasing order of significance. Moreover, within each trit-plane, we sort the trits according to their rate-distortion priorities and transmit more important information first. Since the compression network is less optimized for the cases of using fewer trit-planes, we develop a postprocessing network for refining reconstructed images at low rates. Experimental results show that DPICT outperforms conventional progressive codecs significantly, while enabling FGS transmission. Codes are available at https://github.com/jaehanlee-mcl/DPICT.
△ Less
Submitted 6 May, 2022; v1 submitted 12 December, 2021;
originally announced December 2021.
-
Sound-Guided Semantic Image Manipulation
Authors:
Seung Hyun Lee,
Wonseok Roh,
Wonmin Byeon,
Sang Ho Yoon,
Chan Young Kim,
**kyu Kim,
Sangpil Kim
Abstract:
The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with other sources rather than text, such as sound, is not easy due to the dynamic characteristics of the sources. Especially, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a fra…
▽ More
The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with other sources rather than text, such as sound, is not easy due to the dynamic characteristics of the sources. Especially, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space. We use a direct latent optimization method based on aligned embeddings for sound-guided image manipulation. We also show that our method can mix text and audio modalities, which enrich the variety of the image modification. We verify the effectiveness of our sound-guided image manipulation quantitatively and qualitatively. We also show that our method can mix different modalities, i.e., text and audio, which enrich the variety of the image modification. The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.
△ Less
Submitted 30 November, 2021;
originally announced December 2021.
-
Noise-resistant reconstruction algorithm based on the sinogram pattern
Authors:
Byung Chun Kim,
Hyunju Lee,
Kyungtaek Jun
Abstract:
We introduce a new CT image reconstruction algorithm that is less affected by various artifacts. The new reconstruction algorithm is a method of minimizing the difference between synchrotron X-ray tomography data and sinograms generated using Radon transform of CT images. The CT image is iteratively updated to reduce the difference from the sinogram of the data. This method can obtain clean CT ima…
▽ More
We introduce a new CT image reconstruction algorithm that is less affected by various artifacts. The new reconstruction algorithm is a method of minimizing the difference between synchrotron X-ray tomography data and sinograms generated using Radon transform of CT images. The CT image is iteratively updated to reduce the difference from the sinogram of the data. This method can obtain clean CT images from the projection data, which can create ring artifacts or metal artifacts. Also, even if the sample size is larger than the CCD and/or the projection image does not satisfy the Beer-Lambert law, a clean CT image can be reconstructed. Our new reconstruction algorithm can also be applied to fan beam CT or cone beam CT
△ Less
Submitted 19 November, 2021;
originally announced November 2021.
-
Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages
Authors:
Jiyeon Kim,
Mehul Kumar,
Dhananjaya Gowda,
Abhinav Garg,
Chanwoo Kim
Abstract:
In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-traine…
▽ More
In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-trained English model, unlabeled text corpus, and unlabeled audio corpus using transfer learning, TTS augmentation, and SSL respectively. In the first stage, we use transfer learning from a well-trained English model. This primarily helps in learning the acoustic information from a resource-rich language. This stage achieves around 24% relative Word Error Rate (WER) reduction over the baseline. In stage two, We utilize unlabeled text data via TTS data-augmentation to incorporate language information into the model. We also explore freezing the acoustic encoder at this stage. TTS data augmentation helps us further reduce the WER by ~ 21% relatively. Finally, In stage three we reduce the WER by another 4% relative by using SSL from unlabeled audio data. Overall, our two-pass speech recognition system with a Monotonic Chunkwise Attention (MoChA) in the first pass and a full-attention in the second pass achieves a WER reduction of ~ 42% relative to the baseline.
△ Less
Submitted 19 November, 2021;
originally announced November 2021.
-
A comparison of streaming models and data augmentation methods for robust speech recognition
Authors:
Jiyeon Kim,
Mehul Kumar,
Dhananjaya Gowda,
Abhinav Garg,
Chanwoo Kim
Abstract:
In this paper, we present a comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques, namely, multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variabil…
▽ More
In this paper, we present a comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques, namely, multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to noisy examples in the training set. It is observed that the final performance of the model depends on the proportion of training examples processed by data augmentation techniques. MoChA models generally perform better than RNN-T models. However, we observe that training of MoChA models seems to be more sensitive to various factors such as the characteristics of training sets and the incorporation of additional augmentations techniques. On the other hand, RNN-T models perform better than MoChA models in terms of latency, inference time, and the stability of training. Additionally, RNN-T models are generally more robust against noise and reverberation. All these advantages make RNN-T models a better choice for streaming on-device speech recognition compared to MoChA models.
△ Less
Submitted 18 November, 2021;
originally announced November 2021.
-
Decision Attentive Regularization to Improve Simultaneous Speech Translation Systems
Authors:
Mohd Abbas Zaidi,
Beomseok Lee,
Sangha Kim,
Chanwoo Kim
Abstract:
Simultaneous translation systems start producing the output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write the output. These decisions depend on the structure of source/target language and the information contained in the partial input sequence. Hence, read/write decision policy remains the same acro…
▽ More
Simultaneous translation systems start producing the output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write the output. These decisions depend on the structure of source/target language and the information contained in the partial input sequence. Hence, read/write decision policy remains the same across different input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input for improving simultaneous speech-to-text translation (SimulST). We propose Decision Attentive Regularization (DAR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of SimulMT task in improving SimulST performance. Overall, we achieve 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes for the MuST-C English-German (EnDe) SimulST task.
△ Less
Submitted 17 June, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Online Distributed Trajectory Planning for Quadrotor Swarm with Feasibility Guarantee using Linear Safe Corridor
Authors:
Jungwon Park,
Dabin Kim,
Gyeong Chan Kim,
Dahyun Oh,
H. ** Kim
Abstract:
This paper presents a new online multi-agent trajectory planning algorithm that guarantees to generate safe, dynamically feasible trajectories in a cluttered environment. The proposed algorithm utilizes a linear safe corridor (LSC) to formulate the distributed trajectory optimization problem with only feasible constraints, so it does not resort to slack variables or soft constraints to avoid optim…
▽ More
This paper presents a new online multi-agent trajectory planning algorithm that guarantees to generate safe, dynamically feasible trajectories in a cluttered environment. The proposed algorithm utilizes a linear safe corridor (LSC) to formulate the distributed trajectory optimization problem with only feasible constraints, so it does not resort to slack variables or soft constraints to avoid optimization failure. We adopt a priority-based goal planning method to prevent the deadlock without an additional procedure to decide which robot to yield. The proposed algorithm can compute the trajectories for 60 agents on average 15.5 ms per agent with an Intel i7 laptop and shows a similar flight distance and distance compared to the baselines based on soft constraints. We verified that the proposed method can reach the goal without deadlock in both the random forest and the indoor space, and we validated the safety and operability of the proposed algorithm through a real flight test with ten quadrotors in a maze-like environment.
△ Less
Submitted 3 January, 2022; v1 submitted 18 September, 2021;
originally announced September 2021.
-
IceNet for Interactive Contrast Enhancement
Authors:
Keunsoo Ko,
Chang-Su Kim
Abstract:
A CNN-based interactive contrast enhancement algorithm, called IceNet, is proposed in this work, which enables a user to adjust image contrast easily according to his or her preference. Specifically, a user provides a parameter for controlling the global brightness and two types of scribbles to darken or brighten local regions in an image. Then, given these annotations, IceNet estimates a gamma ma…
▽ More
A CNN-based interactive contrast enhancement algorithm, called IceNet, is proposed in this work, which enables a user to adjust image contrast easily according to his or her preference. Specifically, a user provides a parameter for controlling the global brightness and two types of scribbles to darken or brighten local regions in an image. Then, given these annotations, IceNet estimates a gamma map for the pixel-wise gamma correction. Finally, through color restoration, an enhanced image is obtained. The user may provide annotations iteratively to obtain a satisfactory image. IceNet is also capable of producing a personalized enhanced image automatically, which can serve as a basis for further adjustment if so desired. Moreover, to train IceNet effectively and reliably, we propose three differentiable losses. Extensive experiments show that IceNet can provide users with satisfactorily enhanced images.
△ Less
Submitted 25 December, 2021; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Learning JPEG Compression Artifacts for Image Manipulation Detection and Localization
Authors:
Myung-Joon Kwon,
Seung-Hun Nam,
In-Jae Yu,
Heung-Kyu Lee,
Changick Kim
Abstract:
Detecting and localizing image manipulation are necessary to counter malicious use of image editing techniques. Accordingly, it is essential to distinguish between authentic and tampered regions by analyzing intrinsic statistics in an image. We focus on JPEG compression artifacts left during image acquisition and editing. We propose a convolutional neural network (CNN) that uses discrete cosine tr…
▽ More
Detecting and localizing image manipulation are necessary to counter malicious use of image editing techniques. Accordingly, it is essential to distinguish between authentic and tampered regions by analyzing intrinsic statistics in an image. We focus on JPEG compression artifacts left during image acquisition and editing. We propose a convolutional neural network (CNN) that uses discrete cosine transform (DCT) coefficients, where compression artifacts remain, to localize image manipulation. Standard CNNs cannot learn the distribution of DCT coefficients because the convolution throws away the spatial coordinates, which are essential for DCT coefficients. We illustrate how to design and train a neural network that can learn the distribution of DCT coefficients. Furthermore, we introduce Compression Artifact Tracing Network (CAT-Net) that jointly uses image acquisition artifacts and compression artifacts. It significantly outperforms traditional and deep neural network-based methods in detecting and localizing tampered regions.
△ Less
Submitted 25 May, 2022; v1 submitted 29 August, 2021;
originally announced August 2021.
-
Streaming end-to-end speech recognition with jointly trained neural feature enhancement
Authors:
Chanwoo Kim,
Abhinav Garg,
Dhananjaya Gowda,
Seongkyu Mun,
Changwoo Han
Abstract:
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples,…
▽ More
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, speech recognition accuracy of a MoCha-based model for clean speech drops significantly when a multi-style training approach is applied. Inspired by Curriculum Learning [1], we introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL). With GAEF, the model is initially trained using clean features. Subsequently, the portion of outputs from the enhancement layers gradually increases. With GREL, the portion of the Mean Squared Error (MSE) loss for the enhanced output gradually reduces as training proceeds. In experimental results on the LibriSpeech corpus and noisy far-field test sets, the proposed model with GAEF-GREL training strategies shows significantly better results than the conventional multi-style training approach.
△ Less
Submitted 3 May, 2021;
originally announced May 2021.
-
Deep Learning-based High-precision Depth Map Estimation from Missing Viewpoints for 360 Degree Digital Holography
Authors:
Hakdong Kim,
Heonyeong Lim,
Minkyu Jee,
Yurim Lee,
Jisoo Jeong,
Kyudam Choi,
MinSung Yoon,
Cheongwon Kim
Abstract:
In this paper, we propose a novel, convolutional neural network model to extract highly precise depth maps from missing viewpoints, especially well applicable to generate holographic 3D contents. The depth map is an essential element for phase extraction which is required for synthesis of computer-generated hologram (CGH). The proposed model called the HDD Net uses MSE for the better performance o…
▽ More
In this paper, we propose a novel, convolutional neural network model to extract highly precise depth maps from missing viewpoints, especially well applicable to generate holographic 3D contents. The depth map is an essential element for phase extraction which is required for synthesis of computer-generated hologram (CGH). The proposed model called the HDD Net uses MSE for the better performance of depth map estimation as loss function, and utilizes the bilinear interpolation in up sampling layer with the Relu as activation function. We design and prepare a total of 8,192 multi-view images, each resolution of 640 by 360 for the deep learning study. The proposed model estimates depth maps through extracting features, up sampling. For quantitative assessment, we compare the estimated depth maps with the ground truths by using the PSNR, ACC, and RMSE. We also compare the CGH patterns made from estimated depth maps with ones made from ground truths. Furthermore, we demonstrate the experimental results to test the quality of estimated depth maps through directly reconstructing holographic 3D image scenes from the CGHs.
△ Less
Submitted 8 March, 2021;
originally announced March 2021.
-
Ensemble Transfer Learning of Elastography and B-mode Breast Ultrasound Images
Authors:
Sampa Misra,
Seungwan Jeon,
Ravi Managuli,
Seiyon Lee,
Gyuwon Kim,
Seungchul Lee,
Richard G Barr,
Chulhong Kim
Abstract:
Computer-aided detection (CAD) of benign and malignant breast lesions becomes increasingly essential in breast ultrasound (US) imaging. The CAD systems rely on imaging features identified by the medical experts for their performance, whereas deep learning (DL) methods automatically extract features from the data. The challenge of the DL is the insufficiency of breast US images available to train t…
▽ More
Computer-aided detection (CAD) of benign and malignant breast lesions becomes increasingly essential in breast ultrasound (US) imaging. The CAD systems rely on imaging features identified by the medical experts for their performance, whereas deep learning (DL) methods automatically extract features from the data. The challenge of the DL is the insufficiency of breast US images available to train the DL models. Here, we present an ensemble transfer learning model to classify benign and malignant breast tumors using B-mode breast US (B-US) and strain elastography breast US (SE-US) images. This model combines semantic features from AlexNet & ResNet models to classify benign from malignant tumors. We use both B-US and SE-US images to train the model and classify the tumors. We retrospectively gathered 85 patients' data, with 42 benign and 43 malignant cases confirmed with the biopsy. Each patient had multiple B-US and their corresponding SE-US images, and the total dataset contained 261 B-US images and 261 SE-US images. Experimental results show that our ensemble model achieves a sensitivity of 88.89% and specificity of 91.10%. These diagnostic performances of the proposed method are equivalent to or better than manual identification. Thus, our proposed ensemble learning method would facilitate detecting early breast cancer, reliably improving patient care.
△ Less
Submitted 16 February, 2021;
originally announced February 2021.
-
Decentralized Attribution of Generative Models
Authors:
Changhoon Kim,
Yi Ren,
Yezhou Yang
Abstract:
Growing applications of generative models have led to new threats such as malicious personation and digital copyright infringement. One solution to these threats is model attribution, i.e., the identification of user-end models where the contents under question are generated from. Existing studies showed empirical feasibility of attribution through a centralized classifier trained on all user-end…
▽ More
Growing applications of generative models have led to new threats such as malicious personation and digital copyright infringement. One solution to these threats is model attribution, i.e., the identification of user-end models where the contents under question are generated from. Existing studies showed empirical feasibility of attribution through a centralized classifier trained on all user-end models. However, this approach is not scalable in reality as the number of models ever grows. Neither does it provide an attributability guarantee. To this end, this paper studies decentralized attribution, which relies on binary classifiers associated with each user-end model. Each binary classifier is parameterized by a user-specific key and distinguishes its associated model distribution from the authentic data distribution. We develop sufficient conditions of the keys that guarantee an attributability lower bound. Our method is validated on MNIST, CelebA, and FFHQ datasets. We also examine the trade-off between generation quality and robustness of attribution against adversarial post-processes.
△ Less
Submitted 28 April, 2021; v1 submitted 26 October, 2020;
originally announced October 2020.
-
Multiview Variational Deep Learning with Application to Scalable Indoor Localization
Authors:
Minseuk Kim,
Changjun Kim,
Dongsoo Han,
June-Koo Kevin Rhee
Abstract:
Radio channel state information (CSI) measured with many receivers is a good resource for localizing a transmit device with machine learning with a discriminative model. However, CSI localization is nontrivial when the radio map is complicated, such as in building corridors. This paper introduces a view-selective deep learning (VSDL) system for indoor localization using CSI of WiFi. The multiview…
▽ More
Radio channel state information (CSI) measured with many receivers is a good resource for localizing a transmit device with machine learning with a discriminative model. However, CSI localization is nontrivial when the radio map is complicated, such as in building corridors. This paper introduces a view-selective deep learning (VSDL) system for indoor localization using CSI of WiFi. The multiview training with CSI obtained from multiple groups of access points (APs) generates latent features on supervised variational deep network. This information is then applied to an additional network for dominant view classification to enhance the regression accuracy of localization. As non-informative latent features from multiple views are rejected, we can achieve a localization accuracy of 0.77 m, which outperforms by 30 $\%$ the best known accuracy in practical applications in a real building environment. To the best of our knowledge, this is the first approach to apply variational inference and to construct a scalable system for radio localization. Furthermore, our work investigates a methodology for supervised learning with multiview data where informative and non-informative views coexist.
△ Less
Submitted 25 August, 2020;
originally announced August 2020.