Showing 1–50 of 199 results for author: Kim, M

Searching in archive eess.
  1. arXiv:2406.17310  [pdf, other]

    eess.AS

    High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

    Authors: Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim

    Abstract: We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target v…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech2024

  2. arXiv:2406.16716  [pdf, other]

    eess.AS cs.CR cs.SD

    One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

    Authors: Hyun Myung Kim, Kangwook Jang, Hoirin Kim

    Abstract: As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafid…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  3. arXiv:2406.15225  [pdf, other]

    cs.AI cs.RO eess.SP

    Deep UAV Path Planning with Assured Connectivity in Dense Urban Setting

    Authors: Jiyong Oh, Syed M. Raza, Lusungu J. Mwasinga, Moonseong Kim, Hyunseung Choo

    Abstract: Unmanned Aerial Vehicle (UAV) services with 5G connectivity is an emerging field with numerous applications. Operator-controlled UAV flights and manual static flight configurations are major limitations for the wide adoption of scalability of UAV services. Several services depend on excellent UAV connectivity with a cellular network and maintaining it is challenging in predetermined flight paths. T…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: 5 pages, 4 figures, Published in the 2024 IEEE Network Operations and Management Symposium (NOMS 2024)

  4. arXiv:2406.12688  [pdf, other]

    eess.AS eess.SP

    Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech Generation

    Authors: Miseul Kim, Soo-Whan Chung, Youna Ji, Hong-Goo Kang, Min-Seok Choi

    Abstract: This paper introduces a novel task in generative speech processing, Acoustic Scene Transfer (AST), which aims to transfer acoustic scenes of speech signals to diverse environments. AST promises an immersive experience in speech perception by adapting the acoustic scene behind speech signals to desired environments. We propose AST-LDM for the AST task, which generates speech signals accompanied by…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  5. arXiv:2406.12254  [pdf, other]

    eess.IV cs.CV

    Enhancing Single-Slice Segmentation with 3D-to-2D Unpaired Scan Distillation

    Authors: Xin Yu, Qi Yang, Han Liu, Ho Hin Lee, Yucheng Tang, Lucas W. Remedios, Michael Kim, Shunxing Bao, Ann Xenobia Moore, Luigi Ferrucci, Bennett A. Landman

    Abstract: 2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmenta…

    Submitted 18 June, 2024; originally announced June 2024.

  6. arXiv:2406.09025  [pdf, other]

    eess.SP

    Site-Specific Radio Channel Representation -- Current State and Future Applications

    Authors: Thomas Zemen, Jorge Gomez-Ponce, Aniruddha Chandra, Michael Walter, Enes Aksoy, Ruisi He, David Matolak, Minseok Kim, Jun-ichi Takada, Sana Salous, Reinaldo Valenzuela, Andreas F. Molisch

    Abstract: A site-specific radio channel representation considers the surroundings of the communication system through the environment geometry, such as buildings, vegetation, and mobile objects including their material and surface properties. In this article, we focus on communication technologies for 5G and beyond that are increasingly able to exploit the specific environment geometry for both communicatio…

    Submitted 18 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: 7 pages, 5 figures, submitted to the IEEE Communication Magazine

  7. arXiv:2406.08328  [pdf, other]

    eess.AS

    Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation

    Authors: Tsun-An Hsieh, Heeyoul Choi, Minje Kim

    Abstract: Recent studies highlight the potential of textual modalities in conditioning the speech separation model's inference process. However, regularization-based methods remain underexplored despite their advantages of not requiring auxiliary text data during the test time. To address this gap, we introduce a timed text-based regularization (TTR) method that uses language model-derived semantics to impr…

    Submitted 12 June, 2024; originally announced June 2024.

  8. arXiv:2406.05965  [pdf, other]

    eess.AS cs.AI

    MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

    Authors: Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim

    Abstract: In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of the diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancin…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  9. arXiv:2405.02996  [pdf, other]

    cs.SD cs.AI eess.AS

    RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification

    Authors: June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung

    Abstract: Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, intuitively would share closer resemblance to lung sounds. This paper explores the efficacy of pretrain…

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: Accepted EMBC 2024

  10. arXiv:2405.01681  [pdf, other]

    eess.SY

    Accounting for the Effects of Probabilistic Uncertainty During Fast Charging of Lithium-ion Batteries

    Authors: Minsu Kim, Joachim Schaeffer, Marc D. Berliner, Berta Pedret Sagnier, Rolf Findeisen, Richard D. Braatz

    Abstract: Batteries are nonlinear dynamical systems that can be modeled by Porous Electrode Theory models. The aim of optimal fast charging is to reduce the charging time while keeping battery degradation low. Most past studies assume that model parameters and ambient temperature are a fixed known value and that all PET model parameters are perfectly known. In real battery operation, however, the ambient te…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 6 pages, 5 figures, accepted for ACC 2024

  11. arXiv:2404.16065  [pdf, other]

    cs.HC eess.SP

    mmWave Wearable Antenna for Interaction with VR Devices

    Authors: Haksun Son, Song Min Kim

    Abstract: The VR industry is one of the most promising industries for the near future, as it can provide a more immersive connection between people and the virtual world. Currently, VR devices interact with people using inconvenient controllers or cameras that perform poorly in dark environments. Interaction through millimeter-wave wearable devices has the potential to conveniently track human behavior rega…

    Submitted 19 April, 2024; originally announced April 2024.

  12. arXiv:2404.03154  [pdf, ps, other]

    eess.SP

    Age-of-Information-Aware Distributed Task Offloading and Resource Allocation in Mobile Edge Computing Networks

    Authors: Minwoo Kim, Jonggyu Jang, Youngchol Choi, Hyun Jong Yang

    Abstract: The growth in artificial intelligence (AI) technology has attracted substantial interests in age-of-information (AoI)-aware task offloading of mobile edge computing (MEC)-namely, minimizing service latency. Additionally, the use of MEC systems poses an additional problem arising from limited battery resources of MDs. This paper tackles the pressing challenge of AoI-aware distributed task offloadin…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 17 pages, 8 figures

  13. arXiv:2404.01816  [pdf, other]

    eess.IV cs.CV cs.HC

    Rethinking Annotator Simulation: Realistic Evaluation of Whole-Body PET Lesion Interactive Segmentation Methods

    Authors: Zdravko Marinov, Moon Kim, Jens Kleesiek, Rainer Stiefelhagen

    Abstract: Interactive segmentation plays a crucial role in accelerating the annotation, particularly in domains requiring specialized expertise such as nuclear medicine. For example, annotating lesions in whole-body Positron Emission Tomography (PET) images can require over an hour per volume. While previous works evaluate interactive segmentation models through either real user studies or simulated annotat…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 10 pages, 5 figures, 1 table

  14. Personalized Neural Speech Codec

    Authors: Inseon Jang, Haici Yang, Wootaek Lim, Seungkwon Beack, Minje Kim

    Abstract: In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce the model complexity or improve perceptual speech quality. Despite the common usage of speech codecs where only a single talker is involved on each side of the communication, personalizing a codec for the specific user has rarely been explored in the literature. First, we assume speakers can b…

    Submitted 31 March, 2024; originally announced April 2024.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 991-995

  15. A Comparative Analysis of Poetry Reading Audio: Singing, Narrating, or Somewhere In Between?

    Authors: Kahyun Choi, Minje Kim

    Abstract: This paper provides a computational analysis of poetry reading audio signals at a large scale to unveil the musicality within professionally-read poems. Although the acoustic characteristics of other types of spoken language have been extensively studied, most of the literature is limited to narrative speech or singing voice, discussing how different they are from each other. In this work, we deve…

    Submitted 31 March, 2024; originally announced April 2024.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1296-1300

  16. arXiv:2403.19132  [pdf, ps, other]

    eess.SP

    Meta-Heuristic Fronthaul Bit Allocation for Cell-free Massive MIMO Systems

    Authors: Minje Kim, In-soo Kim, Junil Choi

    Abstract: Limited capacity of fronthaul links in a cell-free massive multiple-input multiple-output (MIMO) system can cause quantization errors at a central processing unit (CPU) during data transmission, complicating the centralized rate optimization problem. Addressing this challenge, we propose a harmony search (HS)-based algorithm that renders the combinatorial non-convex problem tractable. One of the d…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 16 pages, 13 figures, accepted to IEEE Transactions on Wireless Communications (TWC)

  17. arXiv:2403.18992  [pdf]

    eess.IV

    Tractography with T1-weighted MRI and associated anatomical constraints on clinical quality diffusion MRI

    Authors: Tian Yu, Yunhe Li, Michael E. Kim, Chenyu Gao, Qi Yang, Leon Y. Cai, Susane M. Resnick, Lori L. Beason-Held, Daniel C. Moyer, Kurt G. Schilling, Bennett A. Landman

    Abstract: Diffusion MRI (dMRI) streamline tractography, the gold standard for in vivo estimation of brain white matter (WM) pathways, has long been considered indicative of macroscopic relationships with WM microstructure. However, recent advances in tractography demonstrated that convolutional recurrent neural networks (CoRNN) trained with a teacher-student framework have the ability to learn and propagate…

    Submitted 27 March, 2024; originally announced March 2024.

  18. arXiv:2403.09967  [pdf, other]

    eess.SP

    NR-Surface: NextG-ready $μ$W-reconfigurable mmWave Metasurface

    Authors: Minseok Kim, Namjo Ahn, Song Min Kim

    Abstract: Metasurface has recently emerged as an economic solution to expand mmWave coverage. However, their pervasive deployment remains a challenge, mainly due to the difficulty in reaching the tight 260ns NR synchronization requirement and real-time wireless reconfiguration while maintaining multi-year battery life. This paper presents NR-Surface, the first real-time reconfigurable metasurface fully comp…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: 17 pages, 28 figures, to be published in NSDI '24

  19. arXiv:2403.08187  [pdf, other]

    cs.CL cs.SD eess.AS

    Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children

    Authors: Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam

    Abstract: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children wit…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 12 pages, 2 figures

    ACM Class: I.2.7

  20. arXiv:2402.16307  [pdf, ps, other]

    eess.SP

    Analyzing Downlink Coverage in Clustered Low Earth Orbit Satellite Constellations: A Stochastic Geometry Approach

    Authors: Miyeon Lee, Sucheol Kim, Minje Kim, Dong-Hyun Jung, Junil Choi

    Abstract: Satellite networks are emerging as vital solutions for global connectivity beyond 5G. As companies such as SpaceX, OneWeb, and Amazon are poised to launch a large number of satellites in low Earth orbit, the heightened inter-satellite interference caused by mega-constellations has become a significant concern. To address this challenge, recent works have introduced the concept of satellite cluster…

    Submitted 29 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: submitted to IEEE Transactions on Communications

  21. arXiv:2402.16021  [pdf, other]

    cs.CL cs.AI cs.CV eess.AS

    TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

    Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

    Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, whe…

    Submitted 25 February, 2024; originally announced February 2024.

  22. arXiv:2402.15151  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

    Authors: Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

    Abstract: In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM),…

    Submitted 13 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: An Erratum was added on the last page of this paper

  23. arXiv:2401.09802  [pdf, other]

    eess.AS cs.CV cs.SD

    Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units

    Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro

    Abstract: This paper explores sentence-level Multilingual Visual Speech Recognition with a single model for the first time. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, the proposed visual speech unit is obtained by discretizing the visual spee…

    Submitted 18 January, 2024; originally announced January 2024.

  24. arXiv:2401.06798  [pdf]

    q-bio.NC eess.IV

    Evaluation of Mean Shift, ComBat, and CycleGAN for Harmonizing Brain Connectivity Matrices Across Sites

    Authors: Hanliang Xu, Nancy R. Newlin, Michael E. Kim, Chenyu Gao, Praitayini Kanakaraj, Aravind R. Krishnan, Lucas W. Remedios, Nazirah Mohd Khairi, Kimberly Pechman, Derek Archer, Timothy J. Hohman, Angela L. Jefferson, The BIOCARD Study Team, Ivana Isgum, Yuankai Huo, Daniel Moyer, Kurt G. Schilling, Bennett A. Landman

    Abstract: Connectivity matrices derived from diffusion MRI (dMRI) provide an interpretable and generalizable way of understanding the human brain connectome. However, dMRI suffers from inter-site and between-scanner variation, which impedes analysis across datasets to improve robustness and reproducibility of results. To evaluate different harmonization approaches on connectivity matrices, we compared graph…

    Submitted 24 January, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Comments: 11 pages, 5 figures, to be published in SPIE Medical Imaging 2024: Image Processing

  25. arXiv:2401.03567  [pdf, other]

    eess.AS cs.SD

    Hyperbolic Distance-Based Speech Separation

    Authors: Darius Petermann, Minje Kim

    Abstract: In this work, we explore the task of hierarchical distance-based speech separation defined on a hyperbolic manifold. Based on the recent advent of audio-related tasks performed in non-Euclidean spaces, we propose to make use of the Poincaré ball to effectively unveil the inherent hierarchical structure found in complex speaker mixtures. We design two sets of experiments in which the distance-based…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: To be published at ICASSP2024, 14th of April 2024, Seoul, South Korea. Copyright (c) 2023 IEEE. 5 pages, 2 figures, 3 tables

  26. arXiv:2401.03060  [pdf]

    eess.IV cs.CV

    Super-resolution multi-contrast unbiased eye atlases with deep probabilistic refinement

    Authors: Ho Hin Lee, Adam M. Saunders, Michael E. Kim, Samuel W. Remedios, Lucas W. Remedios, Yucheng Tang, Qi Yang, Xin Yu, Shunxing Bao, Chloe Cho, Louise A. Mawn, Tonia S. Rex, Kevin L. Schey, Blake E. Dewey, Jeffrey M. Spraggins, Jerry L. Prince, Yuankai Huo, Bennett A. Landman

    Abstract: Purpose: Eye morphology varies significantly across the population, especially for the orbit and optic nerve. These variations limit the feasibility and robustness of generalizing population-wise features of eye organs to an unbiased spatial reference. Approach: To tackle these limitations, we propose a process for creating high-resolution unbiased eye atlases. First, to restore spatial details…

    Submitted 14 June, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

    Comments: Revised for submission to SPIE Journal of Medical Imaging. 26 pages, 6 figures

  27. arXiv:2401.01498  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

    Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Semin Kim, Joun Yeop Lee, Nam Soo Kim

    Abstract: We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For a robust and efficient alignment modeling, we employ a neural transducer named token trans…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  28. arXiv:2401.01099  [pdf, other]

    eess.AS cs.AI cs.LG

    Efficient Parallel Audio Generation using Group Masked Language Modeling

    Authors: Myeonghun Jeong, Minchan Kim, Joun Yeop Lee, Nam Soo Kim

    Abstract: We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference speed compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  29. arXiv:2312.13615  [pdf, other]

    eess.AS cs.SD eess.SP

    Self-supervised Complex Network for Machine Sound Anomaly Detection

    Authors: Miseul Kim, Minh Tri Ho, Hong-Goo Kang

    Abstract: In this paper, we propose an anomaly detection algorithm for machine sounds with a deep complex network trained by self-supervision. Using the fact that phase continuity information is crucial for detecting abnormalities in time-series signals, our proposed algorithm utilizes the complex spectrum as an input and performs complex number arithmetic throughout the entire process. Since the usefulness…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Published in EUSIPCO 2021

  30. arXiv:2312.13603  [pdf, other]

    eess.AS cs.SD

    Style Modeling for Multi-Speaker Articulation-to-Speech

    Authors: Miseul Kim, Zhenyu Piao, Jihyun Lee, Hong-Goo Kang

    Abstract: In this paper, we propose a neural articulation-to-speech (ATS) framework that synthesizes high-quality speech from articulatory signal in a multi-speaker situation. Most conventional ATS approaches only focus on modeling contextual information of speech from a single speaker's articulatory features. To explicitly represent each speaker's speaking style as well as the contextual information, our p…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 5 pages, Accepted to ICASSP 2023

  31. arXiv:2312.13600  [pdf, other]

    eess.AS cs.SD

    BrainTalker: Low-Resource Brain-to-Speech Synthesis with Transfer Learning using Wav2Vec 2.0

    Authors: Miseul Kim, Zhenyu Piao, Jihyun Lee, Hong-Goo Kang

    Abstract: Decoding spoken speech from neural activity in the brain is a fast-emerging research topic, as it could enable communication for people who have difficulties with producing audible speech. For this task, electrocorticography (ECoG) is a common method for recording brain activity with high temporal resolution and high spatial precision. However, due to the risky surgical procedure required for obta…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 5 pages. Accepted to BHI 2023

  32. arXiv:2312.09572  [pdf, other]

    eess.AS cs.CL cs.SD

    IR-UWB Radar-Based Contactless Silent Speech Recognition of Vowels, Consonants, Words, and Phrases

    Authors: Sunghwa Lee, Younghoon Shin, Myungjong Kim, Jiwon Seo

    Abstract: Several sensing techniques have been proposed for silent speech recognition (SSR); however, many of these methods require invasive processes or sensor attachment to the skin using adhesive tape or glue, rendering them unsuitable for frequent use in daily life. By contrast, impulse radio ultra-wideband (IR-UWB) radar can operate without physical contact with users' articulators and related body par…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Submitted to IEEE Access

  33. arXiv:2312.08136  [pdf, other]

    cs.CV eess.IV

    ProNeRF: Learning Efficient Projection-Aware Ray Sampling for Fine-Grained Implicit Neural Radiance Fields

    Authors: Juan Luis Gonzalez Bello, Minh-Quan Viet Bui, Munchurl Kim

    Abstract: Recent advances in neural rendering have shown that, albeit slow, implicit compact models can learn a scene's geometries and view-dependent appearances from multiple views. To maintain such a small memory footprint but achieve faster inference times, recent works have adopted 'sampler' networks that adaptively sample a small subset of points along each ray in the implicit neural radiance fields. A…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Visit our project website at https://kaist-viclab.github.io/pronerf-site/

  34. arXiv:2312.08071  [pdf, other]

    cs.CV eess.IV

    Novel View Synthesis with View-Dependent Effects from a Single Image

    Authors: Juan Luis Gonzalez Bello, Munchurl Kim

    Abstract: In this paper, we firstly consider view-dependent effects into single image-based novel view synthesis (NVS) problems. For this, we propose to exploit the camera motion priors in NVS to model view-dependent appearance or effects (VDE) as the negative disparity in the scene. By recognizing specularities "follow" the camera motion, we infuse VDEs into the input images by aggregating input pixel colo…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Visit our website https://kaist-viclab.github.io/monovde-site

  35. arXiv:2312.02512  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

    Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

    Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast…

    Submitted 26 March, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Code & Demo: https://choijeongsoo.github.io/av2av

  36. arXiv:2311.15683  [pdf]

    eess.AS cs.SD eess.SP

    Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency

    Authors: Chenyu Tang, Muzi Xu, Wentian Yi, Zibo Zhang, Edoardo Occhipinti, Chaoqun Dong, Dafydd Ravenscroft, Sung-Min Jung, Sanghyo Lee, Shuo Gao, Jong Min Kim, Luigi G. Occhipinti

    Abstract: Our research presents a wearable Silent Speech Interface (SSI) technology that excels in device comfort, time-energy efficiency, and speech decoding accuracy for real-world use. We developed a biocompatible, durable textile choker with an embedded graphene-based strain sensor, capable of accurately detecting subtle throat movements. This sensor, surpassing other strain sensors in sensitivity by 42…

    Submitted 7 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: 5 figures in the article; 11 figures and 4 tables in supplementary information

    Journal ref: npj Flexible Electronics (2024)

  37. arXiv:2311.14482  [pdf, other]

    eess.IV cs.AI cs.CV cs.HC

    Sliding Window FastEdit: A Framework for Lesion Annotation in Whole-body PET Images

    Authors: Matthias Hadlich, Zdravko Marinov, Moon Kim, Enrico Nasca, Jens Kleesiek, Rainer Stiefelhagen

    Abstract: Deep learning has revolutionized the accurate segmentation of diseases in medical imaging. However, achieving such results requires training with numerous manual voxel annotations. This requirement presents a challenge for whole-body Positron Emission Tomography (PET) imaging, where lesions are scattered throughout the body. To tackle this problem, we introduce SW-FastEdit - an interactive segment…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: 5 pages, 2 figures, 4 tables

  38. arXiv:2311.08330  [pdf, other]

    eess.AS cs.SD

    Generative De-Quantization for Neural Speech Codec via Latent Diffusion

    Authors: Haici Yang, Inseon Jang, Minje Kim

    Abstract: In low-bitrate speech coding, end-to-end speech coding networks aim to learn compact yet expressive features and a powerful decoder in a single network. A challenging problem as such results in unwelcome complexity increase and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec for learnin…

    Submitted 15 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: Submitted to ICASSP 2024

  39. arXiv:2311.04468  [pdf]

    eess.IV q-bio.NC

    A human brain atlas of chi-separation for normative iron and myelin distributions

    Authors: Kyeongseon Min, Beomseok Sohn, Woo Jung Kim, Chae Jung Park, Soohwa Song, Dong Hoon Shin, Kyung Won Chang, Na-Young Shin, Minjun Kim, Hyeong-Geol Shin, Phil Hyu Lee, Jongho Lee

    Abstract: Iron and myelin are primary susceptibility sources in the human brain. These substances are essential for a healthy brain, and their abnormalities are often related to various neurological disorders. Recently, an advanced susceptibility mapping technique, which is referred to as chi-separation, has been proposed, successfully disentangling paramagnetic iron from diamagnetic myelin. This method opene…

    Submitted 2 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 19 pages, 9 figures

  40. arXiv:2311.03500  [pdf]

    eess.IV cs.CV q-bio.NC

    Predicting Age from White Matter Diffusivity with Residual Learning

    Authors: Chenyu Gao, Michael E. Kim, Ho Hin Lee, Qi Yang, Nazirah Mohd Khairi, Praitayini Kanakaraj, Nancy R. Newlin, Derek B. Archer, Angela L. Jefferson, Warren D. Taylor, Brian D. Boyd, Lori L. Beason-Held, Susan M. Resnick, The BIOCARD Study Team, Yuankai Huo, Katherine D. Van Schaik, Kurt G. Schilling, Daniel Moyer, Ivana Išgum, Bennett A. Landman

    Abstract: Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis,…

    Submitted 21 January, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: SPIE Medical Imaging: Image Processing. San Diego, CA. February 2024 (accepted as poster presentation)

  41. arXiv:2311.02898  [pdf, other]

    eess.AS cs.LG

    Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

    Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Dongjune Lee, Nam Soo Kim

    Abstract: We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semant…

    Submitted 8 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: Accepted at ASRU2023

  42. arXiv:2311.00364  [pdf, other]

    eess.AS cs.SD physics.bio-ph

    C2C: Cough to COVID-19 Detection in BHI 2023 Data Challenge

    Authors: Woo-Jin Chung, Miseul Kim, Hong-Goo Kang

    Abstract: This report describes our submission to BHI 2023 Data Competition: Sensor challenge. Our Audio Alchemists team designed an acoustic-based COVID-19 diagnosis system, Cough to COVID-19 (C2C), and won the 1st place in the challenge. C2C involves three key contributions: pre-processing of input signals, cough-related representation extraction leveraging Wav2vec2.0, and data augmentation. Through exper…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 1st place winning paper from the BHI 2023 Data Challenge Competition: Sensor Informatics

  43. arXiv:2310.05934  [pdf, other]

    cs.CV cs.AI cs.MM eess.IV

    DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

    Authors: Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro

    Abstract: Speech-driven 3D facial animation has gained significant attention for its ability to create realistic and expressive facial animations in 3D space based on speech. Learning-based methods have shown promising progress in achieving accurate facial motion synchronized with speech. However, one-to-many nature of speech-to-3D facial synthesis has not been fully explored: while the lip accurately synch…

    Submitted 23 August, 2023; originally announced October 2023.

  44. arXiv:2310.04010  [pdf, other]

    cs.CV cs.AI eess.IV

    Excision And Recovery: Visual Defect Obfuscation Based Self-Supervised Anomaly Detection Strategy

    Authors: YeongHyeon Park, Sungho Kang, Myung Jin Kim, Yeonho Lee, Hyeong Seok Kim, Juneho Yi

    Abstract: Due to scarcity of anomaly situations in the early manufacturing stage, an unsupervised anomaly detection (UAD) approach is widely adopted which only uses normal samples for training. This approach is based on the assumption that the trained UAD model will accurately reconstruct normal patterns but struggles with unseen anomalous patterns. To enhance the UAD performance, reconstruction-by-inpainti…

    Submitted 9 November, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: 10 pages, 5 figures, 5 tables

  45. arXiv:2310.01413  [pdf]

    eess.IV cs.AI cs.CV

    A multi-institutional pediatric dataset of clinical radiology MRIs by the Children's Brain Tumor Network

    Authors: Ariana M. Familiar, Anahita Fathi Kazerooni, Hannah Anderson, Aliaksandr Lubneuski, Karthik Viswanathan, Rocky Breslow, Nastaran Khalili, Sina Bagheri, Debanjan Haldar, Meen Chul Kim, Sherjeel Arif, Rachel Madhogarhia, Thinh Q. Nguyen, Elizabeth A. Frenkel, Zeinab Helili, Jessica Harrison, Keyvan Farahani, Marius George Linguraru, Ulas Bagci, Yury Velichko, Jeffrey Stevens, Sarah Leary, Robert M. Lober, Stephani Campion, Amy A. Smith, et al. (15 additional authors not shown)

    Abstract: Pediatric brain and spinal cancers remain the leading cause of cancer-related death in children. Advancements in clinical decision-support in pediatric neuro-oncology utilizing the wealth of radiology imaging data collected through standard care, however, has significantly lagged other domains. Such data is ripe for use with predictive analytics such as artificial intelligence (AI) methods, which…

    Submitted 2 October, 2023; originally announced October 2023.

  46. arXiv:2309.12566  [pdf, other]

    cs.RO eess.SY math.OC

    Recent Advances in Path Integral Control for Trajectory Optimization: An Overview in Theoretical and Algorithmic Perspectives

    Authors: Muhammad Kazim, JunGee Hong, Min-Gyeom Kim, Kwang-Ki K. Kim

    Abstract: This paper presents a tutorial overview of path integral (PI) control approaches for stochastic optimal control and trajectory optimization. We concisely summarize the theoretical development of path integral control to compute a solution for stochastic optimal control and provide algorithmic descriptions of the cross-entropy (CE) method, an open-loop controller using the receding horizon scheme k…

    Submitted 1 December, 2023; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: 16 pages, 9 figures

    MSC Class: 68T40; 13P25 ACM Class: I.2.9; I.2.8; G.1.6; G.4

  47. arXiv:2309.12047  [pdf, other]

    cs.CV cs.GR eess.IV

    Self-Calibrating, Fully Differentiable NLOS Inverse Rendering

    Authors: Kiseok Choi, Inchul Kim, Dongyoung Choi, Julio Marco, Diego Gutierrez, Min H. Kim

    Abstract: Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully…

    Submitted 25 September, 2023; v1 submitted 21 September, 2023; originally announced September 2023.

    Journal ref: Proceedings of ACM SIGGRAPH Asia 2023 (December 2023)

  48. arXiv:2309.08535  [pdf, other]

    cs.CV cs.AI eess.AS

    Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

    Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

    Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the…

    Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  49. arXiv:2309.08531  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

    Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

    Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s…

    Submitted 15 September, 2023; originally announced September 2023.

  50. arXiv:2309.07926  [pdf, other]

    eess.IV cs.CV

    COMPASS: High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability

    Authors: Jongmin Park, Jooyoung Lee, Munchurl Kim

    Abstract: Recently, neural network (NN)-based image compression studies have actively been made and has shown impressive performance in comparison to traditional methods. However, most of the works have focused on non-scalable image compression (single-layer coding) while spatially scalable image compression has drawn less attention although it has many applications. In this paper, we propose a novel NN-bas…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: Accepted in ICCV 2023