Search | arXiv e-print repository

doi 10.1109/ICASSP48485.2024.10447582

A Comparative Analysis of Poetry Reading Audio: Singing, Narrating, or Somewhere In Between?

Abstract: This paper provides a computational analysis of poetry reading audio signals at a large scale to unveil the musicality within professionally-read poems. Although the acoustic characteristics of other types of spoken language have been extensively studied, most of the literature is limited to narrative speech or singing voice, discussing how different they are from each other. In this work, we develop signal processing methods, which are tailored to capture the unique acoustic characteristics of poetry reading based on their silence patterns, temporal variations of local pitch, and beat stability. Our large-scale statistical analyses on three big corpora, each of which consists of narration (LibriSpeech), singing voice (Intonation), and poetry reading (from The Poetry Foundation), discover that poetry reading does share some musical characteristics with singing voice, although it may also resemble narrative speech.

Submitted 31 March, 2024; originally announced April 2024.

Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1296-1300

arXiv:2404.00678 [pdf, other]

OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees

Authors: Hakyeong Kim, Andreas Meuleman, Hyeonjoong Jang, James Tompkin, Min H. Kim

Abstract: We present a method to reconstruct indoor and outdoor static scene geometry and appearance from an omnidirectional video moving in a small circular sweep. This setting is challenging because of the small baseline and large depth ranges, making it difficult to find ray crossings. To better constrain the optimization, we estimate geometry as a signed distance field within a spherical binoctree data structure and use a complementary efficient tree traversal strategy based on a breadth-first search for sampling. Unlike regular grids or trees, the shape of this structure well-matches the camera setting, creating a better memory-quality trade-off. From an initial depth estimate, the binoctree is adaptively subdivided throughout the optimization; previous methods use a fixed depth that leaves the scene undersampled. In comparison with three neural optimization methods and two non-neural methods, ours shows decreased geometry error on average, especially in a detailed scene, while significantly reducing the required number of voxels to represent such details.

Submitted 31 March, 2024; originally announced April 2024.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

arXiv:2404.00676 [pdf, other]

OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos

Authors: Dongyoung Choi, Hyeonjoong Jang, Min H. Kim

Abstract: Omnidirectional cameras are extensively used in various applications to provide a wide field of vision. However, they face a challenge in synthesizing novel views due to the inevitable presence of dynamic objects, including the photographer, in their wide field of view. In this paper, we introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views, removing and inpainting dynamic objects simultaneously. Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays. Our input is an omnidirectional video, and we evaluate the mutual observations of the entire angle between the previous and current frames. To reduce ghosting artifacts of dynamic objects and inpaint occlusions, we devise a multi-resolution motion mask prediction module. Unlike existing methods that primarily separate dynamic components through the temporal domain, our method uses multi-resolution neural feature planes for precise segmentation, which is more suitable for long 360-degree videos. Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics, especially in scenarios with complex real-world scenes. In particular, our approach eliminates the need for manual interaction, such as drawing motion masks by hand and additional pose estimation, making it a highly effective and efficient solution.

Submitted 31 March, 2024; originally announced April 2024.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

arXiv:2403.20225 [pdf, other]

MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark

Authors: Sanghyun Woo, Kwanyong Park, Inkyu Shin, Myungchul Kim, In So Kweon

Abstract: Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields, such as visual surveillance, crowd behavior analysis, and anomaly detection. However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time, weather, and season conditions. This dataset provides a challenging test-bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras, which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, benefiting independent fields such as person detection, re-identification, and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets, models, and test server will be made publicly available.

Submitted 29 March, 2024; originally announced March 2024.

Comments: Accepted on CVPR 2024

arXiv:2403.19904 [pdf, other]

Fully Geometric Panoramic Localization

Authors: Junho Kim, Jiwon Jeong, Young Min Kim

Abstract: We introduce a lightweight and accurate localization method that only utilizes the geometry of 2D-3D lines. Given a pre-captured 3D map, our approach localizes a panorama image, taking advantage of the holistic 360 view. The system mitigates potential privacy breaches or domain discrepancies by avoiding trained or hand-crafted visual descriptors. However, as lines alone can be ambiguous, we express distinctive yet compact spatial contexts from relationships between lines, namely the dominant directions of parallel lines and the intersection between non-parallel lines. The resulting representations are efficient in processing time and memory compared to conventional visual descriptor-based methods. Given the groups of dominant line directions and their intersections, we accelerate the search process to test thousands of pose candidates in less than a millisecond without sacrificing accuracy. We empirically show that the proposed 2D-3D matching can localize panoramas for challenging scenes with similar structures, dramatic domain shifts or illumination changes. Our fully geometric approach does not involve extensive parameter tuning or neural network training, making it a practical algorithm that can be readily deployed in the real world. Project page including the code is available through this link: https://82magnolia.github.io/fgpl/.

Submitted 28 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.19132 [pdf, ps, other]

Meta-Heuristic Fronthaul Bit Allocation for Cell-free Massive MIMO Systems

Authors: Minje Kim, In-soo Kim, Junil Choi

Abstract: Limited capacity of fronthaul links in a cell-free massive multiple-input multiple-output (MIMO) system can cause quantization errors at a central processing unit (CPU) during data transmission, complicating the centralized rate optimization problem. Addressing this challenge, we propose a harmony search (HS)-based algorithm that renders the combinatorial non-convex problem tractable. One of the distinctive features of our algorithm is its hierarchical structure: it first allocates resources at the access point (AP) level and subsequently optimizes for user equipment (UE), ensuring a more efficient and structured approach to resource allocation. Our proposed algorithm deals with rigorous conditions, such as asymmetric fronthaul bit allocation and distinct quantization error levels at each AP, which were not considered in previous works. We derive a closed-form expression of signal-to-interference-plus-noise ratio (SINR), in which additive quantization noise model (AQNM) based distortion error is taken into account, to define the mathematical expression of spectral efficiency (SE) for each UE. Also, we provide analyses on computational complexity and convergence to investigate the practicality of the proposed algorithm. By leveraging various performance metrics such as total SE and max-min fairness, we demonstrate that the proposed algorithm can adaptively optimize the fronthaul bit allocation depending on system requirements. Finally, simulation results show that the proposed algorithm can achieve satisfactory performance while maintaining low computational complexity, as compared to the exhaustive search method.

Submitted 28 March, 2024; originally announced March 2024.

Comments: 16 pages, 13 figures, accepted to IEEE Transactions on Wireless Communications (TWC)

arXiv:2403.18992 [pdf]

Tractography with T1-weighted MRI and associated anatomical constraints on clinical quality diffusion MRI

Authors: Tian Yu, Yunhe Li, Michael E. Kim, Chenyu Gao, Qi Yang, Leon Y. Cai, Susane M. Resnick, Lori L. Beason-Held, Daniel C. Moyer, Kurt G. Schilling, Bennett A. Landman

Abstract: Diffusion MRI (dMRI) streamline tractography, the gold standard for in vivo estimation of brain white matter (WM) pathways, has long been considered indicative of macroscopic relationships with WM microstructure. However, recent advances in tractography demonstrated that convolutional recurrent neural networks (CoRNN) trained with a teacher-student framework have the ability to learn and propagate streamlines directly from T1 and anatomical contexts. Training for this network has previously relied on high-resolution dMRI. In this paper, we generalize the training mechanism to traditional clinical resolution data, which allows generalizability across sensitive and susceptible study populations. We train CoRNN on a small subset of the Baltimore Longitudinal Study of Aging (BLSA), which better resembles clinical protocols. Then, we define a metric, termed the epsilon ball seeding method, to compare T1 tractography and traditional diffusion tractography at the streamline level. Under this metric, T1 tractography generated by CoRNN reproduces diffusion tractography with approximately two millimeters of error.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.16167 [pdf, other]

Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Authors: Minchan Kim, Minyeong Kim, Junik Bae, Suhwan Choi, Sungkyung Kim, Buru Chang

Abstract: Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.

Submitted 5 May, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.16158 [pdf, other]

Korean Bio-Medical Corpus (KBMC) for Medical Named Entity Recognition

Authors: Sungjoo Byun, Jiseung Hong, Sumin Park, Dongjun Jang, Jean Seo, Minseok Kim, Chaeyoung Oh, Hyopil Shin

Abstract: Named Entity Recognition (NER) plays a pivotal role in medical Natural Language Processing (NLP). Yet, there has not been an open-source medical NER dataset specifically for the Korean language. To address this, we utilized ChatGPT to assist in constructing the KBMC (Korean Bio-Medical Corpus), which we are now presenting to the public. With the KBMC dataset, we noticed an impressive 20% increase in medical NER performance compared to models trained on general Korean NER datasets. This research underscores the significant benefits and importance of using specialized tools and datasets, like ChatGPT, to enhance language processing in specialized fields such as healthcare.

Submitted 24 March, 2024; originally announced March 2024.

Journal ref: LREC-COLING 2024

arXiv:2403.14852 [pdf, other]

KeyPoint Relative Position Encoding for Face Recognition

Authors: Minchul Kim, Yiyang Su, Feng Liu, Anil Jain, Xiaoming Liu

Abstract: In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g., facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE, however, can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate the effectiveness in improving face recognition performance from low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.

Submitted 21 March, 2024; originally announced March 2024.

Comments: To appear in CVPR2024

arXiv:2403.14191 [pdf, other]

doi 10.1016/j.compbiomed.2024.108241

PECI-Net: Bolus segmentation from video fluoroscopic swallowing study images using preprocessing ensemble and cascaded inference

Authors: Dougho Park, Younghun Kim, Harim Kang, Junmyeoung Lee, Jinyoung Choi, Taeyeon Kim, Sangeok Lee, Seokil Son, Minsol Kim, Injung Kim

Abstract: Bolus segmentation is crucial for the automated detection of swallowing disorders in videofluoroscopic swallowing studies (VFSS). However, it is difficult for the model to accurately segment a bolus region in a VFSS image because VFSS images are translucent, have low contrast and unclear region boundaries, and lack color information. To overcome these challenges, we propose PECI-Net, a network architecture for VFSS image analysis that combines two novel techniques: the preprocessing ensemble network (PEN) and the cascaded inference network (CIN). PEN enhances the sharpness and contrast of the VFSS image by combining multiple preprocessing algorithms in a learnable way. CIN reduces ambiguity in bolus segmentation by using context from other regions through cascaded inference. Moreover, CIN prevents undesirable side effects from unreliably segmented regions by referring to the context in an asymmetric way. In experiments, PECI-Net exhibited higher performance than four recently developed baseline models, outperforming TernausNet, the best among the baseline models, by 4.54% and the widely used UNet by 10.83%. The results of the ablation studies confirm that CIN and PEN are effective in improving bolus segmentation performance.

Submitted 21 March, 2024; originally announced March 2024.

Comments: 20 pages, 8 figures

Journal ref: Computers in Biology and Medicine (2024)

arXiv:2403.12862 [pdf, other]

Epistemology of Language Models: Do Language Models Have Holistic Knowledge?

Authors: Minsu Kim, James Thorne

Abstract: This paper investigates the inherent knowledge in language models from the perspective of epistemological holism. The purpose of this paper is to explore whether LLMs exhibit characteristics consistent with epistemological holism. These characteristics suggest that core knowledge, such as general scientific knowledge, each plays a specific role, serving as the foundation of our knowledge system and being difficult to revise. To assess these traits related to holism, we created a scientific reasoning dataset and examined the epistemology of language models through three tasks: Abduction, Revision, and Argument Generation. In the abduction task, the language models explained situations while avoiding revising the core knowledge. However, in other tasks, the language models were revealed not to distinguish between core and peripheral knowledge, showing an incomplete alignment with holistic knowledge principles.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12794 [pdf, other]

Optical Atomic Clock Interrogation Via an Integrated Spiral Cavity Laser

Authors: William Loh, David Reens, Dave Kharas, Alkesh Sumant, Connor Belanger, Ryan T. Maxson, Alexander Medeiros, William Setzer, Dodd Gray, Kyle DeBry, Colin D. Bruzewicz, Jason Plant, John Liddell, Gavin N. West, Sagar Doshi, Matthew Roychowdhury, May Kim, Danielle Braje, Paul W. Juodawlkis, John Chiaverini, Robert McConnell

Abstract: Optical atomic clocks have demonstrated revolutionary advances in precision timekeeping, but their applicability to the real world is critically dependent on whether such clocks can operate outside a laboratory setting. The challenge to clock portability stems from the many obstacles not only in miniaturizing the underlying components of the clock $-$ namely the ultrastable laser, the frequency comb, and the atomic reference itself $-$ but also in making the clock resilient to environmental fluctuations. Photonic integration offers one compelling solution to simultaneously address the problems of miniaturization and ruggedization, but brings with it a new set of challenges in recreating the functionality of an optical clock using chip-scale building blocks. The clock laser used for atom interrogation is one particular point of uncertainty, as the performance of the meticulously-engineered bulk-cavity stabilized lasers would be exceptionally difficult to transfer to chip. Here we demonstrate that a chip-integrated ultrahigh quality factor (Q) spiral cavity, when interfaced with a 1348 nm seed laser, reaches a fractional frequency instability of $7.5 \times 10^{-14}$, meeting the stability requirements for interrogating the narrow-linewidth transition of $^{88}$Sr$^+$ upon frequency doubling to 674 nm. In addition to achieving the record for laser stability on chip, we use this laser to showcase the operation of a Sr-ion clock with short-term instability averaging down as $3.9 \times 10^{-14} / \sqrt{\tau}$, where $\tau$ is the averaging time. Our demonstration of an optical atomic clock interrogated by an integrated spiral cavity laser opens the door for future advanced clock systems to be entirely constructed using lightweight, portable, and mass-manufacturable integrated optics and electronics.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.11472 [pdf, other]

Accelerating String-Key Learned Index Structures via Memoization-based Incremental Training

Authors: Minsu Kim, Jinwoo Hwang, Guseul Heo, Seiyeon Cho, Divya Mahajan, Jongse Park

Abstract: Learned indexes use machine learning models to learn the mappings between keys and their corresponding positions in key-value indexes. These indexes use the mapping information as training data. Learned indexes require frequent retrainings of their models to incorporate the changes introduced by update queries. To efficiently retrain the models, existing learned index systems often harness a linear algebraic QR factorization technique that performs matrix decomposition. This factorization approach processes all key-position pairs during each retraining, resulting in compute operations that grow linearly with the total number of keys and their lengths. Consequently, the retrainings create a severe performance bottleneck, especially for variable-length string keys, while the retrainings are crucial for maintaining high prediction accuracy and in turn, ensuring low query service latency. To address this performance problem, we develop an algorithm-hardware co-designed string-key learned index system, dubbed SIA. In designing SIA, we leverage a unique algorithmic property of the matrix decomposition-based training method. Exploiting the property, we develop a memoization-based incremental training scheme, which only requires computation over updated keys, while decomposition results of non-updated keys from previous computations can be reused. We further enhance SIA to offload a portion of this training process to an FPGA accelerator to not only relieve CPU resources for serving index queries (i.e., inference), but also accelerate the training itself. Our evaluation shows that compared to ALEX, LIPP, and SIndex, state-of-the-art learned index systems, SIA-accelerated learned indexes offer 2.6x and 3.4x higher throughput on the two real-world benchmark suites, YCSB and Twitter cache trace, respectively.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted at VLDB '24; 12 pages + 2 pages (ref), 18 figures, 2 tables

arXiv:2403.11399 [pdf, other]

X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

Authors: Dongjae Shin, Hyeonseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim

Abstract: The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.

Submitted 1 April, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.11382 [pdf, other]

Topological singularity-induced self-energy in strongly correlated fermion systems

Authors: Byungkyun Kang, Zachary Brown, Myoung-Hwan Kim, Hyunsoo Kim, Chul Hong Park

Abstract: Employing ab initio many-body perturbation theory combined with dynamical mean field theory, we discovered that in strongly correlated topological semimetals HoPtBi and PrAlGe, which exhibit topological singular points in the vicinity of the Fermi level, the formation of 4$f$ quasiparticles is forbidden. We show that blocking hybridization channels at the topological singular point effectively enhances on-site Coulomb repulsion, resulting in a substantial self-energy. This renders the topological singular point incompatible with the presence of strongly correlated electrons at the Fermi level. In contrast to the Kondo effect, our findings suggest that the topological quasiparticles in close proximity to the singular points do not hybridize with 4$f$ electrons due to the self-energy, thus hindering the manifestation of heavy-fermion behavior when the singular points persist at the Fermi level.

Submitted 6 May, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.10494 [pdf, other]

Lifelong LERF: Local 3D Semantic Inventory Monitoring Using FogROS2

Authors: Adam Rashid, Chung Min Kim, Justin Kerr, Letian Fu, Kush Hari, Ayah Ahmad, Kaiyuan Chen, Huang Huang, Marcus Gualtieri, Michael Wang, Christian Juette, Nan Tian, Liu Ren, Ken Goldberg

Abstract: Inventory monitoring in homes, factories, and retail stores relies on maintaining data despite objects being swapped, added, removed, or moved. We introduce Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings. Lifelong LERF maintains this representation over time by detecting semantic changes and selectively updating these regions of the environment, avoiding the need to exhaustively remap. Human users can query inventory by providing natural language queries and receiving a 3D heatmap of potential object locations. To manage the computational load, we use Fog-ROS2, a cloud robotics platform, to offload resource-intensive tasks. Lifelong LERF obtains poses from a monocular RGBD SLAM backend, and uses these poses to progressively optimize a Language Embedded Radiance Field (LERF) for semantic monitoring. Experiments with 3-5 objects arranged on a tabletop and a Turtlebot with a RealSense camera suggest that Lifelong LERF can persistently adapt to changes in objects with up to 91% accuracy.

Submitted 15 March, 2024; originally announced March 2024.

Comments: See project webpage at: https://sites.google.com/berkeley.edu/lifelonglerf/home

arXiv:2403.09967 [pdf, other]

NR-Surface: NextG-ready $μ$W-reconfigurable mmWave Metasurface

Authors: Minseok Kim, Namjo Ahn, Song Min Kim

Abstract: Metasurfaces have recently emerged as an economical solution to expand mmWave coverage. However, their pervasive deployment remains a challenge, mainly due to the difficulty in reaching the tight 260 ns NR synchronization requirement and real-time wireless reconfiguration while maintaining multi-year battery life. This paper presents NR-Surface, the first real-time reconfigurable metasurface fully compliant with the NR standard, operating at 242.7 $μ$W for a 2.1-year lifetime on an AA battery. NR-Surface incorporates (i) a new extremely low-power (14 kHz sampling) reconfiguration interface, NarrowBand Packet Unit (NBPU), for synchronization and real-time reconfiguration, and (ii) a highly responsive and low-leakage metasurface designed for low-duty cycled operation, by carefully leveraging the structure and the periodicity of the NR beam management procedure in the NR standard. NR-Surface is prototyped and evaluated end-to-end with NR BS built on srsRAN to demonstrate diverse usage scenarios including multiple NR-Surfaces per BS, multiple UEs per NR-Surface, and 3D beamforming. Around-the-corner UE evaluations showcase NR-Surface efficacy under different user mobility patterns (20.3 dB gain) and dynamic blockage (22.2 dB gain).

Submitted 14 March, 2024; originally announced March 2024.

Comments: 17 pages, 28 figures, to be published in NSDI '24

arXiv:2403.09508 [pdf, other]

SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Authors: Jeonghyeok Do, Munchurl Kim

Abstract: Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.

Submitted 14 March, 2024; originally announced March 2024.

Comments: Please visit our project page at https://jeonghyeokdo.github.io/SkateFormer_site/

arXiv:2403.08827 [pdf, other]

Locational Scenario-based Pricing in a Bilateral Distribution Energy Market under Uncertainty

Authors: Hien Thanh Doan, Minsoo Kim, Keunju Song, Hongseok Kim

Abstract: In recent years, there has been a significant focus on advancing the next generation of power systems. Despite these efforts, persistent challenges revolve around addressing the operational impact of uncertainty on predicted data, especially concerning economic dispatch and optimal power flow. To tackle these challenges, we introduce a stochastic day-ahead scheduling approach for a community. This method involves iterative improvements in economic dispatch and optimal power flow, aiming to minimize operational costs by incorporating quantile forecasting. Then, we present a real-time market and payment problem to handle optimization in real-time decision-making and payment calculation. We assess the effectiveness of our proposed method against benchmark results and conduct a test using data from 50 real households to demonstrate its practicality. Furthermore, we compare our method with existing studies in the field across two different seasons of the year. In the summer season, our method decreases the optimality gap by 60% compared to the baseline, and in the winter season, it reduces the optimality gap by 67%. Moreover, our proposed method mitigates the congestion of the distribution network caused by uncertain energy by 16.7% within a day, which is a crucial aspect for implementing energy markets in the real world.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2403.08302 [pdf, other]

Online Multi-Contact Feedback Model Predictive Control for Interactive Robotic Tasks

Authors: Seo Wook Han, Maged Iskandar, **oh Lee, Min Jun Kim

Abstract: In this paper, we propose a model predictive control (MPC) scheme that accomplishes interactive robotic tasks in which multiple contacts may occur at unknown locations. To address such scenarios, we build an explicit contact feedback loop into the MPC framework. An algorithm called Multi-Contact Particle Filter with Exploration Particle (MCP-EP) is employed to establish real-time feedback of multi-contact information. Then the interaction locations and forces are accommodated in the MPC framework via a spring contact model. Moreover, we achieved real-time control for a 7-degree-of-freedom robot without any simplifying assumptions by employing a Differential-Dynamic-Programming algorithm. We achieved 6.8 kHz, 1.9 kHz, and 1.8 kHz update rates of the MPC for 0, 1, and 2 contacts, respectively. This allows the robot to handle unexpected contacts in real time. Real-world experiments show the effectiveness of the proposed method in various scenarios.

Submitted 13 March, 2024; originally announced March 2024.

Comments: This paper has been accepted for publication at the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, 2024

arXiv:2403.08277 [pdf, other]

VIGFace: Virtual Identity Generation Model for Face Image Synthesis

Authors: Minsoo Kim, Min-Cheol Sagong, Gi Pyo Nam, Junghyun Cho, Ig-Jae Kim

Abstract: Deep learning-based face recognition continues to face challenges due to its reliance on huge datasets obtained from web crawling, which can be costly to gather and raise significant real-world privacy concerns. To address this issue, we propose VIGFace, a novel framework capable of generating synthetic facial images. Initially, we train the face recognition model using a real face dataset and create a feature space for both real and virtual IDs where virtual prototypes are orthogonal to other prototypes. Subsequently, we generate synthetic images by using the diffusion model based on the feature space. Our proposed framework provides two significant benefits. Firstly, it allows for creating virtual facial images without concerns about portrait rights, guaranteeing that the generated virtual face images are clearly differentiated from existing individuals. Secondly, it serves as an effective augmentation method by incorporating real existing images. Further experiments demonstrate the efficacy of our framework, achieving state-of-the-art results from both perspectives without any external data.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.08272 [pdf, other]

RECIPE4U: Student-ChatGPT Interaction Dataset in EFL Writing Education

Authors: Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Tak Yeon Lee, So-Yeon Ahn, Alice Oh

Abstract: The integration of generative AI in education is expanding, yet empirical analyses of large-scale, real-world interactions between students and AI systems remain limited. Addressing this gap, we present RECIPE4U (RECIPE for University), a dataset sourced from a semester-long experiment with 212 college students in English as a Foreign Language (EFL) writing courses. During the study, students engaged in dialogues with ChatGPT to revise their essays. RECIPE4U includes comprehensive records of these interactions, including conversation logs, students' intent, students' self-rated satisfaction, and students' essay edit histories. In particular, we annotate the students' utterances in RECIPE4U with 13 intention labels based on our coding schemes. We establish baseline results for two subtasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. As a foundational step, we explore student-ChatGPT interaction patterns through RECIPE4U and analyze them by focusing on students' dialogue, essay data statistics, and students' essay edits. We further illustrate potential applications of the RECIPE4U dataset for enhancing the incorporation of LLMs in educational frameworks. RECIPE4U is publicly available at https://zeunie.github.io/RECIPE4U/.

Submitted 13 March, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2309.13243

arXiv:2403.08262 [pdf, other]

BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image

Authors: Minje Kim, Tae-Kyun Kim

Abstract: Creating personalized hand avatars is important to offer a realistic experience to users on AR / VR platforms. While most prior studies focused on reconstructing 3D hand shapes, some recent work has tackled the reconstruction of hand textures on top of shapes. However, these methods are often limited to capturing pixels on the visible side of a hand, requiring diverse views of the hand in a video or multiple images as input. In this paper, we propose a novel method, BiTT (Bi-directional Texture reconstruction of Two hands), which is the first end-to-end trainable method for relightable, pose-free texture reconstruction of two interacting hands taking only a single RGB image, by three novel components: 1) bi-directional (left $\leftrightarrow$ right) texture reconstruction using the texture symmetry of left / right hands, 2) utilizing a texture parametric model for hand texture recovery, and 3) the overall coarse-to-fine stage pipeline for reconstructing personalized texture of two interacting hands. BiTT first estimates the scene light condition and albedo image from an input image, then reconstructs the texture of both hands through the texture parametric model and bi-directional texture reconstructor. In experiments using InterHand2.6M and RGB2Hands datasets, our method significantly outperforms state-of-the-art hand texture reconstruction methods quantitatively and qualitatively. The code is available at https://github.com/yunmin**2/BiTT

Submitted 25 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024, Project Page: https://yunmin**2.github.io/projects/bitt/

arXiv:2403.08256 [pdf, other]

IG-FIQA: Improving Face Image Quality Assessment through Intra-class Variance Guidance robust to Inaccurate Pseudo-Labels

Authors: Minsoo Kim, Gi Pyo Nam, Haksub Kim, Haesol Park, Ig-Jae Kim

Abstract: In the realm of face image quality assessment (FIQA), methods based on sample relative classification have shown impressive performance. However, in these methods, the quality scores used as pseudo-labels, when assigned from images of classes with low intra-class variance, could be unrelated to the actual quality. To address this issue, we present IG-FIQA, a novel approach to guide FIQA training, introducing a weight parameter to alleviate the adverse impact of these classes. This method involves estimating sample intra-class variance at each iteration during training, ensuring minimal computational overhead and straightforward implementation. Furthermore, this paper proposes an on-the-fly data augmentation methodology for improved generalization performance in FIQA. On various benchmark datasets, our proposed method, IG-FIQA, achieved novel state-of-the-art (SOTA) performance.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.08187 [pdf, other]

Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children

Authors: Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam

Abstract: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children with SSDs is impractical. We fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as pronounced rather than as existing words. The model was fine-tuned with a speech dataset from 137 children with inadequate speech production pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the pronunciations of the words matched the human annotations with about 90% accuracy. While the model still requires improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation error diagnostic procedures in clinical fields.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 12 pages, 2 figures

ACM Class: I.2.7

arXiv:2403.07041 [pdf, other]

Ant Colony Sampling with GFlowNets for Combinatorial Optimization

Authors: Minsu Kim, Sanghyeok Choi, Hyeonah Kim, Jiwoo Son, **kyoo Park, Yoshua Bengio

Abstract: This paper introduces the Generative Flow Ant Colony Sampler (GFACS), a neural-guided probabilistic search algorithm for solving combinatorial optimization (CO). GFACS integrates generative flow networks (GFlowNets), an emerging amortized inference method, with ant colony optimization (ACO), a promising probabilistic search algorithm. Specifically, we use GFlowNets to learn a constructive policy in combinatorial spaces for enhancing ACO by providing an informed prior distribution over decision variables conditioned on input graph instances. Furthermore, we introduce a novel off-policy training algorithm for scaling conditional GFlowNets into large-scale combinatorial spaces by leveraging local search and shared energy normalization. Our experimental results demonstrate that GFACS outperforms baseline ACO algorithms in seven CO tasks and is competitive with problem-specific heuristics for vehicle routing problems.

Submitted 22 May, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 23 pages, 5 figures

arXiv:2403.05252 [pdf, other]

Quantum error cancellation in photonic systems -- undoing photon losses

Authors: Adam Taylor, Gabriele Bressanini, Hyukjoon Kwon, M. S. Kim

Abstract: Real photonic devices are subject to photon losses that can decohere quantum information encoded in the system. In the absence of full fault tolerance, quantum error mitigation techniques have been introduced to help manage errors in noisy quantum devices. In this work, we introduce an error mitigation protocol inspired by probabilistic error cancellation (a popular error mitigation technique in discrete variable systems) for continuous variable systems. We show that our quantum error cancellation protocol can undo photon losses in expectation value estimation tasks. To do this, we analytically derive the (non-physical) inverse photon loss channel and decompose it into a sum over physically realisable channels with potentially negative coefficients. The bias of our ideal expectation value estimator can be made arbitrarily small at the cost of increasing the sampling overhead. The protocol requires a noiseless amplification followed by a series of photon-subtractions. While these operations can be implemented probabilistically, for certain classes of initial state one can avoid the burden of carrying out the amplification and photon-subtractions by leveraging Monte-Carlo methods to give an unbiased estimate of the ideal expectation value. We validate our proposed mitigation protocol by simulating the scheme on squeezed vacuum states, cat states and entangled coherent states.

Submitted 28 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

Comments: Comments welcome. 22 pages, 10 figures

arXiv:2403.04460 [pdf, other]

Pearl: A Review-driven Persona-Knowledge Grounded Conversational Recommendation Dataset

Authors: Min** Kim, Minju Kim, Hana Kim, Beong-woo Kwak, Soyeon Chun, Hyunseo Kim, SeongKu Kang, Youngjae Yu, **young Yeo, Dongha Lee

Abstract: Conversational recommender systems are an emerging area that has garnered increasing interest in the community, especially with the advancements in large language models (LLMs) that enable diverse reasoning over conversational input. Despite the progress, the field has many aspects left to explore. The currently available public datasets for conversational recommendation lack specific user preferences and explanations for recommendations, hindering high-quality recommendations. To address such challenges, we present a novel conversational recommendation dataset named PEARL, synthesized with persona- and knowledge-augmented LLM simulators. We obtain detailed persona and knowledge from real-world reviews and construct a large-scale dataset with over 57k dialogues. Our experimental results demonstrate that utterances in PEARL include more specific user preferences, show expertise in the target domain, and provide recommendations more relevant to the dialogue context than those in prior datasets.

Submitted 8 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Published at ACL 2024 Findings

arXiv:2403.03919 [pdf, other]

Multi-parameter quantum estimation of single- and two-mode pure Gaussian states

Authors: Gabriele Bressanini, Marco G. Genoni, M. S. Kim, Matteo G. A. Paris

Abstract: We discuss the ultimate precision bounds on the multiparameter estimation of single- and two-mode pure Gaussian states. By leveraging previous approaches that focused on the estimation of a complex displacement only, we derive the Holevo Cramér-Rao bound (HCRB) for both the displacement and squeezing parameters characterizing single- and two-mode squeezed states. In the single-mode scenario, we obtain an analytical bound and find that it degrades monotonically as the squeezing increases. Furthermore, we prove that heterodyne detection is nearly optimal in the large squeezing limit, but in general the optimal measurement must include non-Gaussian resources. On the other hand, in the two-mode setting, the HCRB improves as the squeezing parameter grows and we show that it can be attained using double-homodyne detection.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.03693 [pdf, other]

Operational Space and Plasma Performance with an RMP-ELM Suppressed Edge

Authors: C. Paz-Soldan, S. Gu, N. Leuthold, P. Lunia, P. Xie, M. W. Kim, S. K. Kim, N. C. Logan, J. -K. Park, W. Suttrop, Y. Sun, D. B. Weisberg, M. Willensdorfer, the ASDEX-Upgrade, DIII-D, EAST, KSTAR Teams

Abstract: The operational space and global performance of plasmas with edge-localized modes (ELMs) suppressed by resonant magnetic perturbations (RMPs) are surveyed by comparing AUG, DIII-D, EAST, and KSTAR stationary operating points. RMP-ELM suppression is achieved over a range of plasma currents, toroidal fields, and RMP toroidal mode numbers. Consistent operational windows in edge safety factor are found across devices, while windows in plasma shaping parameters are distinct. Accessed pedestal parameters reveal a quantitatively similar pedestal-top density limit for RMP-ELM suppression in all devices of just over $3\times10^{19}$ m$^{-3}$. This is surprising given the wide variance of many engineering parameters and edge collisionalities, and poses a challenge to extrapolation of the regime. Wide ranges in input power, confinement time, and stored energy are observed, with the achieved triple product found to scale like the product of current, field, and radius. Observed energy confinement scalings with engineering parameters for RMP-ELM suppressed plasmas are presented and compared with expectations from established H- and L-mode scalings, including treatment of uncertainty analysis. Different scaling exponents for individual engineering parameters are found as compared to the established scalings. However, extrapolation to next-step tokamaks ITER and SPARC finds overall consistency within uncertainties with the established scalings, finding no obvious performance penalty when extrapolating from the assembled multi-device RMP-ELM suppressed database.
Overall this work identifies common physics for RMP-ELM suppression and highlights the need to pursue this no-ELM regime at higher magnetic field and different plasma physical size.

Submitted 6 March, 2024; originally announced March 2024.

Comments: 22 pages, 11 figures

arXiv:2403.03368 [pdf, other]

Leveraging Federated Learning for Automatic Detection of Clopidogrel Treatment Failures

Authors: Samuel Kim, Min Sang Kim

Abstract: The effectiveness of clopidogrel, a widely used antiplatelet medication, varies significantly among individuals, necessitating the development of precise predictive models to optimize patient care. In this study, we leverage federated learning strategies to address clopidogrel treatment failure detection. Our research harnesses the collaborative power of multiple healthcare institutions, allowing them to jointly train machine learning models while safeguarding sensitive patient data. Utilizing the UK Biobank dataset, which encompasses a vast and diverse population, we partitioned the data based on geographic centers and evaluated the performance of federated learning. Our results show that while centralized training achieves higher Area Under the Curve (AUC) values and faster convergence, federated learning approaches can substantially narrow this performance gap. Our findings underscore the potential of federated learning in addressing clopidogrel treatment failure detection, offering a promising avenue for enhancing patient care through personalized treatment strategies while respecting data privacy. This study contributes to the growing body of research on federated learning in healthcare and lays the groundwork for secure and privacy-preserving predictive models for various medical conditions.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.03004 [pdf, other]

Ultralight vector dark matter search using data from the KAGRA O3GK run

Authors: The LIGO Scientific Collaboration, the Virgo Collaboration, the KAGRA Collaboration, A. G. Abac, R. Abbott, H. Abe, I. Abouelfettouh, F. Acernese, K. Ackley, C. Adamcewicz, S. Adhicary, N. Adhikari, R. X. Adhikari, V. K. Adkins, V. B. Adya, C. Affeldt, D. Agarwal, M. Agathos, O. D. Aguiar, I. Aguilar, L. Aiello, A. Ain, P. Ajith, T. Akutsu, S. Albanesi , et al. (1778 additional authors not shown)

Abstract: Among the various candidates for dark matter (DM), ultralight vector DM can be probed by laser interferometric gravitational wave detectors through the measurement of oscillating length changes in the arm cavities. In this context, KAGRA has a unique feature due to differing compositions of its mirrors, enhancing the signal of vector DM in the length change in the auxiliary channels. Here we present the result of a search for $U(1)_{B-L}$ gauge boson DM using the KAGRA data from auxiliary length channels during the first joint observation run together with GEO600. By applying our search pipeline, which takes into account the stochastic nature of ultralight DM, upper bounds on the coupling strength between the $U(1)_{B-L}$ gauge boson and ordinary matter are obtained for a range of DM masses. While our constraints are less stringent than those derived from previous experiments, this study demonstrates the applicability of our method to the lower-mass vector DM search, which is made difficult in this measurement by the short observation time compared to the auto-correlation time scale of DM.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 20 pages, 5 figures

Report number: LIGO-P2300250

arXiv:2403.02944 [pdf, other]

Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity

Authors: Hagyeong Lee, Minkyu Kim, Jun-Hyuk Kim, Seungeon Kim, Dokwan Oh, Jaeho Lee

Abstract: Recent advances in text-guided image compression have shown great potential to enhance the perceptual quality of reconstructed images. These methods, however, tend to have significantly degraded pixel-wise fidelity, limiting their practicality. To fill this gap, we develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity. In particular, we propose a compression framework that leverages text information mainly by text-adaptive encoding and training with joint image-text loss. By doing so, we avoid decoding based on text-guided generative models -- known for high generative diversity -- and effectively utilize the semantic information of text at a global level. Experimental results on various datasets show that our method can achieve high pixel-level and perceptual quality, with either human- or machine-generated captions. In particular, our method outperforms all baselines in terms of LPIPS, with some room for even more improvements when we use more carefully generated captions.

Submitted 21 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: The first two authors contributed equally

arXiv:2403.02734 [pdf, ps, other]

doi 10.1016/j.apsusc.2024.159801

Strain tunable electronic ground states in two-dimensional iridate thin films

Authors: Donghan Kim, Byungmin Sohn, Yeonjae Lee, Jeongkeun Song, Mi Kyung Kim, Minjae Kim, Tae Won Noh, Changyoung Kim

Abstract: Quantum phases of matter such as superconducting, ferromagnetic and Wigner crystal states are often driven by the two-dimensionality (2D) of correlated systems. Meanwhile, spin-orbit coupling (SOC) is a fundamental element leading to nontrivial topology which gives rise to quantum phenomena such as the large anomalous Hall effect and nontrivial superconductivity. However, the search for controllable platforms with both 2D and SOC has been relatively overlooked so far. Here, we control and study the electronic ground states of iridate ultrathin films having both 2D and SOC by angle-resolved photoemission spectroscopy (ARPES) and dynamical mean field theory (DMFT) calculations. The metallicity of SrIrO$_3$ ultrathin films is controlled down to a monolayer by dimensional and strain manipulation. Our results suggest that the iridate ultrathin films can be a controllable 2D SOC platform exhibiting a variety of phenomena for future functional devices.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 7 pages, 4 figures

arXiv:2403.01861 [pdf, other]

doi 10.1109/LRA.2024.3375117

AiSDF: Structure-aware Neural Signed Distance Fields in Indoor Scenes

Authors: Jaehoon Jang, Inha Lee, Minje Kim, Kyungdon Joo

Abstract: Indoor scenes we are living in are visually homogenous or textureless, while they inherently have structural forms and provide enough structural priors for 3D scene reconstruction. Motivated by this fact, we propose a structure-aware online signed distance fields (SDF) reconstruction framework in indoor scenes, especially under the Atlanta world (AW) assumption. Thus, we dub this incremental SDF reconstruction for AW as AiSDF. Within the online framework, we infer the underlying Atlanta structure of a given scene and then estimate planar surfel regions supporting the Atlanta structure. This Atlanta-aware surfel representation provides an explicit planar map for a given scene. In addition, based on these Atlanta planar surfel regions, we adaptively sample and constrain the structural regularity in the SDF reconstruction, which enables us to improve the reconstruction quality by maintaining a high-level structure while enhancing the details of a given scene. We evaluate the proposed AiSDF on the ScanNet and ReplicaCAD datasets, where we demonstrate that the proposed framework is capable of reconstructing fine details of objects implicitly, as well as structures explicitly in room-scale scenes.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 8 pages, 6 figures, Accepted to IEEE RA-L (First two authors contributed equally)

Journal ref: IEEE Robotics and Automation Letters (RA-L), vol. 9, no. 5, pp. 4106-4113, 2024

arXiv:2403.01469 [pdf, other]

KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations

Authors: Sunjun Kweon, Byung** Choi, Minkyu Kim, Rae Woong Park, Edward Choi

Abstract: We introduce KorMedMCQA, the first Korean multiple-choice question answering (MCQA) benchmark derived from Korean healthcare professional licensing examinations, covering from the year 2012 to year 2023. This dataset consists of a selection of questions from the license examinations for doctors, nurses, and pharmacists, featuring a diverse array of subjects. We conduct baseline experiments on various large language models, including proprietary/open-source, multilingual/Korean-additional pretrained, and clinical context pretrained models, highlighting the potential for further enhancements. We make our data publicly available on HuggingFace (https://huggingface.co/datasets/sean0042/KorMedMCQA) and provide an evaluation script via LM-Harness, inviting further exploration and advancement in Korean healthcare environments.

Submitted 5 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

arXiv:2403.00398 [pdf, other]

Learning Quadrupedal Locomotion with Impaired Joints Using Random Joint Masking

Authors: Mincheol Kim, Ukcheol Shin, Jung-Yup Kim

Abstract: Quadrupedal robots have played a crucial role in various environments, from structured environments to complex harsh terrains, thanks to their agile locomotion ability. However, these robots can easily lose their locomotion functionality if damaged by external accidents or internal malfunctions. In this paper, we propose a novel deep reinforcement learning framework to enable a quadrupedal robot to walk with impaired joints. The proposed framework consists of three components: 1) a random joint masking strategy for simulating impaired joint scenarios, 2) a joint state estimator to predict an implicit status of current joint condition based on past observation history, and 3) progressive curriculum learning to allow a single network to conduct both normal gait and various joint-impaired gaits. We verify that our framework enables the Unitree's Go1 robot to walk under various impaired joint conditions in real-world indoor and outdoor environments.

Submitted 1 March, 2024; originally announced March 2024.

Comments: To appear at ICRA 2024, Project page: https://sites.google.com/view/learning-impaired-joints-loco

arXiv:2402.18778 [pdf, other]

X-ResQ: Reverse Annealing for Quantum MIMO Detection with Flexible Parallelism

Authors: Minsung Kim, Abhishek Kumar Singh, Davide Venturelli, John Kaewell, Kyle Jamieson

Abstract: Quantum Annealing (QA)-accelerated MIMO detection is an emerging research approach in the context of NextG wireless networks. The opportunity is to enable large MIMO systems and thus improve wireless performance. The approach aims to leverage QA to expedite the computation required for theoretically optimal but computationally-demanding Maximum Likelihood detection to overcome the limitations of the currently deployed linear detectors. This paper presents X-ResQ, a QA-based MIMO detector system featuring fine-grained quantum task parallelism that is uniquely enabled by the Reverse Annealing (RA) protocol. Unlike prior designs, X-ResQ has many desirable system properties for a parallel QA detector and has effectively improved detection performance as more qubits are assigned. In our evaluations on a state-of-the-art quantum annealer, fully parallel X-ResQ achieves near-optimal throughput (over 10 bits/s/Hz) for $4\times6$ MIMO with 16-QAM using six levels of parallelism with 240 qubits and $220~\mu$s QA compute time, achieving 2.5--5$\times$ gains compared against other tested detectors. For more comprehensive evaluations, we implement and evaluate X-ResQ in the non-quantum digital setting. This non-quantum X-ResQ demonstration showcases the potential to realize ultra-large $1024\times1024$ MIMO, significantly outperforming other MIMO detectors, including the state-of-the-art RA detector classically implemented in the same way.

Submitted 9 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: 22 pages

arXiv:2402.18372 [pdf, other]

FedUV: Uniformity and Variance for Heterogeneous Federated Learning

Authors: Ha Min Son, Moon-Hyun Kim, Tai-Myoung Chung, Chao Huang, Xin Liu

Abstract: Federated learning is a promising framework to train neural networks with widely distributed data. However, performance degrades heavily with heterogeneously distributed data. Recent work has shown this is due to the final layer of the network being most prone to local bias, some finding success freezing the final layer as an orthogonal classifier. We investigate the training dynamics of the classifier by applying SVD to the weights motivated by the observation that freezing weights results in constant singular values. We find that there are differences when training in IID and non-IID settings. Based on this finding, we introduce two regularization terms for local training to continuously emulate IID settings: (1) variance in the dimension-wise probability distribution of the classifier and (2) hyperspherical uniformity of representations of the encoder. These regularizations promote local models to act as if it were in an IID setting regardless of the local data distribution, thus offsetting proneness to bias while being flexible to the data. On extensive experiments in both label-shift and feature-shift settings, we verify that our method achieves highest performance by a large margin especially in highly non-IID cases in addition to being scalable to larger models and datasets.

Submitted 1 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 11 pages, 4 figures, 5 tables, to appear at CVPR 2024

arXiv:2402.18351 [pdf, other]

LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping

Authors: Changho Choi, Minho Kim, Junhyeok Lee, Hyoung-Kyu Song, Younggeun Kim, Seungryong Kim

Abstract: We propose LatentSwap, a simple face swapping framework generating a face swap latent code of a given generator. Utilizing randomly sampled latent codes, our framework is light and does not require datasets besides employing the pre-trained models, with the training procedure also being fast and straightforward. The loss objective consists of only three terms, and can effectively control the face swap results between source and target images. By attaching a pre-trained GAN inversion model independent to the model and using the StyleGAN2 generator, our model produces photorealistic and high-resolution images comparable to other competitive face swap models. We show that our framework is applicable to other generators such as StyleNeRF, paving a way to 3D-aware face swapping and is also compatible with other downstream StyleGAN2 generator tasks. The source code and models can be found at https://github.com/usingcolor/LatentSwap.

Submitted 2 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: 9 pages, 11 figures

arXiv:2402.17984 [pdf, other]

Sampling low-fidelity outputs for estimation of high-fidelity density and its tails

Authors: Minji Kim, Vladas Pipiras, Kevin O'Connor, Themistoklis Sapsis

Abstract: In a multifidelity setting, data are available under the same conditions from two (or more) sources, e.g. computer codes, one being lower-fidelity but computationally cheaper, and the other higher-fidelity and more expensive. This work studies for which low-fidelity outputs, one should obtain high-fidelity outputs, if the goal is to estimate the probability density function of the latter, especially when it comes to the distribution tails and extremes. It is suggested to approach this problem from the perspective of the importance sampling of low-fidelity outputs according to some proposal distribution, combined with special considerations for the distribution tails based on extreme value theory. The notion of an optimal proposal distribution is introduced and investigated, in both theory and simulations. The approach is motivated and illustrated with an application to estimate the probability density function of record extremes of ship motions, obtained through two computer codes of different fidelities.

Submitted 27 February, 2024; originally announced February 2024.

Comments: 32 pages, 11 figures, 2 tables

arXiv:2402.17267 [pdf, other]

Narrowband THz Emission from a Plasma Oscillator Imbedded in a Plasma Density Gradient

Authors: Manoj Kumar, Bernhard Ersfeld, Jaeho Lee, Dohyun Park, Seungyun Kim, Inhyuk Nam, Minseok Kim, Seong** Jeon, Dino A. Jaroszynski, Hyyong Suk, Min Sup Hur

Abstract: A novel method is presented for generating radiation using the beat wave associated with a bi-frequency laser pulse, to excite plasma oscillations in a plasma slab with a density gradient. By resonantly exciting a plasma wave, it can be localised and transformed into a plasma oscillator that produces a beam of radially polarised terahertz radiation. Particle-in-cell simulations and analytic theory are used to demonstrate its main characteristics, which includes narrow bandwidth. The radiator should have useful applications such as terahertz-band particle accelerators and pump-probe experiments.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.16785 [pdf, other]

CARTE: Pretraining and Transfer for Tabular Learning

Authors: Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux

Abstract: Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences, correspondences in the entries (entity matching) where different words may denote the same entity, correspondences across columns (schema matching), which may come in different orders, names... We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture -- CARTE for Context Aware Representation of Table Entries -- uses a graph representation of tabular (or relational) data to process tables with different columns, string embedding of entries and columns names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.

Submitted 31 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.16768 [pdf, other]

Synthesis, structural and magnetic characterizations of Li$_4$Cu$_{1-x}$Ni$_x$TeO$_6$ ( $x$ = 0, 0.1, 0.2, 0.5, and 1)

Authors: Ashiwini Balodhi, Brianna Billingsley, Tai Kong, Min Gyu Kim

Abstract: We investigated the effect of Ni doping in a recently proposed quantum spin liquid (QSL) candidate Li$_4$CuTeO$_6$. We performed a comprehensive study on the structural and magnetic properties. We find that the anti-site disorder between Li$^+$ and Cu$^{2+}$ persists until 50% Ni doping in which Ni and Cu occupy different crystallographic sites. As a result, while Cu sits in both triangular and honeycomb layers in Li$_4$CuTeO$_6$, Ni forms only honeycomb layer in Li$_4$NiTeO$_6$ and Li$_4$Cu$_{0.5}$Ni$_{0.5}$TeO$_6$. Our magnetic susceptibility measurements show that the Weiss temperature decreases from -145.68 K in Li$_4$CuTeO$_6$ to -6.15 K in Li$_4$NiTeO$_6$ as Ni doping increases, and find no hint of magnetic ordering or freezing down to 1.8 K. Our analysis implies the existence of abundant low energy excitations in these materials.

Submitted 26 February, 2024; originally announced February 2024.

Comments: 7 pages, 4 figures

arXiv:2402.16307 [pdf, ps, other]

Analyzing Downlink Coverage in Clustered Low Earth Orbit Satellite Constellations: A Stochastic Geometry Approach

Authors: Miyeon Lee, Sucheol Kim, Minje Kim, Dong-Hyun Jung, Junil Choi

Abstract: Satellite networks are emerging as vital solutions for global connectivity beyond 5G. As companies such as SpaceX, OneWeb, and Amazon are poised to launch a large number of satellites in low Earth orbit, the heightened inter-satellite interference caused by mega-constellations has become a significant concern. To address this challenge, recent works have introduced the concept of satellite cluster networks where multiple satellites in a cluster collaborate to enhance the network performance. In order to investigate the performance of these networks, we propose mathematical analyses by modeling the locations of satellites and users using Poisson point processes, building on the success of stochastic geometry-based analyses for satellite networks. In particular, we suggest the lower and upper bounds of the coverage probability as functions of the system parameters, including satellite density, satellite altitude, satellite cluster area, path loss exponent, and Nakagami parameter $m$. We validate the analytical expressions by comparing them with simulation results. Our analyses can be used to design reliable satellite cluster networks by effectively estimating the impact of system parameters on the coverage performance.

Submitted 29 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: submitted to IEEE Transactions on Communications

arXiv:2402.16021 [pdf, other]

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single model counterparts consistently, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2402.15643 [pdf, other]

doi 10.1145/3613904.3642636

Artful Path to Healing: Using Machine Learning for Visual Art Recommendation to Prevent and Reduce Post-Intensive Care

Authors: Bereket A. Yilma, Chan Mi Kim, Gerald C. Cupchik, Luis A. Leiva

Abstract: Staying in the intensive care unit (ICU) is often traumatic, leading to post-intensive care syndrome (PICS), which encompasses physical, psychological, and cognitive impairments. Currently, there are limited interventions available for PICS. Studies indicate that exposure to visual art may help address the psychological aspects of PICS and be more effective if it is personalized. We develop Machine Learning-based Visual Art Recommendation Systems (VA RecSys) to enable personalized therapeutic visual art experiences for post-ICU patients. We investigate four state-of-the-art VA RecSys engines, evaluating the relevance of their recommendations for therapeutic purposes compared to expert-curated recommendations. We conduct an expert pilot test and a large-scale user study (n=150) to assess the appropriateness and effectiveness of these recommendations. Our results suggest all recommendations enhance temporal affective states. Visual and multimodal VA RecSys engines compare favourably with expert-curated recommendations, indicating their potential to support the delivery of personalized art therapy for PICS prevention and treatment.

Submitted 23 February, 2024; originally announced February 2024.

Comments: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI 24)

arXiv:2402.15151 [pdf, other]

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Authors: Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

Abstract: In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. In the translation dataset, the MuAViC benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements compared to the recent model trained with 433 hours of data.

Submitted 13 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: An Erratum was added on the last page of this paper

arXiv:2402.15046 [pdf, other]

CARBD-Ko: A Contextually Annotated Review Benchmark Dataset for Aspect-Level Sentiment Classification in Korean

Authors: Dongjun Jang, Jean Seo, Sungjoo Byun, Taekyoung Kim, Minseok Kim, Hyopil Shin

Abstract: This paper explores the challenges posed by aspect-based sentiment classification (ABSC) within pretrained language models (PLMs), with a particular focus on contextualization and hallucination issues. In order to tackle these challenges, we introduce CARBD-Ko (a Contextually Annotated Review Benchmark Dataset for Aspect-Based Sentiment Classification in Korean), a benchmark dataset that incorporates aspects and dual-tagged polarities to distinguish between aspect-specific and aspect-agnostic sentiment classification. The dataset consists of sentences annotated with specific aspects, aspect polarity, aspect-agnostic polarity, and the intensity of aspects. To address the issue of dual-tagged aspect polarities, we propose a novel approach employing a Siamese Network. Our experimental findings highlight the inherent difficulties in accurately predicting dual-polarities and underscore the significance of contextualized sentiment analysis models. The CARBD-Ko dataset serves as a valuable resource for future research endeavors in aspect-level sentiment classification.

Submitted 22 February, 2024; originally announced February 2024.

Showing 101–150 of 2,932 results for author: Kim, M