Search | arXiv e-print repository

arXiv:2406.08222 [pdf]

A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion

Authors: Sha Luo, Sang Jung Kim, Zening Duan, Kai** Chen

Abstract: In the evolving landscape of computer vision (CV) technologies, the automatic detection and interpretation of gender and emotion in images is a critical area of study. This paper investigates social biases in CV models, emphasizing the limitations of traditional evaluation metrics such as precision, recall, and accuracy. These metrics often fall short in capturing the complexities of gender and em… ▽ More In the evolving landscape of computer vision (CV) technologies, the automatic detection and interpretation of gender and emotion in images is a critical area of study. This paper investigates social biases in CV models, emphasizing the limitations of traditional evaluation metrics such as precision, recall, and accuracy. These metrics often fall short in capturing the complexities of gender and emotion, which are fluid and culturally nuanced constructs. Our study proposes a sociotechnical framework for evaluating CV models, incorporating both technical performance measures and considerations of social fairness. Using a dataset of 5,570 images related to vaccination and climate change, we empirically compared the performance of various CV models, including traditional models like DeepFace and FER, and generative models like GPT-4 Vision. Our analysis involved manually validating the gender and emotional expressions in a subset of images to serve as benchmarks. Our findings reveal that while GPT-4 Vision outperforms other models in technical accuracy for gender classification, it exhibits discriminatory biases, particularly in response to transgender and non-binary personas. Furthermore, the model's emotion detection skew heavily towards positive emotions, with a notable bias towards associating female images with happiness, especially when prompted by male personas. These findings underscore the necessity of develo** more comprehensive evaluation criteria that address both validity and discriminatory biases in CV models. Our proposed framework provides guidelines for researchers to critically assess CV tools, ensuring their application in communication research is both ethical and effective. The significant contribution of this study lies in its emphasis on a sociotechnical approach, advocating for CV technologies that support social good and mitigate biases rather than perpetuate them. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.01079 [pdf, other]

Object Aware Egocentric Online Action Detection

Authors: Joungbin An, Yunsu Park, Hyolim Kang, Seon Joo Kim

Abstract: Advancements in egocentric video datasets like Ego4D, EPIC-Kitchens, and Ego-Exo4D have enriched the study of first-person human interactions, which is crucial for applications in augmented reality and assisted living. Despite these advancements, current Online Action Detection methods, which efficiently detect actions in streaming videos, are predominantly designed for exocentric views and thus f… ▽ More Advancements in egocentric video datasets like Ego4D, EPIC-Kitchens, and Ego-Exo4D have enriched the study of first-person human interactions, which is crucial for applications in augmented reality and assisted living. Despite these advancements, current Online Action Detection methods, which efficiently detect actions in streaming videos, are predominantly designed for exocentric views and thus fail to capitalize on the unique perspectives inherent to egocentric videos. To address this gap, we introduce an Object-Aware Module that integrates egocentric-specific priors into existing OAD frameworks, enhancing first-person footage interpretation. Utilizing object-specific details and temporal dynamics, our module improves scene understanding in detecting actions. Validated extensively on the Epic-Kitchens 100 dataset, our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements, marking an important step forward in adapting action detection systems to egocentric video analysis. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: CVPR First Joint Egocentric Vision Workshop 2024

arXiv:2406.01020 [pdf, other]

CLIP-Guided Attribute Aware Pretraining for Generalizable Image Quality Assessment

Authors: Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim

Abstract: In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Also, some approaches propose vision language models (VLM) based IQA, but the domain gap between generic VLM and IQA constrains their scalabi… ▽ More In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Also, some approaches propose vision language models (VLM) based IQA, but the domain gap between generic VLM and IQA constrains their scalability. In this work, we propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from VLM and leveraging the scalability of large datasets. Specifically, we carefully select optimal text prompts for five representative image quality attributes and use VLM to generate pseudo-labels. Numerous attribute-aware pseudo-labels can be generated with large image datasets, allowing our IQA model to learn rich representations about image quality. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities. Leveraging these strengths, we propose several applications, such as evaluating image generation models and training image enhancement models, demonstrating our model's real-world applicability. We will make the code available for access. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.06284 [pdf, other]

Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention

Authors: Ju-Hyeon Nam, Nur Suriza Syazwany, Su Jung Kim, Sang-Chul Lee

Abstract: Generalizability in deep neural networks plays a pivotal role in medical image segmentation. However, deep learning-based medical image analyses tend to overlook the importance of frequency variance, which is critical element for achieving a model that is both modality-agnostic and domain-generalizable. Additionally, various models fail to account for the potential information loss that can arise… ▽ More Generalizability in deep neural networks plays a pivotal role in medical image segmentation. However, deep learning-based medical image analyses tend to overlook the importance of frequency variance, which is critical element for achieving a model that is both modality-agnostic and domain-generalizable. Additionally, various models fail to account for the potential information loss that can arise from multi-task learning under deep supervision, a factor that can impair the model representation ability. To address these challenges, we propose a Modality-agnostic Domain Generalizable Network (MADGNet) for medical image segmentation, which comprises two key components: a Multi-Frequency in Multi-Scale Attention (MFMSA) block and Ensemble Sub-Decoding Module (E-SDM). The MFMSA block refines the process of spatial feature extraction, particularly in capturing boundary features, by incorporating multi-frequency and multi-scale features, thereby offering informative cues for tissue outline and anatomical structures. Moreover, we propose E-SDM to mitigate information loss in multi-task learning with deep supervision, especially during substantial upsampling from low resolution. We evaluate the segmentation performance of MADGNet across six modalities and fifteen datasets. Through extensive experiments, we demonstrate that MADGNet consistently outperforms state-of-the-art models across various modalities, showcasing superior segmentation performance. This affirms MADGNet as a robust solution for medical image segmentation that excels in diverse imaging scenarios. Our MADGNet code is available in GitHub Link. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: Accepted in Computer Vision and Pattern Recognition (CVPR) 2024

arXiv:2402.18277 [pdf, other]

Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing

Authors: Dongyoung Kim, **woo Kim, Junsang Yu, Seon Joo Kim

Abstract: White balance (WB) algorithms in many commercial cameras assume single and uniform illumination, leading to undesirable results when multiple lighting sources with different chromaticities exist in the scene. Prior research on multi-illuminant WB typically predicts illumination at the pixel level without fully gras** the scene's actual lighting conditions, including the number and color of light… ▽ More White balance (WB) algorithms in many commercial cameras assume single and uniform illumination, leading to undesirable results when multiple lighting sources with different chromaticities exist in the scene. Prior research on multi-illuminant WB typically predicts illumination at the pixel level without fully gras** the scene's actual lighting conditions, including the number and color of light sources. This often results in unnatural outcomes lacking in overall consistency. To handle this problem, we present a deep white balancing model that leverages the slot attention, where each slot is in charge of representing individual illuminants. This design enables the model to generate chromaticities and weight maps for individual illuminants, which are then fused to compose the final illumination map. Furthermore, we propose the centroid-matching loss, which regulates the activation of each slot based on the color range, thereby enhancing the model to separate illumination more effectively. Our method achieves the state-of-the-art performance on both single- and multi-illuminant WB benchmarks, and also offers additional information such as the number of illuminants in the scene and their chromaticity. This capability allows for illumination editing, an application not feasible with prior methods. △ Less

Submitted 28 February, 2024; originally announced February 2024.

Comments: CVPR 2024

arXiv:2312.04885 [pdf, other]

VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement

Authors: Hanjung Kim, Jaehyun Kang, Miran Heo, Sukjun Hwang, Seoung Wug Oh, Seon Joo Kim

Abstract: In recent years, online Video Instance Segmentation (VIS) methods have shown remarkable advancement with their powerful query-based detectors. Utilizing the output queries of the detector at the frame-level, these methods achieve high accuracy on challenging benchmarks. However, our observations demonstrate that these methods heavily rely on location information, which often causes incorrect assoc… ▽ More In recent years, online Video Instance Segmentation (VIS) methods have shown remarkable advancement with their powerful query-based detectors. Utilizing the output queries of the detector at the frame-level, these methods achieve high accuracy on challenging benchmarks. However, our observations demonstrate that these methods heavily rely on location information, which often causes incorrect associations between objects. This paper presents that a key axis of object matching in trackers is appearance information, which becomes greatly instructive under conditions where positional cues are insufficient for distinguishing their identities. Therefore, we suggest a simple yet powerful extension to object decoders that explicitly extract embeddings from backbone features and drive queries to capture the appearances of objects, which greatly enhances instance association accuracy. Furthermore, recognizing the limitations of existing benchmarks in fully evaluating appearance awareness, we have constructed a synthetic dataset to rigorously validate our method. By effectively resolving the over-reliance on location information, we achieve state-of-the-art results on YouTube-VIS 2019/2021 and Occluded VIS (OVIS). Code is available at https://github.com/KimHanjung/VISAGE. △ Less

Submitted 8 March, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: Technical report

arXiv:2310.08929 [pdf, other]

Leveraging Image Augmentation for Object Manipulation: Towards Interpretable Controllability in Object-Centric Learning

Authors: **woo Kim, Janghyuk Choi, Jaehyun Kang, Changyeon Lee, Ho-** Choi, Seon Joo Kim

Abstract: The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills through the comprehension of the world in terms of symbol-like entities. Especially in the field of computer vision, object-centric learning (OCL) is extensively researched to better understand complex scenes by acquiring object representations or slots. While recent stu… ▽ More The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills through the comprehension of the world in terms of symbol-like entities. Especially in the field of computer vision, object-centric learning (OCL) is extensively researched to better understand complex scenes by acquiring object representations or slots. While recent studies in OCL have made strides with complex images or videos, the interpretability and interactivity over object representation remain largely uncharted, still holding promise in the field of OCL. In this paper, we introduce a novel method, Slot Attention with Image Augmentation (SlotAug), to explore the possibility of learning interpretable controllability over slots in a self-supervised manner by utilizing an image augmentation strategy. We also devise the concept of sustainability in controllable slots by introducing iterative and reversible controls over slots with two proposed submethods: Auxiliary Identity Manipulation and Slot Consistency Loss. Extensive empirical studies and theoretical validation confirm the effectiveness of our approach, offering a novel capability for interpretable and sustainable control of object representations. △ Less

Submitted 1 March, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

arXiv:2308.10269 [pdf, other]

Domain Reduction Strategy for Non Line of Sight Imaging

Authors: Hyunbo Shim, In Cho, Daekyu Kwon, Seon Joo Kim

Abstract: This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under various setups. Our method is built upon the observation that photons returning from each point in hidden volumes can be independently computed if the interactions between hidden surfaces are trivially ignored. We model the generalized light propagation function t… ▽ More This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under various setups. Our method is built upon the observation that photons returning from each point in hidden volumes can be independently computed if the interactions between hidden surfaces are trivially ignored. We model the generalized light propagation function to accurately represent the transients as a linear combination of these functions. Moreover, our proposed method includes a domain reduction procedure to exclude empty areas of the hidden volumes from the set of propagation functions, thereby improving computational efficiency of the optimization. We demonstrate the effectiveness of the method in various NLOS scenarios, including non-planar relay wall, sparse scanning patterns, confocal and non-confocal, and surface geometry reconstruction. Experiments conducted on both synthetic and real-world data clearly support the superiority and the efficiency of the proposed method in general NLOS scenarios. △ Less

Submitted 20 August, 2023; originally announced August 2023.

arXiv:2303.17842 [pdf, other]

Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning

Authors: **woo Kim, Janghyuk Choi, Ho-** Choi, Seon Joo Kim

Abstract: Object-centric learning (OCL) aspires general and compositional understanding of scenes by representing a scene as a collection of object-centric representations. OCL has also been extended to multi-view image and video datasets to apply various data-driven inductive biases by utilizing geometric or temporal information in the multi-image data. Single-view images carry less information about how t… ▽ More Object-centric learning (OCL) aspires general and compositional understanding of scenes by representing a scene as a collection of object-centric representations. OCL has also been extended to multi-view image and video datasets to apply various data-driven inductive biases by utilizing geometric or temporal information in the multi-image data. Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do. Hence, owing to the difficulty of applying inductive biases, OCL for single-view images remains challenging, resulting in inconsistent learning of object-centric representation. To this end, we introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention. The new modules, Attention Refining Kernel (ARK) and Intermediate Point Predictor and Encoder (IPPE), respectively, prevent slots from being distracted by the background noise and indicate locations for slots to focus on to facilitate learning of object-centric representation. We also propose a weak semi-supervision approach for OCL, whilst our proposed framework can be used without any assistant annotation during the inference. Experiments show that our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets. Code is available at \url{https://github.com/object-understanding/SLASH}. △ Less

Submitted 31 March, 2023; originally announced March 2023.

arXiv:2301.06244 [pdf, other]

Haptic Transparency and Interaction Force Control for a Lower-Limb Exoskeleton

Authors: Emek Barış Küçüktabak, Yue Wen, Sangjoon J. Kim, Matthew Short, Daniel Ludvig, Levi Hargrove, Eric Perreault, Kevin Lynch, Jose Pons

Abstract: Controlling the interaction forces between a human and an exoskeleton is crucial for providing transparency or adjusting assistance or resistance levels. However, it is an open problem to control the interaction forces of lower-limb exoskeletons designed for unrestricted overground walking. For these types of exoskeletons, it is challenging to implement force/torque sensors at every contact betwee… ▽ More Controlling the interaction forces between a human and an exoskeleton is crucial for providing transparency or adjusting assistance or resistance levels. However, it is an open problem to control the interaction forces of lower-limb exoskeletons designed for unrestricted overground walking. For these types of exoskeletons, it is challenging to implement force/torque sensors at every contact between the user and the exoskeleton for direct force measurement. Moreover, it is important to compensate for the exoskeleton's whole-body gravitational and dynamical forces, especially for heavy lower-limb exoskeletons. Previous works either simplified the dynamic model by treating the legs as independent double pendulums, or they did not close the loop with interaction force feedback. The proposed whole-exoskeleton closed-loop compensation (WECC) method calculates the interaction torques during the complete gait cycle by using whole-body dynamics and joint torque measurements on a hip-knee exoskeleton. Furthermore, it uses a constrained optimization scheme to track desired interaction torques in a closed loop while considering physical and safety constraints. We evaluated the haptic transparency and dynamic interaction torque tracking of WECC control on three subjects. We also compared the performance of WECC with a controller based on a simplified dynamic model and a passive version of the exoskeleton. The WECC controller results in a consistently low absolute interaction torque error during the whole gait cycle for both zero and nonzero desired interaction torques. In contrast, the simplified controller yields poor performance in tracking desired interaction torques during the stance phase. △ Less

Submitted 22 January, 2024; v1 submitted 15 January, 2023; originally announced January 2023.

Comments: 19 pages, 13 figures. Accepted for publication in the IEEE Transactions on Robotics (T-RO)

arXiv:2211.09385 [pdf, other]

ComMU: Dataset for Combinatorial Music Generation

Authors: Lee Hyun, Taehyun Kim, Hyolim Kang, Minjoo Ki, Hyeonchan Hwang, Kwanho Park, Sharang Han, Seon Joo Kim

Abstract: Commercial adoption of automatic music composition requires the capability of generating diverse and high-quality music suitable for the desired context (e.g., music for romantic movies, action games, restaurants, etc.). In this paper, we introduce combinatorial music generation, a new task to create varying background music based on given conditions. Combinatorial music generation creates short s… ▽ More Commercial adoption of automatic music composition requires the capability of generating diverse and high-quality music suitable for the desired context (e.g., music for romantic movies, action games, restaurants, etc.). In this paper, we introduce combinatorial music generation, a new task to create varying background music based on given conditions. Combinatorial music generation creates short samples of music with rich musical metadata, and combines them to produce a complete music. In addition, we introduce ComMU, the first symbolic music dataset consisting of short music samples and their corresponding 12 musical metadata for combinatorial music generation. Notable properties of ComMU are that (1) dataset is manually constructed by professional composers with an objective guideline that induces regularity, and (2) it has 12 musical metadata that embraces composers' intentions. Our results show that we can generate diverse high-quality music only with metadata, and that our unique metadata such as track-role and extended chord quality improves the capacity of the automatic composition. We highly recommend watching our video before reading the paper (https://pozalabs.github.io/ComMU). △ Less

Submitted 17 November, 2022; originally announced November 2022.

Comments: 19 pages, 12 figures

arXiv:2211.08834 [pdf, other]

A Generalized Framework for Video Instance Segmentation

Authors: Miran Heo, Sukjun Hwang, Jeongseok Hyun, Hanjung Kim, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim

Abstract: The handling of long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods have limitations in addressing this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a Generalized fram… ▽ More The handling of long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods have limitations in addressing this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks without designing complicated architectures or requiring extra post-processing. The key contribution of GenVIS is the learning strategy, which includes a query-based training pipeline for sequential learning with a novel target label assignment. Additionally, we introduce a memory that effectively acquires information from previous states. Thanks to the new perspective, which focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in both online and semi-online manner. We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS). Notably, we greatly outperform the state-of-the-art on the long VIS benchmark (OVIS), improving 5.6 AP with ResNet-50 backbone. Code is available at https://github.com/miranheo/GenVIS. △ Less

Submitted 24 March, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: CVPR 2023

arXiv:2211.06023 [pdf, other]

Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks

Authors: Hyolim Kang, Hanjung Kim, Joungbin An, Minsu Cho, Seon Joo Kim

Abstract: Temporal Action Localization (TAL) methods typically operate on top of feature sequences from a frozen snippet encoder that is pretrained with the Trimmed Action Classification (TAC) tasks, resulting in a task discrepancy problem. While existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end fine-tuning, they commonly require an overload of hi… ▽ More Temporal Action Localization (TAL) methods typically operate on top of feature sequences from a frozen snippet encoder that is pretrained with the Trimmed Action Classification (TAC) tasks, resulting in a task discrepancy problem. While existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end fine-tuning, they commonly require an overload of high memory and computation. In this work, we introduce Soft-Landing (SoLa) strategy, an efficient yet effective framework to bridge the transferability gap between the pretrained encoder and the downstream tasks by incorporating a light-weight neural network, i.e., a SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme for the SoLa module; it learns with inter-frame Similarity Matching that uses the frame interval as its supervisory signal, eliminating the need for temporal annotations. Experimental evaluation on various benchmarks for downstream TAL tasks shows that our method effectively alleviates the task discrepancy problem with remarkable computational efficiency. △ Less

Submitted 11 November, 2022; originally announced November 2022.

arXiv:2210.02526 [pdf, other]

"No, they did not": Dialogue response dynamics in pre-trained language models

Authors: Sanghee J. Kim, Lang Yu, Allyson Ettinger

Abstract: A critical component of competence in language is being able to identify relevant components of an utterance and reply appropriately. In this paper we examine the extent of such dialogue response sensitivity in pre-trained language models, conducting a series of experiments with a particular focus on sensitivity to dynamics involving phenomena of at-issueness and ellipsis. We find that models show… ▽ More A critical component of competence in language is being able to identify relevant components of an utterance and reply appropriately. In this paper we examine the extent of such dialogue response sensitivity in pre-trained language models, conducting a series of experiments with a particular focus on sensitivity to dynamics involving phenomena of at-issueness and ellipsis. We find that models show clear sensitivity to a distinctive role of embedded clauses, and a general preference for responses that target main clause content of prior utterances. However, the results indicate mixed and generally weak trends with respect to capturing the full range of dynamics involved in targeting at-issue versus not-at-issue content. Additionally, models show fundamental limitations in grasp of the dynamics governing ellipsis, and response selections show clear interference from superficial factors that outweigh the influence of principled discourse constraints. △ Less

Submitted 5 October, 2022; originally announced October 2022.

Comments: 12 pages, 8 figures, COLING 2022, see https://github.com/sangheek16/dialogue-response-dynamics for codes and material

arXiv:2207.10391 [pdf, other]

Error Compensation Framework for Flow-Guided Video Inpainting

Authors: Jaeyeon Kang, Seoung Wug Oh, Seon Joo Kim

Abstract: The key to video inpainting is to use correlation information from as many reference frames as possible. Existing flow-based propagation methods split the video synthesis process into multiple steps: flow completion -> pixel propagation -> synthesis. However, there is a significant drawback that the errors in each step continue to accumulate and amplify in the next step. To this end, we propose an… ▽ More The key to video inpainting is to use correlation information from as many reference frames as possible. Existing flow-based propagation methods split the video synthesis process into multiple steps: flow completion -> pixel propagation -> synthesis. However, there is a significant drawback that the errors in each step continue to accumulate and amplify in the next step. To this end, we propose an Error Compensation Framework for Flow-guided Video Inpainting (ECFVI), which takes advantage of the flow-based method and offsets its weaknesses. We address the weakness with the newly designed flow completion module and the error compensation network that exploits the error guidance map. Our approach greatly improves the temporal consistency and the visual quality of the completed videos. Experimental results show the superior performance of our proposed method with the speed up of x6, compared to the state-of-the-art methods. In addition, we present a new benchmark dataset for evaluation by supplementing the weaknesses of existing test datasets. △ Less

Submitted 21 July, 2022; originally announced July 2022.

Comments: ECCV2022 accepted

arXiv:2206.06817 [pdf]

doi 10.1016/j.ijheatmasstransfer.2023.124900

Residual-based physics-informed transfer learning: A hybrid method for accelerating long-term CFD simulations via deep learning

Authors: Joongoo Jeon, Juhyeong Lee, Ricardo Vinuesa, Sung Joong Kim

Abstract: While a big wave of artificial intelligence (AI) has propagated to the field of computational fluid dynamics (CFD) acceleration studies, recent research has highlighted that the development of AI techniques that reconciles the following goals remains our primary task: (1) accurate prediction of unseen (future) time series in long-term CFD simulations (2) acceleration of simulations (3) an acceptab… ▽ More While a big wave of artificial intelligence (AI) has propagated to the field of computational fluid dynamics (CFD) acceleration studies, recent research has highlighted that the development of AI techniques that reconciles the following goals remains our primary task: (1) accurate prediction of unseen (future) time series in long-term CFD simulations (2) acceleration of simulations (3) an acceptable amount of training data and time (4) within a multiple PDEs condition. In this study, we propose a residual-based physics-informed transfer learning (RePIT) strategy to achieve these four objectives using ML-CFD hybrid computation. Our hypothesis is that long-term CFD simulation is feasible with the hybrid method where CFD and AI alternately calculate time series while monitoring the first principle's residuals. The feasibility of RePIT strategy was verified through a CFD case study on natural convection. In a single training approach, a residual scale change occurred around 100th timestep, resulting in predicted time series exhibiting non-physical patterns as well as a significant deviations from the ground truth. Conversely, RePIT strategy maintained the residuals within the defined range and demonstrated good accuracy throughout the entire simulation period. The maximum error from the ground truth was below 0.4 K for temperature and 0.024 m/s for x-axis velocity. Furthermore, the average time for 1 timestep by the ML-GPU and CFD-CPU calculations was 0.171 s and 0.015 s, respectively. Including the parameter-updating time, the simulation was accelerated by a factor of 1.9. In conclusion, our RePIT strategy is a promising technique to reduce the cost of CFD simulations in industry. However, more vigorous optimization and improvement studies are still necessary. △ Less

Submitted 26 November, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: 36 pages, 10 figures

Journal ref: International Journal of Heat and Mass Transfer 220 (2024) 124900

arXiv:2206.04403 [pdf, other]

VITA: Video Instance Segmentation via Object Token Association

Authors: Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim

Abstract: We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the entire sequence. To this end, we propose VITA, a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model. Specifically, we use an image object detector a… ▽ More We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the entire sequence. To this end, we propose VITA, a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model. Specifically, we use an image object detector as a means of distilling object-specific contexts into object tokens. VITA accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features. By effectively building relationships between objects using the condensed information, VITA achieves the state-of-the-art on VIS benchmarks with a ResNet-50 backbone: 49.8 AP, 45.7 AP on YouTube-VIS 2019 & 2021, and 19.6 AP on OVIS. Moreover, thanks to its object token-based structure that is disjoint from the backbone features, VITA shows several practical advantages that previous offline VIS methods have not explored - handling long and high-resolution videos with a common GPU, and freezing a frame-level detector trained on image domain. Code is available at https://github.com/sukjunhwang/VITA. △ Less

Submitted 20 October, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

arXiv:2206.02116 [pdf, other]

Cannot See the Forest for the Trees: Aggregating Multiple Viewpoints to Better Classify Objects in Videos

Authors: Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim

Abstract: Recently, both long-tailed recognition and object tracking have made great advances individually. TAO benchmark presented a mixture of the two, long-tailed object tracking, in order to further reflect the aspect of the real-world. To date, existing solutions have adopted detectors showing robustness in long-tailed distributions, which derive per-frame results. Then, they used tracking algorithms t… ▽ More Recently, both long-tailed recognition and object tracking have made great advances individually. TAO benchmark presented a mixture of the two, long-tailed object tracking, in order to further reflect the aspect of the real-world. To date, existing solutions have adopted detectors showing robustness in long-tailed distributions, which derive per-frame results. Then, they used tracking algorithms that combine the temporally independent detections to finalize tracklets. However, as the approaches did not take temporal changes in scenes into account, inconsistent classification results in videos led to low overall performance. In this paper, we present a set classifier that improves accuracy of classifying tracklets by aggregating information from multiple viewpoints contained in a tracklet. To cope with sparse annotations in videos, we further propose augmentation of tracklets that can maximize data efficiency. The set classifier is plug-and-playable to existing object trackers, and highly improves the performance of long-tailed object tracking. By simply attaching our method to QDTrack on top of ResNet-101, we achieve the new state-of-the-art, 19.9% and 15.7% TrackAP_50 on TAO validation and test sets, respectively. △ Less

Submitted 5 June, 2022; originally announced June 2022.

Comments: Accepted to CVPR 2022

arXiv:2201.01479 [pdf, other]

Gradient-based Bit Encoding Optimization for Noise-Robust Binary Memristive Crossbar

Authors: Youngeun Kim, Hyunsoo Kim, Seijoon Kim, Sang Joon Kim, Priyadarshini Panda

Abstract: Binary memristive crossbars have gained huge attention as an energy-efficient deep learning hardware accelerator. Nonetheless, they suffer from various noises due to the analog nature of the crossbars. To overcome such limitations, most previous works train weight parameters with noise data obtained from a crossbar. These methods are, however, ineffective because it is difficult to collect noise d… ▽ More Binary memristive crossbars have gained huge attention as an energy-efficient deep learning hardware accelerator. Nonetheless, they suffer from various noises due to the analog nature of the crossbars. To overcome such limitations, most previous works train weight parameters with noise data obtained from a crossbar. These methods are, however, ineffective because it is difficult to collect noise data in large-volume manufacturing environment where each crossbar has a large device/circuit level variation. Moreover, we argue that there is still room for improvement even though these methods somewhat improve accuracy. This paper explores a new perspective on mitigating crossbar noise in a more generalized way by manipulating input binary bit encoding rather than training the weight of networks with respect to noise data. We first mathematically show that the noise decreases as the number of binary bit encoding pulses increases when representing the same amount of information. In addition, we propose Gradient-based Bit Encoding Optimization (GBO) which optimizes a different number of pulses at each layer, based on our in-depth analysis that each layer has a different level of noise sensitivity. The proposed heterogeneous layer-wise bit encoding scheme achieves high noise robustness with low computational cost. Our experimental results on public benchmark datasets show that GBO improves the classification accuracy by ~5-40% in severe noise scenarios. △ Less

Submitted 5 January, 2022; originally announced January 2022.

Comments: Accepted to DATE2022

arXiv:2112.04177 [pdf, other]

VISOLO: Grid-Based Space-Time Aggregation for Efficient Online Video Instance Segmentation

Authors: Su Ho Han, Sukjun Hwang, Seoung Wug Oh, Yeonchool Park, Hyunwoo Kim, Min-Jung Kim, Seon Joo Kim

Abstract: For online video instance segmentation (VIS), fully utilizing the information from previous frames in an efficient manner is essential for real-time applications. Most previous methods follow a two-stage approach requiring additional computations such as RPN and RoIAlign, and do not fully exploit the available information in the video for all subtasks in VIS. In this paper, we propose a novel sing… ▽ More For online video instance segmentation (VIS), fully utilizing the information from previous frames in an efficient manner is essential for real-time applications. Most previous methods follow a two-stage approach requiring additional computations such as RPN and RoIAlign, and do not fully exploit the available information in the video for all subtasks in VIS. In this paper, we propose a novel single-stage framework for online VIS built based on the grid structured feature representation. The grid-based features allow us to employ fully convolutional networks for real-time processing, and also to easily reuse and share features within different components. We also introduce cooperatively operating modules that aggregate information from available frames, in order to enrich the features for all subtasks in VIS. Our design fully takes advantage of previous information in a grid form for all tasks in VIS in an efficient way, and we achieved the new state-of-the-art accuracy (38.6 AP and 36.9 AP) and speed (40.0 FPS) on YouTube-VIS 2019 and 2021 datasets among online VIS methods. The code is available at https://github.com/SuHoHan95/VISOLO. △ Less

Submitted 30 March, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

arXiv:2111.14799 [pdf, other]

UBoCo : Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection

Authors: Hyolim Kang, **woo Kim, Taehyun Kim, Seon Joo Kim

Abstract: Generic Event Boundary Detection (GEBD) is a newly suggested video understanding task that aims to find one level deeper semantic boundaries of events. Bridging the gap between natural human perception and video understanding, it has various potential applications, including interpretable and semantically valid video parsing. Still at an early development stage, existing GEBD solvers are simple ex… ▽ More Generic Event Boundary Detection (GEBD) is a newly suggested video understanding task that aims to find one level deeper semantic boundaries of events. Bridging the gap between natural human perception and video understanding, it has various potential applications, including interpretable and semantically valid video parsing. Still at an early development stage, existing GEBD solvers are simple extensions of relevant video understanding tasks, disregarding GEBD's distinctive characteristics. In this paper, we propose a novel framework for unsupervised/supervised GEBD, by using the Temporal Self-similarity Matrix (TSM) as the video representation. The new Recursive TSM Parsing (RTP) algorithm exploits local diagonal patterns in TSM to detect boundaries, and it is combined with the Boundary Contrastive (BoCo) loss to train our encoder to generate more informative TSMs. Our framework can be applied to both unsupervised and supervised settings, with both achieving state-of-the-art performance by a huge margin in GEBD benchmark. Especially, our unsupervised method outperforms the previous state-of-the-art "supervised" model, implying its exceptional efficacy. △ Less

Submitted 29 November, 2021; v1 submitted 29 November, 2021; originally announced November 2021.

arXiv:2106.11549 [pdf, other]

Winning the CVPR'2021 Kinetics-GEBD Challenge: Contrastive Learning Approach

Authors: Hyolim Kang, **woo Kim, Kyungmin Kim, Taehyun Kim, Seon Joo Kim

Abstract: Generic Event Boundary Detection (GEBD) is a newly introduced task that aims to detect "general" event boundaries that correspond to natural human perception. In this paper, we introduce a novel contrastive learning based approach to deal with the GEBD. Our intuition is that the feature similarity of the video snippet would significantly vary near the event boundaries, while remaining relatively t… ▽ More Generic Event Boundary Detection (GEBD) is a newly introduced task that aims to detect "general" event boundaries that correspond to natural human perception. In this paper, we introduce a novel contrastive learning based approach to deal with the GEBD. Our intuition is that the feature similarity of the video snippet would significantly vary near the event boundaries, while remaining relatively the same in the remaining part of the video. In our model, Temporal Self-similarity Matrix (TSM) is utilized as an intermediate representation which takes on a role as an information bottleneck. With our model, we achieved significant performance boost compared to the given baselines. Our code is available at https://github.com/hello-**woo/LOVEU-CVPR2021. △ Less

Submitted 22 June, 2021; originally announced June 2021.

arXiv:2106.03299 [pdf, other]

Video Instance Segmentation using Inter-Frame Communication Transformers

Authors: Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim

Abstract: We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline shows superior performance over per-frame methods leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications, limiting practicality. In this work, we propose… ▽ More We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline shows superior performance over per-frame methods leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications, limiting practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead for information-passing between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a mean of conveying information as well as summarizing each frame scene. The features of each frame are enriched and correlated with other frames through exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmark sets and achieved the state-of-the-art performance (AP 44.6 on YouTube-VIS 2019 val set using the offline inference) while having a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference for processing a video in real-time with only a small delay. The code will be made available. △ Less

Submitted 6 June, 2021; originally announced June 2021.

arXiv:2105.14584 [pdf, other]

Polygonal Point Set Tracking

Authors: Gunhee Nam, Miran Heo, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim

Abstract: In this paper, we propose a novel learning-based polygonal point set tracking method. Compared to existing video object segmentation~(VOS) methods that propagate pixel-wise object mask information, we propagate a polygonal point set over frames. Specifically, the set is defined as a subset of points in the target contour, and our goal is to track corresponding points on the target contour. Those… ▽ More In this paper, we propose a novel learning-based polygonal point set tracking method. Compared to existing video object segmentation~(VOS) methods that propagate pixel-wise object mask information, we propagate a polygonal point set over frames. Specifically, the set is defined as a subset of points in the target contour, and our goal is to track corresponding points on the target contour. Those outputs enable us to apply various visual effects such as motion tracking, part deformation, and texture map**. To this end, we propose a new method to track the corresponding points between frames by the global-local alignment with delicately designed losses and regularization terms. We also introduce a novel learning strategy using synthetic and VOS datasets that makes it possible to tackle the problem without develo** the point correspondence dataset. Since the existing datasets are not suitable to validate our method, we build a new polygonal point set tracking dataset and demonstrate the superior performance of our method over the baselines and existing contour-based VOS methods. In addition, we present visual-effects applications of our method on part distortion and text map**. △ Less

Submitted 30 May, 2021; originally announced May 2021.

Comments: 14 pages, 10 figures, 6 tables

arXiv:2105.03332 [pdf]

doi 10.1002/er.7879

Finite volume method network for acceleration of unsteady computational fluid dynamics: non-reacting and reacting flows

Authors: Joongoo Jeon, Juhyeong Lee, Sung Joong Kim

Abstract: Despite rapid improvements in the performance of central processing unit (CPU), the calculation cost of simulating chemically reacting flow using CFD remains infeasible in many cases. The application of the convolutional neural networks (CNNs) specialized in image processing in flow field prediction has been studied, but the need to develop a neural netweork design fitted for CFD is recently emerg… ▽ More Despite rapid improvements in the performance of central processing unit (CPU), the calculation cost of simulating chemically reacting flow using CFD remains infeasible in many cases. The application of the convolutional neural networks (CNNs) specialized in image processing in flow field prediction has been studied, but the need to develop a neural netweork design fitted for CFD is recently emerged. In this study, a neural network model introducing the finite volume method (FVM) with a unique network architecture and physics-informed loss function was developed to accelerate CFD simulations. The developed network model, considering the nature of the CFD flow field where the identical governing equations are applied to all grids, can predict the future fields with only two previous fields unlike the CNNs requiring many field images (>10,000). The performance of this baseline model was evaluated using CFD time series data from non-reacting flow and reacting flow simulation; counterflow and hydrogen flame with 20 detailed chemistries. Consequently, we demonstrated that (1) the FVM-based network architecture provided improved accuracy of multistep time series prediction compared to the previous MLP model (2) the physic-informed loss function prevented non-physical overfitting problem and ultimately reduced the error in time series prediction (3) observing the calculated residuals in an unsupervised manner could indirectly estimate the network accuracy. Additionally, under the reacting flow dataset, the computational speed of this network model was measured to be about 10 times faster than that of the CFD solver. △ Less

Submitted 24 July, 2023; v1 submitted 7 May, 2021; originally announced May 2021.

Journal ref: International Journal of Energy Research, 2022

arXiv:2104.08030 [pdf, other]

doi 10.1016/j.patcog.2021.107954

Temporally smooth online action detection using cycle-consistent future anticipation

Authors: Young Hwi Kim, Seonghyeon Nam, Seon Joo Kim

Abstract: Many video understanding tasks work in the offline setting by assuming that the input video is given from the start to the end. However, many real-world problems require the online setting, making a decision immediately using only the current and the past frames of videos such as in autonomous driving and surveillance systems. In this paper, we present a novel solution for online action detection… ▽ More Many video understanding tasks work in the offline setting by assuming that the input video is given from the start to the end. However, many real-world problems require the online setting, making a decision immediately using only the current and the past frames of videos such as in autonomous driving and surveillance systems. In this paper, we present a novel solution for online action detection by using a simple yet effective RNN-based networks called the Future Anticipation and Temporally Smoothing network (FATSnet). The proposed network consists of a module for anticipating the future that can be trained in an unsupervised manner with the cycle-consistency loss, and another component for aggregating the past and the future for temporally smooth frame-by-frame predictions. We also propose a solution to relieve the performance loss when running RNN-based models on very long sequences. Evaluations on TVSeries, THUMOS14, and BBDB show that our method achieve the state-of-the-art performances compared to the previous works on online action detection. △ Less

Submitted 16 April, 2021; originally announced April 2021.

Comments: Accepted by Pattern Recognition

Journal ref: Pattern Recognition, Volume 116, August 2021, 107954

arXiv:2102.05790 [pdf]

doi 10.1038/s41565-021-00967-4

Nonlocal metasurfaces for spectrally decoupled wavefront manipulation and eye tracking

Authors: Jung-Hwan Song, Jorik van de Groep, Soo ** Kim, Mark L. Brongersma

Abstract: Metasurface-based optical elements typically manipulate light waves by imparting space-variant changes in the amplitude and phase with a dense array of scattering nanostructures. The highly-localized and low optical-quality-factor (Q) modes of nanostructures are beneficial for wavefront-sha** as they afford quasi-local control over the electromagnetic fields. However, many emerging imaging, sens… ▽ More Metasurface-based optical elements typically manipulate light waves by imparting space-variant changes in the amplitude and phase with a dense array of scattering nanostructures. The highly-localized and low optical-quality-factor (Q) modes of nanostructures are beneficial for wavefront-sha** as they afford quasi-local control over the electromagnetic fields. However, many emerging imaging, sensing, communication, display, and non-linear optics applications instead require flat, high-Q optical elements that provide notable energy storage and a much higher degree of spectral control over the wavefront. Here, we demonstrate high-Q, nonlocal metasurfaces with atomically-thin metasurface elements that offer notably enhanced light-matter interaction and fully-decoupled optical functions at different wavelengths. We illustrate a possible use of such a flat optic in eye tracking for eye-wear. Here, a metasurface patterned on a regular pair of eye-glasses provides an unperturbed view of the world across the visible spectrum and redirects near-infrared light to a camera to allow imaging of the eye. △ Less

Submitted 10 February, 2021; originally announced February 2021.

arXiv:2102.01163 [pdf]

Visual Framing of Science Conspiracy Videos: Integrating Machine Learning with Communication Theories to Study the Use of Color and Brightness

Authors: Kai** Chen, Sang Jung Kim, Qiantong Gao, Sebastian Raschka

Abstract: Recent years have witnessed an explosion of science conspiracy videos on the Internet, challenging science epistemology and public understanding of science. Scholars have started to examine the persuasion techniques used in conspiracy messages such as uncertainty and fear yet, little is understood about the visual narratives, especially how visual narratives differ in videos that debunk conspiraci… ▽ More Recent years have witnessed an explosion of science conspiracy videos on the Internet, challenging science epistemology and public understanding of science. Scholars have started to examine the persuasion techniques used in conspiracy messages such as uncertainty and fear yet, little is understood about the visual narratives, especially how visual narratives differ in videos that debunk conspiracies versus those that propagate conspiracies. This paper addresses this gap in understanding visual framing in conspiracy videos through analyzing millions of frames from conspiracy and counter-conspiracy YouTube videos using computational methods. We found that conspiracy videos tended to use lower color variance and brightness, especially in thumbnails and earlier parts of the videos. This paper also demonstrates how researchers can integrate textual and visual features in machine learning models to study conspiracies on social media and discusses the implications of computational modeling for scholars interested in studying visual manipulation in the digital era. The analysis of visual and textual features presented in this paper could be useful for future studies focused on designing systems to identify conspiracy content on the Internet. △ Less

Submitted 17 November, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

arXiv:2012.01632 [pdf, other]

Single-shot Path Integrated Panoptic Segmentation

Authors: Sukjun Hwang, Seoung Wug Oh, Seon Joo Kim

Abstract: Panoptic segmentation, which is a novel task of unifying instance segmentation and semantic segmentation, has attracted a lot of attention lately. However, most of the previous methods are composed of multiple pathways with each pathway specialized to a designated segmentation task. In this paper, we propose to resolve panoptic segmentation in single-shot by integrating the execution flows. With t… ▽ More Panoptic segmentation, which is a novel task of unifying instance segmentation and semantic segmentation, has attracted a lot of attention lately. However, most of the previous methods are composed of multiple pathways with each pathway specialized to a designated segmentation task. In this paper, we propose to resolve panoptic segmentation in single-shot by integrating the execution flows. With the integrated pathway, a unified feature map called Panoptic-Feature is generated, which includes the information of both things and stuffs. Panoptic-Feature becomes more sophisticated by auxiliary problems that guide to cluster pixels that belong to the same instance and differentiate between objects of different classes. A collection of convolutional filters, where each filter represents either a thing or stuff, is applied to Panoptic-Feature at once, materializing the single-shot panoptic segmentation. Taking the advantages of both top-down and bottom-up approaches, our method, named SPINet, enjoys high efficiency and accuracy on major panoptic segmentation benchmarks: COCO and Cityscapes. △ Less

Submitted 3 December, 2020; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: 10 pages, 5 figures

arXiv:2007.08786 [pdf, other]

Cross-Identity Motion Transfer for Arbitrary Objects through Pose-Attentive Video Reassembling

Authors: Subin Jeon, Seonghyeon Nam, Seoung Wug Oh, Seon Joo Kim

Abstract: We propose an attention-based networks for transferring motions between arbitrary objects. Given a source image(s) and a driving video, our networks animate the subject in the source images according to the motion in the driving video. In our attention mechanism, dense similarities between the learned keypoints in the source and the driving images are computed in order to retrieve the appearance i… ▽ More We propose an attention-based networks for transferring motions between arbitrary objects. Given a source image(s) and a driving video, our networks animate the subject in the source images according to the motion in the driving video. In our attention mechanism, dense similarities between the learned keypoints in the source and the driving images are computed in order to retrieve the appearance information from the source images. Taking a different approach from the well-studied war** based models, our attention-based model has several advantages. By reassembling non-locally searched pieces from the source contents, our approach can produce more realistic outputs. Furthermore, our system can make use of multiple observations of the source appearance (e.g. front and sides of faces) to make the results more accurate. To reduce the training-testing discrepancy of the self-supervised learning, a novel cross-identity training scheme is additionally introduced. With the training scheme, our networks is trained to transfer motions between different subjects, as in the real testing scenario. Experimental results validate that our method produces visually pleasing results in various object domains, showing better performances compared to previous works. △ Less

Submitted 17 July, 2020; originally announced July 2020.

Comments: ECCV 2020

arXiv:2006.12314 [pdf]

Always-On, Sub-300-nW, Event-Driven Spiking Neural Network based on Spike-Driven Clock-Generation and Clock- and Power-Gating for an Ultra-Low-Power Intelligent Device

Authors: Dewei Wang, Pavan Kumar Chundi, Sung Justin Kim, Minhao Yang, Joao Pedro Cerqueira, Joonsung Kang, Seungchul Jung, Sangjoon Kim, Mingoo Seok

Abstract: Always-on artificial intelligent (AI) functions such as keyword spotting (KWS) and visual wake-up tend to dominate total power consumption in ultra-low power devices. A key observation is that the signals to an always-on function are sparse in time, which a spiking neural network (SNN) classifier can leverage for power savings, because the switching activity and power consumption of SNNs tend to s… ▽ More Always-on artificial intelligent (AI) functions such as keyword spotting (KWS) and visual wake-up tend to dominate total power consumption in ultra-low power devices. A key observation is that the signals to an always-on function are sparse in time, which a spiking neural network (SNN) classifier can leverage for power savings, because the switching activity and power consumption of SNNs tend to scale with spike rate. Toward this goal, we present a novel SNN classifier architecture for always-on functions, demonstrating sub-300nW power consumption at the competitive inference accuracy for a KWS and other always-on classification workloads. △ Less

Submitted 23 June, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

arXiv:2005.01056 [pdf, other]

NTIRE 2020 Challenge on Perceptual Extreme Super-Resolution: Methods and Results

Authors: Kai Zhang, Shuhang Gu, Radu Timofte, Taizhang Shang, Qiuju Dai, Shengchen Zhu, Tong Yang, Yandong Guo, Younghyun Jo, Sejong Yang, Seon Joo Kim, Lin Zha, Jiande Jiang, Xinbo Gao, Wen Lu, **g Liu, Kwang** Yoon, Taegyun Jeon, Kazutoshi Akita, Takeru Ooba, Norimichi Ukita, Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Dongliang He , et al. (38 additional authors not shown)

Abstract: This paper reviews the NTIRE 2020 challenge on perceptual extreme super-resolution with focus on proposed solutions and results. The challenge task was to super-resolve an input image with a magnification factor 16 based on a set of prior examples of low and corresponding high resolution images. The goal is to obtain a network design capable to produce high resolution results with the best percept… ▽ More This paper reviews the NTIRE 2020 challenge on perceptual extreme super-resolution with focus on proposed solutions and results. The challenge task was to super-resolve an input image with a magnification factor 16 based on a set of prior examples of low and corresponding high resolution images. The goal is to obtain a network design capable to produce high resolution results with the best perceptual quality and similar to the ground truth. The track had 280 registered participants, and 19 teams submitted the final results. They gauge the state-of-the-art in single image super-resolution. △ Less

Submitted 3 May, 2020; originally announced May 2020.

Comments: CVPRW 2020

arXiv:2004.02432 [pdf, other]

Deep Space-Time Video Upsampling Networks

Authors: Jaeyeon Kang, Younghyun Jo, Seoung Wug Oh, Peter Vajda, Seon Joo Kim

Abstract: Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems, and the performance have been improving by incorporating deep learning recently. In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. One solution for this is to run VSR and FI, one by one, i… ▽ More Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems, and the performance have been improving by incorporating deep learning recently. In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. One solution for this is to run VSR and FI, one by one, independently. This is highly inefficient as heavy deep neural networks (DNN) are involved in each solution. To this end, we propose an end-to-end DNN framework for the space-time video upsampling by efficiently merging VSR and FI into a joint framework. In our framework, a novel weighting scheme is proposed to fuse input frames effectively without explicit motion compensation for efficient processing of videos. The results show better results both quantitatively and qualitatively, while reducing the computation time (x7 faster) and the number of parameters (30%) compared to baselines. △ Less

Submitted 9 August, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: ECCV2020 accepted

arXiv:2003.09171 [pdf, other]

DMV: Visual Object Tracking via Part-level Dense Memory and Voting-based Retrieval

Authors: Gunhee Nam, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim

Abstract: We propose a novel memory-based tracker via part-level dense memory and voting-based retrieval, called DMV. Since deep learning techniques have been introduced to the tracking field, Siamese trackers have attracted many researchers due to the balance between speed and accuracy. However, most of them are based on a single template matching, which limits the performance as it restricts the accessibl… ▽ More We propose a novel memory-based tracker via part-level dense memory and voting-based retrieval, called DMV. Since deep learning techniques have been introduced to the tracking field, Siamese trackers have attracted many researchers due to the balance between speed and accuracy. However, most of them are based on a single template matching, which limits the performance as it restricts the accessible in-formation to the initial target features. In this paper, we relieve this limitation by maintaining an external memory that saves the tracking record. Part-level retrieval from the memory also liberates the information from the template and allows our tracker to better handle the challenges such as appearance changes and occlusions. By updating the memory during tracking, the representative power for the target object can be enhanced without online learning. We also propose a novel voting mechanism for the memory reading to filter out unreliable information in the memory. We comprehensively evaluate our tracker on OTB-100,TrackingNet, GOT-10k, LaSOT, and UAV123, which show that our method yields comparable results to the state-of-the-art methods. △ Less

Submitted 20 March, 2020; originally announced March 2020.

Comments: 19 pages, 9 figures

arXiv:2003.09124 [pdf, other]

Learning the Loss Functions in a Discriminative Space for Video Restoration

Authors: Younghyun Jo, Jaeyeon Kang, Seoung Wug Oh, Seonghyeon Nam, Peter Vajda, Seon Joo Kim

Abstract: With more advanced deep network architectures and learning schemes such as GANs, the performance of video restoration algorithms has greatly improved recently. Meanwhile, the loss functions for optimizing deep neural networks remain relatively unchanged. To this end, we propose a new framework for building effective loss functions by learning a discriminative space specific to a video restoration… ▽ More With more advanced deep network architectures and learning schemes such as GANs, the performance of video restoration algorithms has greatly improved recently. Meanwhile, the loss functions for optimizing deep neural networks remain relatively unchanged. To this end, we propose a new framework for building effective loss functions by learning a discriminative space specific to a video restoration task. Our framework is similar to GANs in that we iteratively train two networks - a generator and a loss network. The generator learns to restore videos in a supervised fashion, by following ground truth features through the feature matching in the discriminative space learned by the loss network. In addition, we also introduce a new relation loss in order to maintain the temporal consistency in output videos. Experiments on video superresolution and deblurring show that our method generates visually more pleasing videos with better quantitative perceptual metric values than the other state-of-the-art methods. △ Less

Submitted 20 March, 2020; originally announced March 2020.

Comments: 24 pages

arXiv:1910.02027 [pdf, other]

Unsupervised Keypoint Learning for Guiding Class-Conditional Video Prediction

Authors: Yunji Kim, Seonghyeon Nam, In Cho, Seon Joo Kim

Abstract: We propose a deep video prediction model conditioned on a single image and an action class. To generate future frames, we first detect keypoints of a moving object and predict future motion as a sequence of keypoints. The input image is then translated following the predicted keypoints sequence to compose future frames. Detecting the keypoints is central to our algorithm, and our method is trained… ▽ More We propose a deep video prediction model conditioned on a single image and an action class. To generate future frames, we first detect keypoints of a moving object and predict future motion as a sequence of keypoints. The input image is then translated following the predicted keypoints sequence to compose future frames. Detecting the keypoints is central to our algorithm, and our method is trained to detect the keypoints of arbitrary objects in an unsupervised manner. Moreover, the detected keypoints of the original videos are used as pseudo-labels to learn the motion of objects. Experimental results show that our method is successfully applied to various datasets without the cost of labeling keypoints in videos. The detected keypoints are similar to human-annotated labels, and prediction results are more realistic compared to the previous methods. △ Less

Submitted 4 October, 2019; originally announced October 2019.

Comments: NeurIPS 2019

arXiv:1908.11587 [pdf, other]

Copy-and-Paste Networks for Deep Video Inpainting

Authors: Sungho Lee, Seoung Wug Oh, DaeYeun Won, Seon Joo Kim

Abstract: We present a novel deep learning based algorithm for video inpainting. Video inpainting is a process of completing corrupted or missing regions in videos. Video inpainting has additional challenges compared to image inpainting due to the extra temporal information as well as the need for maintaining the temporal coherency. We propose a novel DNN-based framework called the Copy-and-Paste Networks f… ▽ More We present a novel deep learning based algorithm for video inpainting. Video inpainting is a process of completing corrupted or missing regions in videos. Video inpainting has additional challenges compared to image inpainting due to the extra temporal information as well as the need for maintaining the temporal coherency. We propose a novel DNN-based framework called the Copy-and-Paste Networks for video inpainting that takes advantage of additional information in other frames of the video. The network is trained to copy corresponding contents in reference frames and paste them to fill the holes in the target frame. Our network also includes an alignment network that computes affine matrices between frames for the alignment, enabling the network to take information from more distant frames for robustness. Our method produces visually pleasing and temporally coherent results while running faster than the state-of-the-art optimization-based method. In addition, we extend our framework for enhancing over/under exposed frames in videos. Using this enhancement technique, we were able to significantly improve the lane detection accuracy on road videos. △ Less

Submitted 30 August, 2019; originally announced August 2019.

Comments: ICCV 2019

arXiv:1908.08718 [pdf, other]

Onion-Peel Networks for Deep Video Completion

Authors: Seoung Wug Oh, Sungho Lee, Joon-Young Lee, Seon Joo Kim

Abstract: We propose the onion-peel networks for video completion. Given a set of reference images and a target image with holes, our network fills the hole by referring the contents in the reference images. Our onion-peel network progressively fills the hole from the hole boundary enabling it to exploit richer contextual information for the missing regions every step. Given a sufficient number of recurrenc… ▽ More We propose the onion-peel networks for video completion. Given a set of reference images and a target image with holes, our network fills the hole by referring the contents in the reference images. Our onion-peel network progressively fills the hole from the hole boundary enabling it to exploit richer contextual information for the missing regions every step. Given a sufficient number of recurrences, even a large hole can be inpainted successfully. To attend to the missing information visible in the reference images, we propose an asymmetric attention block that computes similarities between the hole boundary pixels in the target and the non-hole pixels in the references in a non-local manner. With our attention block, our network can have an unlimited spatial-temporal window size and fill the holes with globally coherent contents. In addition, our framework is applicable to the image completion guided by the reference images without any modification, which is difficult to do with the previous methods. We validate that our method produces visually pleasing image and video inpainting results in realistic test cases. △ Less

Submitted 23 August, 2019; originally announced August 2019.

Comments: ICCV 2019

arXiv:1904.09791 [pdf, other]

Fast User-Guided Video Object Segmentation by Interaction-and-Propagation Networks

Authors: Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim

Abstract: We present a deep learning method for the interactive video object segmentation. Our method is built upon two core operations, interaction and propagation, and each operation is conducted by Convolutional Neural Networks. The two networks are connected both internally and externally so that the networks are trained jointly and interact with each other to solve the complex video object segmentation… ▽ More We present a deep learning method for the interactive video object segmentation. Our method is built upon two core operations, interaction and propagation, and each operation is conducted by Convolutional Neural Networks. The two networks are connected both internally and externally so that the networks are trained jointly and interact with each other to solve the complex video object segmentation problem. We propose a new multi-round training scheme for the interactive video object segmentation so that the networks can learn how to understand the user's intention and update incorrect estimations during the training. At the testing time, our method produces high-quality results and also runs fast enough to work with users interactively. We evaluated the proposed method quantitatively on the interactive track benchmark at the DAVIS Challenge 2018. We outperformed other competing methods by a significant margin in both the speed and the accuracy. We also demonstrated that our method works well with real user interactions. △ Less

Submitted 2 May, 2019; v1 submitted 22 April, 2019; originally announced April 2019.

Comments: CVPR 2019

arXiv:1904.00680 [pdf, other]

End-to-End Time-Lapse Video Synthesis from a Single Outdoor Image

Authors: Seonghyeon Nam, Chongyang Ma, Menglei Chai, William Brendel, Ning Xu, Seon Joo Kim

Abstract: Time-lapse videos usually contain visually appealing content but are often difficult and costly to create. In this paper, we present an end-to-end solution to synthesize a time-lapse video from a single outdoor image using deep neural networks. Our key idea is to train a conditional generative adversarial network based on existing datasets of time-lapse videos and image sequences. We propose a mul… ▽ More Time-lapse videos usually contain visually appealing content but are often difficult and costly to create. In this paper, we present an end-to-end solution to synthesize a time-lapse video from a single outdoor image using deep neural networks. Our key idea is to train a conditional generative adversarial network based on existing datasets of time-lapse videos and image sequences. We propose a multi-frame joint conditional generation framework to effectively learn the correlation between the illumination change of an outdoor scene and the time of the day. We further present a multi-domain training scheme for robust training of our generative models from two datasets with different distributions and missing timestamp labels. Compared to alternative time-lapse video synthesis algorithms, our method uses the timestamp as the control variable and does not require a reference video to guide the synthesis of the final output. We conduct ablation studies to validate our algorithm and compare with state-of-the-art techniques both qualitatively and quantitatively. △ Less

Submitted 1 April, 2019; originally announced April 2019.

Comments: To appear in CVPR 2019

arXiv:1904.00607 [pdf, other]

Video Object Segmentation using Space-Time Memory Networks

Authors: Seoung Wug Oh, Joon-Young Lee, Ning Xu, Seon Joo Kim

Abstract: We propose a novel solution for semi-supervised video object segmentation. By the nature of the problem, available cues (e.g. video frame(s) with object masks) become richer with the intermediate predictions. However, the existing methods are unable to fully exploit this rich source of information. We resolve the issue by leveraging memory networks and learn to read relevant information from all a… ▽ More We propose a novel solution for semi-supervised video object segmentation. By the nature of the problem, available cues (e.g. video frame(s) with object masks) become richer with the intermediate predictions. However, the existing methods are unable to fully exploit this rich source of information. We resolve the issue by leveraging memory networks and learn to read relevant information from all available sources. In our framework, the past frames with object masks form an external memory, and the current frame as the query is segmented using the mask information in the memory. Specifically, the query and the memory are densely matched in the feature space, covering all the space-time pixel locations in a feed-forward fashion. Contrast to the previous approaches, the abundant use of the guidance information allows us to better handle the challenges such as appearance changes and occlussions. We validate our method on the latest benchmark sets and achieved the state-of-the-art performance (overall score of 79.4 on Youtube-VOS val set, J of 88.7 and 79.2 on DAVIS 2016/2017 val set respectively) while having a fast runtime (0.16 second/frame on DAVIS 2016 val set). △ Less

Submitted 12 August, 2019; v1 submitted 1 April, 2019; originally announced April 2019.

Comments: ICCV 2019

arXiv:1810.11919 [pdf, other]

Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language

Authors: Seonghyeon Nam, Yunji Kim, Seon Joo Kim

Abstract: This paper addresses the problem of manipulating images using natural language description. Our task aims to semantically modify visual attributes of an object in an image according to the text describing the new visual appearance. Although existing methods synthesize images having new attributes, they do not fully preserve text-irrelevant contents of the original image. In this paper, we propose… ▽ More This paper addresses the problem of manipulating images using natural language description. Our task aims to semantically modify visual attributes of an object in an image according to the text describing the new visual appearance. Although existing methods synthesize images having new attributes, they do not fully preserve text-irrelevant contents of the original image. In this paper, we propose the text-adaptive generative adversarial network (TAGAN) to generate semantically manipulated images while preserving text-irrelevant contents. The key to our method is the text-adaptive discriminator that creates word-level local discriminators according to input text to classify fine-grained attributes independently. With this discriminator, the generator learns to generate images where only regions that correspond to the given text are modified. Experimental results show that our method outperforms existing methods on CUB and Oxford-102 datasets, and our results were mostly preferred on a user study. Extensive analysis shows that our method is able to effectively disentangle visual attributes and produce pleasing outputs. △ Less

Submitted 27 November, 2018; v1 submitted 28 October, 2018; originally announced October 2018.

Comments: NeurIPS 2018

arXiv:1804.02379 [pdf, other]

EPINET: A Fully-Convolutional Neural Network Using Epipolar Geometry for Depth from Light Field Images

Authors: Changha Shin, Hae-Gon Jeon, Young** Yoon, In So Kweon, Seon Joo Kim

Abstract: Light field cameras capture both the spatial and the angular properties of light rays in space. Due to its property, one can compute the depth from light fields in uncontrolled lighting environments, which is a big advantage over active sensing devices. Depth computed from light fields can be used for many applications including 3D modelling and refocusing. However, light field images from hand-he… ▽ More Light field cameras capture both the spatial and the angular properties of light rays in space. Due to its property, one can compute the depth from light fields in uncontrolled lighting environments, which is a big advantage over active sensing devices. Depth computed from light fields can be used for many applications including 3D modelling and refocusing. However, light field images from hand-held cameras have very narrow baselines with noise, making the depth estimation difficult. any approaches have been proposed to overcome these limitations for the light field depth estimation, but there is a clear trade-off between the accuracy and the speed in these methods. In this paper, we introduce a fast and accurate light field depth estimation method based on a fully-convolutional neural network. Our network is designed by considering the light field geometry and we also overcome the lack of training data by proposing light field specific data augmentation methods. We achieved the top rank in the HCI 4D Light Field Benchmark on most metrics, and we also demonstrate the effectiveness of the proposed method on real-world light-field images. △ Less

Submitted 6 April, 2018; originally announced April 2018.

Comments: Accepted to CVPR 2018, Total 10 pages

arXiv:1707.08350 [pdf, other]

Modelling the Scene Dependent Imaging in Cameras with a Deep Neural Network

Authors: Seonghyeon Nam, Seon Joo Kim

Abstract: We present a novel deep learning framework that models the scene dependent image processing inside cameras. Often called as the radiometric calibration, the process of recovering RAW images from processed images (JPEG format in the sRGB color space) is essential for many computer vision tasks that rely on physically accurate radiance values. All previous works rely on the deterministic imaging mod… ▽ More We present a novel deep learning framework that models the scene dependent image processing inside cameras. Often called as the radiometric calibration, the process of recovering RAW images from processed images (JPEG format in the sRGB color space) is essential for many computer vision tasks that rely on physically accurate radiance values. All previous works rely on the deterministic imaging model where the color transformation stays the same regardless of the scene and thus they can only be applied for images taken under the manual mode. In this paper, we propose a data-driven approach to learn the scene dependent and locally varying image processing inside cameras under the automode. Our method incorporates both the global and the local scene context into pixel-wise features via multi-scale pyramid of learnable histogram layers. The results show that we can model the imaging pipeline of different cameras that operate under the automode accurately in both directions (from RAW to sRGB, from sRGB to RAW) and we show how we can apply our method to improve the performance of image deblurring. △ Less

Submitted 26 July, 2017; originally announced July 2017.

Comments: To appear in ICCV 2017

arXiv:1706.08260 [pdf, other]

Deep Semantics-Aware Photo Adjustment

Authors: Seonghyeon Nam, Seon Joo Kim

Abstract: Automatic photo adjustment is to mimic the photo retouching style of professional photographers and automatically adjust photos to the learned style. There have been many attempts to model the tone and the color adjustment globally with low-level color statistics. Also, spatially varying photo adjustment methods have been studied by exploiting high-level features and semantic label maps. Those met… ▽ More Automatic photo adjustment is to mimic the photo retouching style of professional photographers and automatically adjust photos to the learned style. There have been many attempts to model the tone and the color adjustment globally with low-level color statistics. Also, spatially varying photo adjustment methods have been studied by exploiting high-level features and semantic label maps. Those methods are semantics-aware since the color map** is dependent on the high-level semantic context. However, their performance is limited to the pre-computed hand-crafted features and it is hard to reflect user's preference to the adjustment. In this paper, we propose a deep neural network that models the semantics-aware photo adjustment. The proposed network exploits bilinear models that are the multiplicative interaction of the color and the contexual features. As the contextual features we propose the semantic adjustment map, which discovers the inherent photo retouching presets that are applied according to the scene context. The proposed method is trained using a robust loss with a scene parsing task. The experimental results show that the proposed method outperforms the existing method both quantitatively and qualitatively. The proposed method also provides users a way to retouch the photo by their own likings by giving customized adjustment maps. △ Less

Submitted 26 June, 2017; originally announced June 2017.

arXiv:1705.07543 [pdf, other]

Building Emotional Machines: Recognizing Image Emotions through Deep Neural Networks

Authors: Hye-Rin Kim, Yeong-Seok Kim, Seon Joo Kim, In-Kwon Lee

Abstract: An image is a very effective tool for conveying emotions. Many researchers have investigated in computing the image emotions by using various features extracted from images. In this paper, we focus on two high level features, the object and the background, and assume that the semantic information of images is a good cue for predicting emotion. An object is one of the most important elements that d… ▽ More An image is a very effective tool for conveying emotions. Many researchers have investigated in computing the image emotions by using various features extracted from images. In this paper, we focus on two high level features, the object and the background, and assume that the semantic information of images is a good cue for predicting emotion. An object is one of the most important elements that define an image, and we find out through experiments that there is a high correlation between the object and the emotion in images. Even with the same object, there may be slight difference in emotion due to different backgrounds, and we use the semantic information of the background to improve the prediction performance. By combining the different levels of features, we build an emotion based feed forward deep neural network which produces the emotion values of a given image. The output emotion values in our framework are continuous values in the 2-dimensional space (Valence and Arousal), which are more effective than using a few number of emotion categories in describing emotions. Experiments confirm the effectiveness of our network in predicting the emotion of images. △ Less

Submitted 3 July, 2017; v1 submitted 21 May, 2017; originally announced May 2017.

Comments: 11 pages, 15 figures

arXiv:1608.07951 [pdf, other]

doi 10.1016/j.patcog.2016.08.013

Approaching the Computational Color Constancy as a Classification Problem through Deep Learning

Authors: Seoung Wug Oh, Seon Joo Kim

Abstract: Computational color constancy refers to the problem of computing the illuminant color so that the images of a scene under varying illumination can be normalized to an image under the canonical illumination. In this paper, we adopt a deep learning framework for the illumination estimation problem. The proposed method works under the assumption of uniform illumination over the scene and aims for the… ▽ More Computational color constancy refers to the problem of computing the illuminant color so that the images of a scene under varying illumination can be normalized to an image under the canonical illumination. In this paper, we adopt a deep learning framework for the illumination estimation problem. The proposed method works under the assumption of uniform illumination over the scene and aims for the accurate illuminant color computation. Specifically, we trained the convolutional neural network to solve the problem by casting the color constancy problem as an illumination classification problem. We designed the deep learning architecture so that the output of the network can be directly used for computing the color of the illumination. Experimental results show that our deep network is able to extract useful features for the illumination estimation and our method outperforms all previous color constancy methods on multiple test datasets. △ Less

Submitted 29 August, 2016; originally announced August 2016.

Comments: This is a preprint of an article accepted for publication in Pattern Recognition, ELSEVIER

Journal ref: Pattern Recognition, Volume 61, January 2017, Pages 405 to 416

arXiv:1502.06105 [pdf, ps, other]

doi 10.1109/ACCESS.2016.2551727

Regularization and Kernelization of the Maximin Correlation Approach

Authors: Taehoon Lee, Taesup Moon, Seung Jean Kim, Sungroh Yoon

Abstract: Robust classification becomes challenging when each class consists of multiple subclasses. Examples include multi-font optical character recognition and automated protein function prediction. In correlation-based nearest-neighbor classification, the maximin correlation approach (MCA) provides the worst-case optimal solution by minimizing the maximum misclassification risk through an iterative proc… ▽ More Robust classification becomes challenging when each class consists of multiple subclasses. Examples include multi-font optical character recognition and automated protein function prediction. In correlation-based nearest-neighbor classification, the maximin correlation approach (MCA) provides the worst-case optimal solution by minimizing the maximum misclassification risk through an iterative procedure. Despite the optimality, the original MCA has drawbacks that have limited its wide applicability in practice. That is, the MCA tends to be sensitive to outliers, cannot effectively handle nonlinearities in datasets, and suffers from having high computational complexity. To address these limitations, we propose an improved solution, named regularized maximin correlation approach (R-MCA). We first reformulate MCA as a quadratically constrained linear programming (QCLP) problem, incorporate regularization by introducing slack variables in the primal problem of the QCLP, and derive the corresponding Lagrangian dual. The dual formulation enables us to apply the kernel trick to R-MCA so that it can better handle nonlinearities. Our experimental results demonstrate that the regularization and kernelization make the proposed R-MCA more robust and accurate for various classification tasks than the original MCA. Furthermore, when the data size or dimensionality grows, R-MCA runs substantially faster by solving either the primal or dual (whichever has a smaller variable dimension) of the QCLP. △ Less

Submitted 29 March, 2016; v1 submitted 21 February, 2015; originally announced February 2015.

Comments: Submitted to IEEE Access

arXiv:1002.0123 [pdf, ps, other]

Achievable rate regions and outer bounds for a multi-pair bi-directional relay network

Authors: Sang Joon Kim, Besma Smida, Natasha Devroye

Abstract: In a bi-directional relay channel, a pair of nodes wish to exchange independent messages over a shared wireless half-duplex channel with the help of relays. Recent work has mostly considered information theoretic limits of the bi-directional relay channel with two terminal nodes (or end users) and one relay. In this work we consider bi-directional relaying with one base station, multiple termina… ▽ More In a bi-directional relay channel, a pair of nodes wish to exchange independent messages over a shared wireless half-duplex channel with the help of relays. Recent work has mostly considered information theoretic limits of the bi-directional relay channel with two terminal nodes (or end users) and one relay. In this work we consider bi-directional relaying with one base station, multiple terminal nodes and one relay, all of which operate in half-duplex modes. We assume that each terminal node communicates with the base-station in a bi-directional fashion through the relay and do not place any restrictions on the channels between the users, relays and base-stations; that is, each node has a direct link with every other node. Our contributions are three-fold: 1) the introduction of four new temporal protocols which fully exploit the two-way nature of the data and outperform simple routing or multi-hop communication schemes by carefully combining network coding, random binning and user cooperation which exploit over-heard and own-message side information, 2) derivations of inner and outer bounds on the capacity region of the discrete-memoryless multi-pair two-way network, and 3) a numerical evaluation of the obtained achievable rate regions and outer bounds in Gaussian noise which illustrate the performance of the proposed protocols compared to simpler schemes, to each other, to the outer bounds, which highlight the relative gains achieved by network coding, random binning and compress-and-forward-type cooperation between terminal nodes. △ Less

Submitted 31 January, 2010; originally announced February 2010.

Comments: 61 pages, 12 figures, will be submitted to IEEE info theory

arXiv:0810.1268 [pdf, ps, other]

Bi-directional half-duplex protocols with multiple relays

Authors: Sang Joon Kim, Natasha Devroye, Vahid Tarokh

Abstract: In a bi-directional relay channel, two nodes wish to exchange independent messages over a shared wireless half-duplex channel with the help of relays. Recent work has considered information theoretic limits of the bi-directional relay channel with a single relay. In this work we consider bi-directional relaying with multiple relays. We derive achievable rate regions and outer bounds for half-duple… ▽ More In a bi-directional relay channel, two nodes wish to exchange independent messages over a shared wireless half-duplex channel with the help of relays. Recent work has considered information theoretic limits of the bi-directional relay channel with a single relay. In this work we consider bi-directional relaying with multiple relays. We derive achievable rate regions and outer bounds for half-duplex protocols with multiple decode and forward relays and compare these to the same protocols with amplify and forward relays in an additive white Gaussian noise channel. We consider three novel classes of half-duplex protocols: the (m,2) 2 phase protocol with m relays, the (m,3) 3 phase protocol with m relays, and general (m, t) Multiple Hops and Multiple Relays (MHMR) protocols, where m is the total number of relays and 3<t< m+3 is the number of temporal phases in the protocol. The (m,2) and (m,3) protocols extend previous bi-directional relaying protocols for a single m=1 relay, while the new (m,t) protocol efficiently combines multi-hop routing with message-level network coding. Finally, we provide a comprehensive treatment of the MHMR protocols with decode and forward relaying and amplify and forward relaying in the Gaussian noise, obtaining their respective achievable rate regions, outer bounds and relative performance under different SNRs and relay geometries, including an analytical comparison on the protocols at low and high SNR. △ Less

Submitted 9 September, 2010; v1 submitted 7 October, 2008; originally announced October 2008.

Comments: 44 pages, 17 figures, Submitted to IEEE Transactions on Information Theory

Showing 1–50 of 52 results for author: Kim, S J