Showing 1–22 of 22 results for author: Byeon, W

  1. arXiv:2406.07887  [pdf, other]

    cs.LG cs.CL

    An Empirical Study of Mamba-based Language Models

    Authors: Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a contr…

    Submitted 12 June, 2024; originally announced June 2024.

  2. arXiv:2406.07826  [pdf, other]

    cs.LG cs.AI

    The Max-Min Formulation of Multi-Objective Reinforcement Learning: From Theory to a Model-Free Algorithm

    Authors: Giseung Park, Woohyeon Byeon, Seongmin Kim, Elad Havakuk, Amir Leshem, Youngchul Sung

    Abstract: In this paper, we consider multi-objective reinforcement learning, which arises in many real-world problems with multiple optimization goals. We approach the problem with a max-min framework focusing on fairness among the multiple goals and develop a relevant theory and a practical model-free algorithm under the max-min framework. The developed theory provides a theoretical advance in multi-object…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ICML 2024

  3. arXiv:2403.02330  [pdf, other]

    cs.CV

    RegionGPT: Towards Region Understanding Vision Language Model

    Authors: Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu

    Abstract: Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder, and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (short…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  4. arXiv:2312.04086  [pdf, other]

    cs.CV

    MTVG : Multi-text Video Generation with Text-to-Video Models

    Authors: Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, Sangpil Kim

    Abstract: Recently, video generation has attracted massive attention and yielded noticeable outcomes. Concerning the characteristics of video, multi-text conditioning incorporating sequential events is necessary for next-step video generation. In this work, we propose a novel multi-text video generation (MTVG) by directly utilizing a pre-trained diffusion-based text-to-video (T2V) generation model without a…

    Submitted 7 December, 2023; originally announced December 2023.

  5. arXiv:2310.19694  [pdf, other]

    cs.LG

    Convolutional State Space Models for Long-Range Spatiotemporal Modeling

    Authors: Jimmy T. H. Smith, Shalini De Mello, Jan Kautz, Scott W. Linderman, Wonmin Byeon

    Abstract: Effectively modeling long spatiotemporal sequences is challenging due to the need to model complex spatial correlations and long-range temporal dependencies simultaneously. ConvLSTMs attempt to address this by updating tensor-valued states with recurrent neural networks, but their sequential computation makes them slow to train. In contrast, Transformers can process an entire spatiotemporal sequen…

    Submitted 30 October, 2023; originally announced October 2023.

  6. arXiv:2309.04509  [pdf, other]

    cs.SD cs.CV cs.GR eess.AS

    The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

    Authors: Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon, Sangpil Kim, Jinkyu Kim

    Abstract: In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, there is little consideration in audio-to-video generation, though audio contains unique qualities like temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to incorporate audio input that includes both changeable temporal semantics and magnitude. To…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: ICCV2023

  7. arXiv:2306.08593  [pdf, other]

    cs.CV cs.LG

    Heterogeneous Continual Learning

    Authors: Divyam Madaan, Hongxu Yin, Wonmin Byeon, Jan Kautz, Pavlo Molchanov

    Abstract: We propose a novel framework and a solution to tackle the continual learning (CL) problem with changing network architectures. Most CL methods focus on adapting a single architecture to a new task/class by modifying its weights. However, with rapid progress in architecture design, the problem of adapting existing solutions to novel architectures becomes relevant. To address this limitation, we pro…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted to CVPR 2023

  8. arXiv:2303.04803  [pdf, other]

    cs.CV

    Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

    Authors: Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello

    Abstract: We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is…

    Submitted 5 April, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

    Comments: CVPR 2023 Highlight. Project page and code: https://jerryxu.net/ODISE

  9. arXiv:2211.11381  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    LISA: Localized Image Stylization with Audio via Implicit Neural Representation

    Authors: Seung Hyun Lee, Chanyoung Kim, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

    Abstract: We present a novel framework, Localized Image Stylization with Audio (LISA) which performs audio-driven localized image stylization. Sound often provides information about the specific context of the scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a pa…

    Submitted 21 November, 2022; originally announced November 2022.

  10. arXiv:2208.14114  [pdf, other]

    cs.CV

    Robust Sound-Guided Image Manipulation

    Authors: Seung Hyun Lee, Gyeongrok Oh, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

    Abstract: Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day is manipulated into the same scene on a rainy day driven by a text input "raining". These approaches often utilize a StyleCLIP-based image generator, which leverages multi-modal (text and image) embedding space. However, we observe that such text inputs are often bottlenecked in provi…

    Submitted 24 April, 2023; v1 submitted 30 August, 2022; originally announced August 2022.

    Comments: arXiv admin note: text overlap with arXiv:2112.00007

  11. arXiv:2204.09273  [pdf, other]

    cs.CV cs.AI

    Sound-Guided Semantic Video Generation

    Authors: Seung Hyun Lee, Gyeongrok Oh, Wonmin Byeon, Chanyoung Kim, Won Jeong Ryoo, Sang Ho Yoon, Hyunjun Cho, Jihyun Bae, Jinkyu Kim, Sangpil Kim

    Abstract: The recent success in StyleGAN demonstrates that pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful due to the difficulty of determining the direction and magnitude in the StyleGAN latent space. In this paper, we propose a framework to generate realistic videos by leveraging multimodal (sound…

    Submitted 21 October, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  12. arXiv:2202.12358  [pdf, other]

    physics.comp-ph cs.AI

    Physics Informed RNN-DCT Networks for Time-Dependent Partial Differential Equations

    Authors: Benjamin Wu, Oliver Hennigh, Jan Kautz, Sanjay Choudhry, Wonmin Byeon

    Abstract: Physics-informed neural networks allow models to be trained by physical laws described by general nonlinear partial differential equations. However, traditional architectures struggle to solve more challenging time-dependent problems due to their architectural nature. In this work, we present a novel physics-informed framework for solving time-dependent partial differential equations. Using only t…

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: Benjamin Wu and Wonmin Byeon contributed equally to this work. 12 pages, 3 figures

  13. arXiv:2202.11094  [pdf, other]

    cs.CV

    GroupViT: Semantic Segmentation Emerges from Text Supervision

    Authors: Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang

    Abstract: Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semant…

    Submitted 18 July, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

    Comments: CVPR 2022. Project page and code: https://jerryxu.net/GroupViT

  14. arXiv:2112.00007  [pdf, other]

    cs.GR cs.CV cs.LG cs.SD eess.AS

    Sound-Guided Semantic Image Manipulation

    Authors: Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim

    Abstract: The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with other sources rather than text, such as sound, is not easy due to the dynamic characteristics of the sources. Especially, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a fra…

    Submitted 30 November, 2021; originally announced December 2021.

  15. arXiv:2106.09121  [pdf, other]

    cs.LG cs.CV math.NA

    Scaling-up Diverse Orthogonal Convolutional Networks with a Paraunitary Framework

    Authors: Jiahao Su, Wonmin Byeon, Furong Huang

    Abstract: Enforcing orthogonality in neural networks is an antidote for gradient vanishing/exploding problems, sensitivity by adversarial perturbation, and bounding generalization errors. However, many previous approaches are heuristic, and the orthogonality of convolutional layers is not systematically studied: some of these designs are not exactly orthogonal, while others only consider standard convolutio…

    Submitted 16 June, 2021; originally announced June 2021.

  16. arXiv:2105.09803  [pdf, other]

    cs.CV

    Weakly-Supervised Physically Unconstrained Gaze Estimation

    Authors: Rakshit Kothari, Shalini De Mello, Umar Iqbal, Wonmin Byeon, Seonwook Park, Jan Kautz

    Abstract: A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervise…

    Submitted 20 May, 2021; originally announced May 2021.

    Comments: CVPR 2021 (Oral)

  17. arXiv:2012.07938  [pdf, other]

    physics.flu-dyn cs.LG

    NVIDIA SimNet^{TM}: an AI-accelerated multi-physics simulation framework

    Authors: Oliver Hennigh, Susheela Narasimhan, Mohammad Amin Nabian, Akshay Subramaniam, Kaustubh Tangsali, Max Rietmann, Jose del Aguila Ferrandis, Wonmin Byeon, Zhiwei Fang, Sanjay Choudhry

    Abstract: We present SimNet, an AI-driven multi-physics simulation framework, to accelerate simulations across a wide range of disciplines in science and engineering. Compared to traditional numerical solvers, SimNet addresses a wide range of use cases - coupled forward simulations without any training data, inverse and data assimilation problems. SimNet offers fast turnaround time by enabling parameterized…

    Submitted 14 December, 2020; originally announced December 2020.

  18. arXiv:2012.00899  [pdf, other]

    cs.CV

    Displacement-Invariant Cost Computation for Efficient Stereo Matching

    Authors: Yiran Zhong, Charles Loop, Wonmin Byeon, Stan Birchfield, Yuchao Dai, Kaihao Zhang, Alexey Kamenev, Thomas Breuel, Hongdong Li, Jan Kautz

    Abstract: Although deep learning-based methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy, their inference time is typically slow, on the order of seconds for a pair of 540p images. The main reason is that the leading methods employ time-consuming 3D convolutions applied to a 4D feature volume. A common way to speed up the computation is to downsample the featur…

    Submitted 1 December, 2020; originally announced December 2020.

    Comments: 8 pages

  19. arXiv:2002.09131  [pdf, other]

    cs.LG cs.CV stat.ML

    Convolutional Tensor-Train LSTM for Spatio-temporal Learning

    Authors: Jiahao Su, Wonmin Byeon, Jean Kossaifi, Furong Huang, Jan Kautz, Animashree Anandkumar

    Abstract: Learning from spatio-temporal data has numerous applications such as human-behavior analysis, object tracking, video compression, and physics simulation. However, existing methods still perform poorly on challenging video tasks such as long-term forecasting. This is because these kinds of challenging tasks require learning long-term spatio-temporal correlations in the video sequence. In this paper,…

    Submitted 4 October, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

    Comments: Jiahao Su and Wonmin Byeon contributed equally to this work. 22 pages, 14 figures, NeurIPS 2020

  20. arXiv:1802.07486  [pdf, other]

    physics.comp-ph cs.LG nlin.CD

    Data-Driven Forecasting of High-Dimensional Chaotic Systems with Long Short-Term Memory Networks

    Authors: Pantelis R. Vlachas, Wonmin Byeon, Zhong Y. Wan, Themistoklis P. Sapsis, Petros Koumoutsakos

    Abstract: We introduce a data-driven forecasting method for high-dimensional chaotic systems using long short-term memory (LSTM) recurrent neural networks. The proposed LSTM neural networks perform inference of high-dimensional dynamical systems in their reduced order space and are shown to be an effective set of nonlinear approximators of their attractor. We demonstrate the forecasting performance of the L…

    Submitted 19 September, 2019; v1 submitted 21 February, 2018; originally announced February 2018.

    Comments: 31 pages

  21. arXiv:1710.08518  [pdf, other]

    cs.CV

    ContextVP: Fully Context-Aware Video Prediction

    Authors: Wonmin Byeon, Qin Wang, Rupesh Kumar Srivastava, Petros Koumoutsakos

    Abstract: Video prediction models based on convolutional networks, recurrent networks, and their combinations often result in blurry predictions. We identify an important contributing factor for imprecise predictions that has not been studied adequately in the literature: blind spots, i.e., lack of access to all relevant past information for accurately predicting the future. To address this issue, we introd…

    Submitted 9 September, 2018; v1 submitted 23 October, 2017; originally announced October 2017.

    Comments: 19 pages. ECCV 2018 oral presentation. Project webpage is at https://wonmin-byeon.github.io/publication/2018-eccv

  22. arXiv:1506.07452  [pdf, other]

    cs.CV cs.LG

    Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation

    Authors: Marijn F. Stollenga, Wonmin Byeon, Marcus Liwicki, Juergen Schmidhuber

    Abstract: Convolutional Neural Networks (CNNs) can be shifted across 2D images or 3D videos to segment them. They have a fixed input size and typically perceive only small local contexts of the pixels to be classified as foreground or background. In contrast, Multi-Dimensional Recurrent NNs (MD-RNNs) can perceive the entire spatio-temporal context of each pixel in a few sweeps through all pixels, especially…

    Submitted 24 June, 2015; originally announced June 2015.

    Comments: Marijn F. Stollenga and Wonmin Byeon are the shared first authors, both authors contributed equally to this work