-
Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA
Authors:
Jongwoo Park,
Kanchana Ranasinghe,
Kumara Kahatapitiya,
Wonjeong Ryoo,
Donghyun Kim,
Michael S. Ryoo
Abstract:
Long-form videos that span wide temporal intervals are highly information-redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and mostly redundant. Questioning these choices, we explore optimal strategies for key-frame selection and sequence-aware captioning that can significantly reduce these redundancies. We propose two novel approaches that improve each of these aspects, namely Hierarchical Keyframe Selector and Sequential Visual LLM. Our resulting framework, termed LVNet, achieves state-of-the-art performance across three benchmark LVQA datasets. Our code will be released publicly.
Submitted 17 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion
Authors:
Yu** Jeong,
Wonjeong Ryoo,
Seunghyun Lee,
Dabin Seo,
Wonmin Byeon,
Sangpil Kim,
**kyu Kim
Abstract:
In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, audio-to-video generation has received little consideration, even though audio contains unique qualities like temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to incorporate audio input that includes both changeable temporal semantics and magnitude. To generate video frames, TPoS utilizes a latent stable diffusion model with textual semantic information, which is then guided by the sequential audio embedding from our pretrained Audio Encoder. As a result, this method produces audio-reactive video content. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation. More examples are available at https://ku-vai.github.io/TPoS/
Submitted 8 September, 2023;
originally announced September 2023.
-
Event Fusion Photometric Stereo Network
Authors:
Wonjeong Ryoo,
Giljoo Nam,
Jae-Sang Hyun,
Sangpil Kim
Abstract:
We present a novel method to estimate the surface normal of an object in an ambient light environment using RGB and event cameras. Modern photometric stereo methods rely on an RGB camera, mainly in a dark room, to avoid ambient illumination. To alleviate the limitations of the darkroom environment and to use essential light information, we employ an event camera with a high dynamic range and low latency. This is the first study that uses an event camera for the photometric stereo task, and it works with continuous light sources in ambient light environments. In this work, we also curate a novel photometric stereo dataset constructed by capturing objects with event and RGB cameras under numerous ambient light environments. Additionally, we propose a novel framework named Event Fusion Photometric Stereo Network (EFPS-Net), which estimates the surface normals of an object using both RGB frames and event signals. Our proposed method interpolates event observation maps, which generate light information from sparse event signals, to acquire fluent light information. Subsequently, the event-interpolated observation maps are fused with the RGB observation maps. Our numerous experiments showed that EFPS-Net outperforms state-of-the-art methods on a dataset captured in the real world where ambient light exists. Consequently, we demonstrate that incorporating additional modalities with EFPS-Net alleviates the limitations arising from ambient illumination.
Submitted 11 March, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Sound-Guided Semantic Video Generation
Authors:
Seung Hyun Lee,
Gyeongrok Oh,
Wonmin Byeon,
Chanyoung Kim,
Won Jeong Ryoo,
Sang Ho Yoon,
Hyunjun Cho,
Jihyun Bae,
**kyu Kim,
Sangpil Kim
Abstract:
The recent success of StyleGAN demonstrates that the pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful due to the difficulty of determining the direction and magnitude in the StyleGAN latent space. In this paper, we propose a framework to generate realistic videos by leveraging a multimodal (sound-image-text) embedding space. As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide the audio-visual relationships. Finally, the proposed frame generator learns to find the trajectory in the latent space that is coherent with the corresponding sound and generates a video in a hierarchical manner. We provide a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task. The experiments show that our model outperforms the state-of-the-art methods in terms of video quality. We further show several applications, including image and video editing, to verify the effectiveness of our method.
Submitted 21 October, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Asymptotic formation and orbital stability of phase-locked states in Kuramoto--Lohe type synchronization models on Lie groups
Authors:
Sang Woo Ryoo
Abstract:
It is known that the Kuramoto model has a critical coupling strength above which phase-locked states exist, and, by the work of Choi, Ha, Jung, and Kim (2012), that these phase-locked states are orbitally stable. This property of admitting orbitally stable phase-locked states is preserved under the nonabelian generalizations of the Kuramoto model pioneered by Lohe (2009). We provide a framework for understanding these phenomena, by formulating the aforementioned models as special cases of a general ODE model describing populations of particles on Lie groups with pairwise attractive Kuramoto-type interactions. This general model can be used in producing many specific models with orbitally stable nonabelian phase-locked states.
Submitted 10 June, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Constants of motion for the finite-dimensional Lohe type models with frustration and applications to emergent dynamics
Authors:
Seung-Yeal Ha,
Dohyun Kim,
Hansol Park,
Sang Woo Ryoo
Abstract:
We present constants of motion for the finite-dimensional Lohe type aggregation models with frustration and we apply them to analyze the emergence of collective behaviors. The Lohe type models have been proposed as possible non-abelian and higher-dimensional generalizations of the Kuramoto model, which is a prototype phase model for synchronization. The aim of this paper is to study the emergent collective dynamics of these models under the effect of (interaction) frustration, which generalizes phase-shift frustrations in the Kuramoto model. To this end, we present constants of motion, i.e., conserved quantities along the flow generated by the models under consideration, and, from the perspective of the low-dimensional dynamics thus obtained, derive several results concerning the emergent asymptotic patterns of the Kuramoto and Lohe sphere models.
Submitted 3 November, 2020; v1 submitted 1 November, 2020;
originally announced November 2020.
-
Asymptotic phase-locking dynamics and critical coupling strength for the Kuramoto model
Authors:
Seung-Yeal Ha,
Sang Woo Ryoo
Abstract:
We study the asymptotic clustering (phase-locking) dynamics for the Kuramoto model. For the analysis of emergent asymptotic patterns in the Kuramoto flow, we introduce the pathwise critical coupling strength which yields a sharp transition from partial phase-locking to complete phase-locking, and provide nontrivial upper bounds for the pathwise critical coupling strength. Numerical simulations suggest that multi- and mono-clusters can emerge asymptotically in the Kuramoto flow depending on the relative magnitude of the coupling strength compared to the sizes of natural frequencies. However, the rigorous theoretical analysis of such phase-locking dynamics of the Kuramoto flow is still incomplete, although there has been some recent progress on the complete synchronization of the Kuramoto model in a sufficiently large coupling strength regime. In this paper, we present sufficient frameworks for partial phase-locking of a majority ensemble and for complete phase-locking in terms of the initial phase configuration, coupling strength and natural frequencies. As a by-product of our analysis, we obtain nontrivial upper bounds for the pathwise critical coupling strength in terms of the diameter of natural frequencies, initial Kuramoto order parameter and the system size $N$. We also show that phase-locked states whose order parameters are less than $N^{-\frac{1}{2}}$ are linearly unstable.
Submitted 10 April, 2020;
originally announced April 2020.
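For readers unfamiliar with the quantities named in the abstract above, the following is a minimal sketch, not taken from the paper, of the standard Kuramoto model $\dot\theta_i = \omega_i + \frac{K}{N}\sum_j \sin(\theta_j - \theta_i)$ and its order parameter $R = \left|\frac{1}{N}\sum_j e^{i\theta_j}\right|$. The function names, the RK4 integrator, the step size, and the example frequencies are all illustrative assumptions.

```python
import numpy as np

def kuramoto_rhs(theta, omega, K):
    """Right-hand side of the standard Kuramoto model (all-to-all coupling)."""
    diff = theta[None, :] - theta[:, None]  # diff[i, j] = theta_j - theta_i
    return omega + (K / theta.size) * np.sin(diff).sum(axis=1)

def order_parameter(theta):
    """Kuramoto order parameter R in [0, 1]; R = 1 means identical phases."""
    return float(np.abs(np.exp(1j * np.asarray(theta, dtype=float)).mean()))

def simulate(theta0, omega, K, dt=0.01, steps=5000):
    """Integrate the model with a fixed-step RK4 scheme (an illustrative choice)."""
    theta = np.asarray(theta0, dtype=float).copy()
    omega = np.asarray(omega, dtype=float)
    for _ in range(steps):
        k1 = kuramoto_rhs(theta, omega, K)
        k2 = kuramoto_rhs(theta + 0.5 * dt * k1, omega, K)
        k3 = kuramoto_rhs(theta + 0.5 * dt * k2, omega, K)
        k4 = kuramoto_rhs(theta + dt * k3, omega, K)
        theta = theta + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return theta

if __name__ == "__main__":
    # Hypothetical example: three oscillators, initial phases inside a half circle.
    omega = np.array([-0.1, 0.0, 0.1])
    theta_final = simulate([0.0, 0.5, 1.0], omega, K=2.0)
    print("order parameter:", order_parameter(theta_final))
    # In a phase-locked state all oscillators rotate at a common frequency.
    print("frequency spread:", np.ptp(kuramoto_rhs(theta_final, omega, 2.0)))
```

In this example the coupling is chosen well above the frequency spread, so the frequency spread printed at the end should be close to zero, which is what phase-locking means; lowering `K` toward zero is a simple way to observe the loss of locking numerically.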