
Showing 1–20 of 20 results for author: Gan, C

Searching in archive eess.
  1. arXiv:2405.20336

    cs.CV cs.SD eess.AS

    RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

    Authors: Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan

    Abstract: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic bo…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Project website: https://vis-www.cs.umass.edu/RapVerse

  2. arXiv:2403.08580

    cs.CV cs.MM eess.IV

    Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification

    Authors: Yuxing Han, Yunan Ding, Chen Ye Gan, Jiangtao Wen

    Abstract: Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially when an immense volume of video content is being constantly generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these meth…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: 5 pages, 5 figures, 1 table. arXiv admin note: substantial text overlap with arXiv:2309.07361

  3. arXiv:2309.04265

    eess.AS

    Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification

    Authors: Chong-Xin Gan, Man-Wai Mak, Weiwei Lin, Jen-Tzung Chien

    Abstract: Contrastive self-supervised learning (CSL) for speaker verification (SV) has drawn increasing interest recently due to its ability to exploit unlabeled data. Performing data augmentation on raw waveforms, such as adding noise or reverberation, plays a pivotal role in achieving promising results in SV. Data augmentation, however, demands meticulous calibration to ensure intact speaker-specific info…

    Submitted 11 March, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: 5 pages, 2 figures, accepted by ICASSP 2024

  4. arXiv:2306.00148

    cs.LG cs.RO eess.SY

    SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

    Authors: Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, Daniela Rus

    Abstract: Diffusion model-based approaches have shown promise in data-driven planning, but there are no safety guarantees, thus making it hard to be applied for safety-critical applications. To address these challenges, we propose a new method, called SafeDiffuser, to ensure diffusion probabilistic models satisfy specifications by using a class of control barrier functions. The key idea of our approach is t…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: 19 pages, website: https://safediffuser.github.io/safediffuser/

  5. arXiv:2303.16897

    cs.CV cs.LG cs.SD eess.AS

    Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

    Authors: Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan

    Abstract: Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely availab…

    Submitted 8 July, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR 2023. Project page: https://sukun1045.github.io/video-physics-sound-diffusion/

  6. arXiv:2210.04763

    cs.LG cs.AI cs.RO eess.SY

    On the Forward Invariance of Neural ODEs

    Authors: Wei Xiao, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Yutong Ban, Chuang Gan, Daniela Rus

    Abstract: We propose a new method to ensure neural ordinary differential equations (ODEs) satisfy output specifications by using invariance set propagation. Our approach uses a class of control barrier functions to transform output specifications into constraints on the parameters and inputs of the learning system. This setup allows us to achieve output specification guarantees simply by changing the constr…

    Submitted 31 May, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: 25 pages, accepted in ICML2023, website: https://weixy21.github.io/invariance/

  7. arXiv:2207.03483

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Finding Fallen Objects Via Asynchronous Audio-Visual Integration

    Authors: Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba

    Abstract: The way an object looks and sounds provide complementary reflections of its physical properties. In many settings cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped som…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: CVPR 2022. Project page: http://fallen-object.csail.mit.edu

  8. arXiv:2204.00628

    cs.SD cs.CV cs.LG cs.RO eess.AS

    Learning Neural Acoustic Fields

    Authors: Andrew Luo, Yilun Du, Michael J. Tarr, Joshua B. Tenenbaum, Antonio Torralba, Chuang Gan

    Abstract: Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations as much as appearance inform us of the sanctuary's wide open space. Similarly, as an object moves around us, we expect the sound emitted to also exhibit this movement. While recent advances in learned implicit functions have led to increasingly higher quality representations of t…

    Submitted 14 January, 2023; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2022. Project page: https://www.andrew.cmu.edu/user/afluo/Neural_Acoustic_Fields/

  9. arXiv:2106.08519

    eess.AS cs.LG cs.SD

    Global Rhythm Style Transfer Without Text Transcriptions

    Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

    Abstract: Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony betwe…

    Submitted 15 June, 2021; originally announced June 2021.

  10. arXiv:2008.00820

    cs.CV cs.SD eess.AS

    Generating Visually Aligned Sound from Videos

    Authors: Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan

    Abstract: We focus on the task of generating sound from natural videos, and the sound should be both temporally and content-wise aligned with visual signals. This task is extremely challenging because some sounds generated outside a camera cannot be inferred from video content. The model may be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this c…

    Submitted 14 July, 2020; originally announced August 2020.

    Comments: Published in IEEE Transactions on Image Processing, 2020. Code, pre-trained models and demo video: https://github.com/PeihaoChen/regnet

  11. arXiv:2007.13729

    cs.CV cs.AI cs.LG cs.RO cs.SD eess.AS

    Noisy Agents: Self-supervised Exploration by Predicting Auditory Events

    Authors: Chuang Gan, Xiaoyu Chen, Phillip Isola, Antonio Torralba, Joshua B. Tenenbaum

    Abstract: Humans integrate multiple sensory modalities (e.g. visual and audio) to build a causal understanding of the physical world. In this work, we propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions through auditory event prediction. First, we allow the agent to collect a small amount of acoustic data and u…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: Project page: http://noisy-agent.csail.mit.edu

  12. arXiv:2007.10984

    cs.CV cs.LG cs.SD eess.AS

    Foley Music: Learning to Generate Music from Videos

    Authors: Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

    Abstract: In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation probl…

    Submitted 21 July, 2020; originally announced July 2020.

    Comments: ECCV 2020. Project page: http://foley-music.csail.mit.edu

  13. arXiv:2004.09476

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Music Gesture for Visual Sound Separation

    Authors: Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba

    Abstract: Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical flow like motion feature representations, which exhibit limited abilities to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same types, such as multiple viol…

    Submitted 20 April, 2020; originally announced April 2020.

    Comments: CVPR 2020. Project page: http://music-gesture.csail.mit.edu

  14. arXiv:1912.11684

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

    Authors: Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum

    Abstract: A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of Audio-Visual Embodied Navigation, the task of planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given…

    Submitted 7 March, 2020; v1 submitted 25 December, 2019; originally announced December 2019.

    Comments: Accepted by ICRA 2020. Project page: http://avn.csail.mit.edu

  15. arXiv:1910.11760

    cs.CV cs.LG cs.SD eess.AS

    Self-supervised Moving Vehicle Tracking with Stereo Sound

    Authors: Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba

    Abstract: Humans are able to localize objects in the environment using both visual and auditory cues, integrating information from multiple modalities into a common reference frame. We introduce a system that can leverage unlabeled audio-visual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time. Since it is labor-intensive to manually…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: To appear at ICCV 2019. Project page: http://sound-track.csail.mit.edu

  16. arXiv:1910.00932

    cs.CV cs.LG eess.IV

    Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos

    Authors: Ji Lin, Chuang Gan, Song Han

    Abstract: Deep video recognition is more computationally expensive than image recognition, especially on large-scale datasets like Kinetics [1]. Therefore, training scalability is essential to handle a large amount of videos. In this paper, we study the factors that impact the training scalability of video networks. We recognize three bottlenecks, including data loading (data movement from disk to GPU), com…

    Submitted 7 December, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

  17. arXiv:1904.09013

    cs.CV cs.SD eess.AS eess.IV

    Self-Supervised Audio-Visual Co-Segmentation

    Authors: Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

    Abstract: Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image…

    Submitted 18 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019

  18. arXiv:1904.05979

    cs.CV cs.SD eess.AS

    The Sound of Motions

    Authors: Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba

    Abstract: Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the task of sound localization and separation. Our system is composed of an end-to-end learnable model called Deep Dense Trajectory (DDT), and a curri…

    Submitted 11 April, 2019; originally announced April 2019.

  19. arXiv:1804.03160

    cs.CV cs.SD eess.AS

    The Sound of Pixels

    Authors: Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba

    Abstract: We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiri…

    Submitted 13 October, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

  20. arXiv:1607.05448

    eess.SY

    Exponentially Stabilizing Continuous-Time Controllers for multi-domain hybrid systems with application to 3D bipedal walking

    Authors: Chunbiao Gan, Haihui Yuan, Shixi Yang, Yimin Ge

    Abstract: This paper presents a systematic approach to exponentially stabilize the periodic orbits of multi-domain hybrid systems arising from 3D bipedal walking. Firstly, the method of Poincare sections is extended to the hybrid systems with multiple domains. Then, based on the properties of the Poincare maps, a continuous piecewise feedback control strategy is presented, and three methods are furthermore…

    Submitted 19 July, 2016; originally announced July 2016.

    Comments: submitted to IEEE Transactions on Automatic Control