Showing 1–16 of 16 results for author: Morgado, P

  1. arXiv:2403.05659  [pdf, other]

    cs.CV

    Audio-Synchronized Visual Animation

    Authors: Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado

    Abstract: Current visual generation methods can produce high quality videos guided by texts. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. We introduce Audio Synchronized Visual Animation (ASVA), a task animating a static image to demonstrate motion dynamics, temporally guided by audio clips acros…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: 15 pages

  2. arXiv:2312.01017  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD

    Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

    Authors: Shentong Mo, Pedro Morgado

    Abstract: Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challe…

    Submitted 1 December, 2023; originally announced December 2023.

  3. arXiv:2307.11978  [pdf, other]

    cs.CV cs.AI cs.LG

    Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

    Authors: Cheng-En Wu, Yu Tian, Haichao Yu, Heng Wang, Pedro Morgado, Yu Hen Hu, Linjie Yang

    Abstract: Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted e…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV2023

  4. arXiv:2305.19458  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

    Authors: Shentong Mo, Pedro Morgado

    Abstract: The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as t…

    Submitted 30 May, 2023; originally announced May 2023.

  5. arXiv:2209.13583  [pdf, other]

    cs.CV

    Learning State-Aware Visual Representations from Audible Interactions

    Authors: Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta

    Abstract: We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. In result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. Fi…

    Submitted 27 September, 2022; originally announced September 2022.

    Comments: NeurIPS 2022. Code available at https://github.com/HimangiM/RepLAI

  6. arXiv:2209.09634  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM

    A Closer Look at Weakly-Supervised Audio-Visual Source Localization

    Authors: Shentong Mo, Pedro Morgado

    Abstract: Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding objects can be costly, a plethora of weakly-supervised localization methods that can learn from datasets with no bounding-box annotations have been proposed in recent years, by leveraging the natural co-occurrence of audi…

    Submitted 30 August, 2022; originally announced September 2022.

  7. arXiv:2203.12710  [pdf, other]

    cs.CV cs.LG

    The Challenges of Continuous Self-Supervised Learning

    Authors: Senthil Purushwalkam, Pedro Morgado, Abhinav Gupta

    Abstract: Self-supervised learning (SSL) aims to eliminate one of the major bottlenecks in representation learning - the need for human annotations. As a result, SSL holds the promise to learn representations from data in-the-wild, i.e., without the need for finite and static datasets. Instead, true SSL algorithms should be able to exploit the continuous stream of data being generated on the internet or by…

    Submitted 28 March, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

  8. arXiv:2203.09324  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Localizing Visual Sounds the Easy Way

    Authors: Shentong Mo, Pedro Morgado

    Abstract: Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging witho…

    Submitted 29 March, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

  9. arXiv:2103.15916  [pdf, other]

    cs.CV

    Robust Audio-Visual Instance Discrimination

    Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos

    Abstract: We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences…

    Submitted 29 March, 2021; originally announced March 2021.

  10. arXiv:2011.01819  [pdf, other]

    cs.CV

    Learning Representations from Audio-Visual Spatial Alignment

    Authors: Pedro Morgado, Yi Li, Nuno Vasconcelos

    Abstract: We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) furthe…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: To appear at Advances in Neural Information Processing Systems (NeurIPS), 2020

  11. arXiv:2007.13912  [pdf, other]

    cs.CV

    Deep Hashing with Hash-Consistent Large Margin Proxy Embeddings

    Authors: Pedro Morgado, Yunsheng Li, Jose Costa Pereira, Mohammad Saberian, Nuno Vasconcelos

    Abstract: Image hash codes are produced by binarizing the embeddings of convolutional neural networks (CNN) trained for either classification or retrieval. While proxy embeddings achieve good performance on both tasks, they are non-trivial to binarize, due to a rotational ambiguity that encourages non-binary embeddings. The use of a fixed set of proxies (weights of the CNN classification layer) is proposed…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: Accepted at International Journal of Computer Vision

  12. arXiv:2007.09898  [pdf, other]

    cs.CV

    Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier

    Authors: Tz-Ying Wu, Pedro Morgado, Pei Wang, Chih-Hui Ho, Nuno Vasconcelos

    Abstract: Long-tail recognition tackles the natural non-uniformly distributed data in real-world scenarios. While modern classifiers perform well on populated classes, its performance degrades significantly on tail classes. Humans, however, are less affected by this since, when confronted with uncertain examples, they simply opt to provide coarser predictions. Motivated by this, a deep realistic taxonomic c…

    Submitted 20 July, 2020; originally announced July 2020.

    Comments: Accepted to ECCV 2020

  13. arXiv:2004.12943  [pdf, other]

    cs.CV

    Audio-Visual Instance Discrimination with Cross-Modal Agreement

    Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra

    Abstract: We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice-versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerf…

    Submitted 29 March, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

  14. arXiv:1907.00274  [pdf, other]

    cs.CV cs.LG

    NetTailor: Tuning the Architecture, Not Just the Weights

    Authors: Pedro Morgado, Nuno Vasconcelos

    Abstract: Real-world applications of object recognition often require the solution of multiple tasks in a single platform. Under the standard paradigm of network fine-tuning, an entirely new CNN is learned per task, and the final network size is independent of task complexity. This is wasteful, since simple tasks require smaller networks than more complex tasks, and limits the number of tasks that can be so…

    Submitted 29 June, 2019; originally announced July 2019.

    Journal ref: CVF/IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  15. arXiv:1809.02587  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    Self-Supervised Generation of Spatial Audio for 360 Video

    Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang

    Abstract: We introduce an approach to convert mono audio recorded by a 360 video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere. Spatial audio is an important component of immersive 360 video viewing, but spatial audio microphones are still rare in current 360 video production. Our system consists of end-to-end trainable neural networks that separate in…

    Submitted 7 September, 2018; originally announced September 2018.

    Comments: To appear in NIPS 2018

  16. arXiv:1704.03039  [pdf, other]

    cs.CV cs.AI cs.LG

    Semantically Consistent Regularization for Zero-Shot Recognition

    Authors: Pedro Morgado, Nuno Vasconcelos

    Abstract: The role of semantics in zero-shot learning is considered. The effectiveness of previous approaches is analyzed according to the form of supervision provided. While some learn semantics independently, others only supervise the semantic subspace explained by training classes. Thus, the former is able to constrain the whole space but lacks the ability to model semantic correlations. The latter addre…

    Submitted 10 April, 2017; originally announced April 2017.

    Comments: Accepted to CVPR 2017