
Showing 1–20 of 20 results for author: Seo, P H

  1. arXiv:2404.03924  [pdf, other]

    cs.CV

    Learning Correlation Structures for Vision Transformers

    Authors: Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

    Abstract: We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages ri…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  2. arXiv:2303.17811  [pdf, other]

    cs.CV cs.AI cs.CL

    Zero-shot Referring Image Segmentation with Global-Local Context Features

    Authors: Seonghoon Yu, Paul Hongsuck Seo, Jeany Son

    Abstract: Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP.…

    Submitted 3 April, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  3. arXiv:2303.16501  [pdf, other]

    cs.CV cs.SD eess.AS

    AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only mode…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  4. arXiv:2303.14396  [pdf, other]

    cs.CV cs.AI cs.LG

    IFSeg: Image-free Semantic Segmentation via Vision-Language Model

    Authors: Sukmin Yun, Seong Hyeon Park, Paul Hongsuck Seo, Jinwoo Shin

    Abstract: Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer) across various visual tasks. However, VL-driven segmentation has been under-explored, and the existing approaches still have the burden of acquiring additional training images or even segmentation annotations to adapt a VL model to downstrea…

    Submitted 25 March, 2023; originally announced March 2023.

    Comments: Accepted to CVPR 2023

  5. arXiv:2303.11797  [pdf, other]

    cs.CV

    CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

    Authors: Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim

    Abstract: Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. Through aggregating the cosine similarity score, i.e., the cost volume between image and text…

    Submitted 31 March, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: Accepted to CVPR 2024. Project page: https://ku-cvlab.github.io/CAT-Seg/

  6. arXiv:2302.14115  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

    Authors: Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

    Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, w…

    Submitted 21 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: CVPR 2023 Camera-Ready; Project Webpage: https://antoyang.github.io/vid2seq.html ; 18 pages; 6 figures

  7. arXiv:2211.09966  [pdf, ps, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    AVATAR submission to the Ego4D AV Transcription Challenge

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming t…

    Submitted 17 November, 2022; originally announced November 2022.

  8. arXiv:2206.07684  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    AVATAR: Unconstrained Audiovisual Speech Recognition

    Authors: Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

    Abstract: Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible…

    Submitted 15 June, 2022; originally announced June 2022.

  9. arXiv:2204.00679  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Learning Audio-Video Modalities from Image Captions

    Authors: Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

    Abstract: A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image-captioning, where datasets are in the order of millions of samples. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new l…

    Submitted 1 April, 2022; originally announced April 2022.

  10. arXiv:2201.08264  [pdf, other]

    cs.CV cs.AI cs.CL cs.HC

    End-to-end Generative Pretraining for Multimodal Video Captioning

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid

    Abstract: Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video…

    Submitted 10 May, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Journal ref: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR) 2022

  11. arXiv:2012.05710  [pdf, other]

    cs.CV cs.HC

    Look Before you Speak: Visually Contextualized Utterances

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations. Unfortunately, a major challenge for incorporating visual context into conversational dialogue is the lack of large-scale labeled datasets. We provide a solution in the form of a new visually conditioned Future Utterance Pred…

    Submitted 28 March, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

  12. arXiv:1911.09753  [pdf, other]

    cs.CV cs.CL

    Reinforcing an Image Caption Generator Using Off-Line Human Feedback

    Authors: Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut

    Abstract: Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption rati…

    Submitted 21 November, 2019; originally announced November 2019.

    Comments: AAAI 2020

  13. arXiv:1910.01467  [pdf, other]

    cs.LG cs.CV

    Regularizing Neural Networks via Stochastic Branch Layers

    Authors: Wonpyo Park, Paul Hongsuck Seo, Bohyung Han, Minsu Cho

    Abstract: We introduce a novel stochastic regularization technique for deep neural networks, which decomposes a layer into multiple branches with different parameters and merges stochastically sampled combinations of the outputs from the branches during training. Since the factorized branches can collapse into a single branch through a linear operation, inference requires no additional complexity compared t…

    Submitted 3 October, 2019; originally announced October 2019.

    Comments: ACML 2019 (oral)

  14. arXiv:1809.10877  [pdf, other]

    cs.LG stat.ML

    Learning for Single-Shot Confidence Calibration in Deep Neural Networks through Stochastic Inferences

    Authors: Seonguk Seo, Paul Hongsuck Seo, Bohyung Han

    Abstract: We propose a generic framework to calibrate accuracy and confidence of a prediction in deep neural networks through stochastic inferences. We interpret stochastic regularization using a Bayesian model, and analyze the relation between predictive uncertainty of networks and variance of the prediction scores obtained by stochastic inferences for a single example. Our empirical study shows that the a…

    Submitted 24 April, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

  15. arXiv:1808.02130  [pdf, other]

    cs.CV

    CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps

    Authors: Paul Hongsuck Seo, Tobias Weyand, Jack Sim, Bohyung Han

    Abstract: Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information. This task is inherently challenging since many photos have only a few, possibly ambiguous, cues to their geolocation. Recent work has cast this task as a classification problem by partitioning the earth into a set of discrete cells that correspond to geographic regions. The granular…

    Submitted 6 August, 2018; originally announced August 2018.

    Comments: ECCV 2018 accepted paper

  16. arXiv:1808.02128  [pdf, other]

    cs.CV

    Attentive Semantic Alignment with Offset-Aware Correlation Kernels

    Authors: Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, Minsu Cho

    Abstract: Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class. One recent approach to this problem is to estimate parameters of a global transformation model that densely aligns one image to the other. Since an entire correlation map between all feature pairs across images is typically used to predict such…

    Submitted 26 October, 2018; v1 submitted 6 August, 2018; originally announced August 2018.

    Comments: ECCV 2018 accepted paper

  17. arXiv:1709.07992  [pdf, other]

    cs.CV

    Visual Reference Resolution using Attention Memory for Visual Dialog

    Authors: Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal

    Abstract: Visual dialog is the task of answering a series of inter-dependent questions given an input image, and often requires resolving visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual attenti…

    Submitted 6 August, 2018; v1 submitted 22 September, 2017; originally announced September 2017.

  18. arXiv:1612.01669  [pdf, other]

    cs.CV

    MarioQA: Answering Questions by Watching Gameplay Videos

    Authors: Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han

    Abstract: We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed automatically from gameplay videos. Our work is motivated by the fact that existing models are often tested only on datasets that require excessively high-level reasoning or mostly contain instances accessible through single frame inference…

    Submitted 13 August, 2017; v1 submitted 6 December, 2016; originally announced December 2016.

  19. arXiv:1606.02393  [pdf, other]

    cs.CV

    Progressive Attention Networks for Visual Attribute Prediction

    Authors: Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, Bohyung Han

    Abstract: We propose a novel attention model that can accurately attend to target objects of various scales and shapes in images. The model is trained to gradually suppress irrelevant regions in an input image via a progressive attentive process over multiple layers of a convolutional neural network. The attentive process in each layer determines whether to pass or block features at certain spatial locatio…

    Submitted 6 August, 2018; v1 submitted 8 June, 2016; originally announced June 2016.

    Comments: BMVC 2018 accepted paper

  20. arXiv:1511.05756  [pdf, other]

    cs.CV cs.CL cs.LG

    Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction

    Authors: Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han

    Abstract: We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of a gated recurrent unit (GRU) taking a question as its input and a fully-connected layer generating a…

    Submitted 18 November, 2015; originally announced November 2015.