Showing 1–50 of 54 results for author: Kuehne, H

  1. arXiv:2406.10082

    eess.AS cs.CV cs.SD

    Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

    Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

    Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data differe…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo

  2. arXiv:2404.03214

    cs.CV

    LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

    Authors: Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne

    Abstract: Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of Vi…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Code available at https://github.com/WalBouss/LeGrad

  3. arXiv:2402.08324

    cs.LG cs.AI

    Uncertainty Quantification via Stable Distribution Propagation

    Authors: Felix Petersen, Aashwin Mishra, Hilde Kuehne, Christian Borgelt, Oliver Deussen, Mikhail Yurochkin

    Abstract: We propose a new approach for propagating stable probability distributions through neural networks. Our method is based on local linearization, which we show to be an optimal approximation in terms of total variation distance for the ReLU non-linearity. This allows propagating Gaussian and Cauchy input uncertainties through neural networks to quantify their output uncertainties. To demonstrate the…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: Published at ICLR 2024, Code @ https://github.com/Felix-Petersen/distprop

  4. arXiv:2312.15289

    cs.CV cs.LG eess.IV

    Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation

    Authors: Lokesh Veeramacheneni, Moritz Wolter, Hildegard Kuehne, Juergen Gall

    Abstract: Modern metrics for generative learning like Fréchet Inception Distance (FID) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on Wavelet Packet Transform ($W_p$). FWD provides a sight across a broad spectru…

    Submitted 10 June, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

  5. arXiv:2312.00878

    cs.CV cs.AI

    Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

    Authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne

    Abstract: Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (…

    Submitted 14 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: Code available at https://github.com/WalBouss/GEM

  6. arXiv:2311.06231

    cs.CV

    Learning Human Action Recognition Representations Without Real Humans

    Authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva, Rogerio Feris

    Abstract: Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to a…

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: 19 pages, 7 figures, 2023 NeurIPS Datasets and Benchmarks Track

  7. arXiv:2310.04900

    cs.CV

    HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

    Authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

    Abstract: Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: https://github.com/ninatu/howtocaption

  8. arXiv:2309.08928

    cs.CV

    In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

    Authors: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne

    Abstract: Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setti…

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: Published at ICCV 2023, code: https://github.com/ninatu/in_style

  9. arXiv:2308.13077

    cs.CV

    Preserving Modality Structure Improves Multi-Modal Learning

    Authors: Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah

    Abstract: Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic struct…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  10. arXiv:2306.15521

    cs.CV

    What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation

    Authors: Benedikt Blumenstiel, Johannes Jakubik, Hilde Kühne, Michael Vössing

    Abstract: While semantic segmentation has seen tremendous improvements in the past, there are still significant labeling efforts necessary and the problem of limited generalization to classes that have not been present during training. To address this problem, zero-shot semantic segmentation makes use of large self-supervised vision-language models, allowing zero-shot transfer to unseen classes. In this wor…

    Submitted 16 December, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  11. arXiv:2305.12606

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  12. arXiv:2305.00604

    cs.LG cs.CV math.OC stat.ML

    ISAAC Newton: Input-based Approximate Curvature for Newton's Method

    Authors: Felix Petersen, Tobias Sutter, Christian Borgelt, Dongsung Huh, Hilde Kuehne, Yuekai Sun, Oliver Deussen

    Abstract: We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based on only the input to a respective layer without a substantial computational over…

    Submitted 30 April, 2023; originally announced May 2023.

    Comments: Published at ICLR 2023, Code @ https://github.com/Felix-Petersen/isaac, Video @ https://youtu.be/7RKRX-MdwqM

  13. arXiv:2304.08682

    cs.CV

    Learning Situation Hyper-Graphs for Video Question Answering

    Authors: Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah

    Abstract: Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs and has been proposed to capture all such information in a…

    Submitted 6 May, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

  14. arXiv:2304.05088

    cs.CV cs.HC

    WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition

    Authors: Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, Michael Moeller

    Abstract: Though research has shown the complementarity of camera- and inertial-based data, datasets which offer both egocentric video and inertial-based sensor data remain scarce. In this paper, we introduce WEAR, an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). The dataset comprises data from 18 participants performing a total of 18 different workout activiti…

    Submitted 21 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

    Comments: 15 pages, 3 figures, 2 tables

  15. arXiv:2303.16990

    cs.CV

    What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

    Authors: Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

    Abstract: Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video an…

    Submitted 28 May, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: To be presented at CVPR 2024. Project page: https://brian7685.github.io/STG/

  16. arXiv:2303.13664

    cs.CV cs.LG

    Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data

    Authors: Anna Kukleva, Moritz Böhle, Bernt Schiele, Hilde Kuehne, Christian Rupprecht

    Abstract: Most approaches for self-supervised learning (SSL) are optimised on curated balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on long-tail data. In particular, we investigate the role of the temperature parameter $τ$ in the contr…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: ICLR 2023

  17. arXiv:2303.08914

    cs.CV

    MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

    Authors: Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof

    Abstract: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best ze…

    Submitted 22 July, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023

  18. arXiv:2303.05166

    cs.CV

    TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering

    Authors: Wei Lin, Anna Kukleva, Horst Possegger, Hilde Kuehne, Horst Bischof

    Abstract: Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: Computer Vision Winter Workshop 2023

  19. arXiv:2301.02009

    cs.CV

    Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

    Authors: Nina Shvetsova, Felix Petersen, Anna Kukleva, Bernt Schiele, Hilde Kuehne

    Abstract: Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images. This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints…

    Submitted 18 August, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Published at ICCV 2023, Code @ https://github.com/ninatu/learning_by_sorting

  20. arXiv:2211.15393

    cs.CV

    Video Test-Time Adaptation for Action Recognition

    Authors: Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof

    Abstract: Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal mode…

    Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: Accepted at CVPR 2023

  21. arXiv:2210.08277

    cs.LG

    Deep Differentiable Logic Gate Networks

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: Recently, research has increasingly focused on developing efficient neural network architectures. In this work, we explore logic gate networks for machine learning tasks by learning combinations of logic gates. These networks comprise logic gates such as "AND" and "XOR", which allow for very fast execution. The difficulty in learning logic gate networks is that they are conventionally non-differen…

    Submitted 15 October, 2022; originally announced October 2022.

    Comments: Published at NeurIPS 2022

  22. arXiv:2210.07839

    cs.MM cs.CV cs.SD eess.AS

    Contrastive Audio-Visual Masked Autoencoder

    Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

    Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments…

    Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

    Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

  23. arXiv:2210.03625

    cs.CL cs.CV cs.MM

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

    Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in differen…

    Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

  24. arXiv:2209.06103

    cs.CV cs.AI cs.CL

    VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

    Authors: Felix Vogel, Nina Shvetsova, Leonid Karlinsky, Hilde Kuehne

    Abstract: Vision-language models trained on large, randomly collected data had significant impact in many areas since they appeared. But as they show great performance in various fields, such as image-text-retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start from the analysis of the training corpus assessing to wh…

    Submitted 12 September, 2022; originally announced September 2022.

  25. arXiv:2208.01956

    cs.CV

    Augmentation Learning for Semi-Supervised Classification

    Authors: Tim Frommknecht, Pedro Alves Zipf, Quanfu Fan, Nina Shvetsova, Hilde Kuehne

    Abstract: Recently, a number of new Semi-Supervised Learning methods have emerged. As the accuracy for ImageNet and similar datasets increased over time, the performance on tasks beyond the classification of natural images is yet to be explored. Most Semi-Supervised Learning methods rely on a carefully manually designed data augmentation pipeline that is not transferable for learning on images of other doma…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted to GCPR 2022, 13 pages with 4 figures

  26. arXiv:2207.02334

    cs.CV

    Weakly Supervised Grounding for VQA in Vision-Language Transformers

    Authors: Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo, Mubarak Shah

    Abstract: Transformers for visual-language representation learning have been getting a lot of interest and shown tremendous performance on visual question answering (VQA) and grounding. But most systems that show good performance of those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this li…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: To appear at ECCV 2022

  27. arXiv:2206.07290

    cs.LG cs.CV

    Differentiable Top-k Classification Learning

    Authors: Felix Petersen, Hilde Kuehne, Christian Borgelt, Oliver Deussen

    Abstract: The top-k classification accuracy is one of the core metrics in machine learning. Here, k is conventionally a positive integer, such as 1 or 5, leading to top-1 or top-5 training objectives. In this work, we relax this assumption and optimize the model for multiple k simultaneously instead of using a single k. Leveraging recent advances in differentiable sorting and ranking, we propose a different…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Published at ICML 2022, Code @ https://github.com/Felix-Petersen/difftopk

  28. arXiv:2203.16244

    cs.CV

    CycDA: Unsupervised Cycle Domain Adaptation from Image to Video

    Authors: Wei Lin, Anna Kukleva, Kunyang Sun, Horst Possegger, Hilde Kuehne, Horst Bischof

    Abstract: Although action recognition has achieved impressive results over recent years, both collection and annotation of video training data are still time-consuming and cost intensive. Therefore, image-to-video adaptation has been proposed to exploit labeling-free web image source for adapting on unlabeled target videos. This poses two major challenges: (1) spatial domain shift between web images and vid…

    Submitted 22 March, 2023; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted at ECCV2022. Supplementary included

  29. arXiv:2203.09630

    cs.LG cs.AI cs.IR stat.ML

    Monotonic Differentiable Sorting Networks

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: Differentiable sorting algorithms allow training with sorting and ranking supervision, where only the ordering or ranking of samples is known. Various methods have been proposed to address this challenge, ranging from optimal transport-based differentiable Sinkhorn sorting algorithms to making classic sorting networks differentiable. One problem of current differentiable sorting methods is that th…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Published at ICLR 2022, Code @ https://github.com/Felix-Petersen/diffsort, Video @ https://www.youtube.com/watch?v=Rl-sFaE1z4M

  30. arXiv:2112.04446

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,…

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  31. arXiv:2112.02300

    cs.CV

    Unsupervised Domain Generalization by Learning a Bridge Across Domains

    Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

    Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generaliz…

    Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

  32. arXiv:2112.00775

    cs.CV

    Routing with Self-Attention for Multimodal Capsule Networks

    Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

    Abstract: The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in context of cap…

    Submitted 1 December, 2021; originally announced December 2021.

  33. arXiv:2111.04823

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera…

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  34. arXiv:2110.10784

    cs.CV cs.LG

    Style Agnostic 3D Reconstruction via Adversarial Style Transfer

    Authors: Felix Petersen, Bastian Goldluecke, Oliver Deussen, Hilde Kuehne

    Abstract: Reconstructing the 3D geometry of an object from an image is a major challenge in computer vision. Recently introduced differentiable renderers can be leveraged to learn the 3D geometry of objects from 2D images, but those approaches require additional supervision to enable the renderer to produce an output that can be compared to the input image. This can be scene information or constraints such…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: To be published at WACV 2022, Code @ https://github.com/Felix-Petersen/style-agnostic-3d-reconstruction

  35. arXiv:2110.05651

    cs.LG stat.ML

    Learning with Algorithmic Supervision via Continuous Relaxations

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: The integration of algorithmic components into neural architectures has gained increased attention recently, as it allows training neural networks with new forms of supervision such as ordering constraints or silhouettes instead of using ground truth labels. Many approaches in the field focus on the continuous relaxation of a specific task and show promising results in this context. But the focus…

    Submitted 25 October, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Published at NeurIPS 2021, Code @ https://github.com/Felix-Petersen/algovision, Video @ https://www.youtube.com/watch?v=01ENzpkjOCE

  36. arXiv:2108.08165

    cs.CV

    Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting

    Authors: Anna Kukleva, Hilde Kuehne, Bernt Schiele

    Abstract: Both generalized and incremental few-shot learning have to deal with three major challenges: learning novel classes from only few samples per class, preventing catastrophic forgetting of base classes, and classifier calibration across novel and base classes. In this work we propose a three-stage framework that allows to explicitly and effectively address these challenges. While the first phase lea…

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: ICCV 2021

  37. arXiv:2105.04836

    cs.CV

    Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

    Authors: Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah

    Abstract: The problem of grounding VQA tasks has seen an increased attention in the research community recently, with most attempts usually focusing on solving this task by using pretrained object detectors. However, pre-trained object detectors require bounding box annotations for detecting relevant objects in the vocabulary, which may not always be feasible for real-life large-scale applications. In this…

    Submitted 11 May, 2021; originally announced May 2021.

    Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  38. arXiv:2105.04019

    cs.LG cs.IR

    Differentiable Sorting Networks for Scalable Sorting and Ranking Supervision

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: Sorting and ranking supervision is a method for training neural networks end-to-end based on ordering constraints. That is, the ground truth order of sets of samples is known, while their absolute values remain unsupervised. For that, we propose differentiable sorting networks by relaxing their pairwise conditional swap operations. To address the problems of vanishing gradients and extensive blurr…

    Submitted 14 July, 2021; v1 submitted 9 May, 2021; originally announced May 2021.

    Comments: Published at ICML 2021, Code @ https://github.com/Felix-Petersen/diffsort, Video @ https://www.youtube.com/watch?v=38dvqdYEs1o

    Journal ref: PMLR 139:8546-8555, 2021

  39. arXiv:2105.00067

    cs.CV

    Unsupervised Discriminative Embedding for Sub-Action Learning in Complex Activities

    Authors: Sirnam Swetha, Hilde Kuehne, Yogesh S Rawat, Mubarak Shah

    Abstract: Action recognition and detection in the context of long untrimmed video sequences has seen an increased attention from the research community. However, annotation of complex activities is usually time consuming and challenging in practice. Therefore, recent works started to tackle the problem of unsupervised learning of sub-actions in complex activities. This paper proposes a novel approach for un…

    Submitted 30 April, 2021; originally announced May 2021.

  40. arXiv:2104.12671

    cs.CV

    Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

    Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

    Abstract: Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalitie…

    Submitted 3 September, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: To be presented at ICCV 2021

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8012-8021

  41. arXiv:2104.09829  [pdf, other]

    cs.CV

    Detector-Free Weakly Supervised Grounding by Separation

    Authors: Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

    Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object de…

    Submitted 20 April, 2021; originally announced April 2021.

  42. arXiv:2006.09199  [pdf, other]

    cs.CV cs.CL cs.MM cs.SD eess.AS

    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass

    Abstract: Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the nee…

    Submitted 29 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: A version of this work has been accepted to Interspeech 2021

  43. arXiv:2001.11122  [pdf, other]

    cs.CV

    Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences

    Authors: Rosaura G. VidalMata, Walter J. Scheirer, Anna Kukleva, David Cox, Hilde Kuehne

    Abstract: Understanding the structure of complex activities in untrimmed videos is a challenging task in the area of action recognition. One problem here is that this task usually requires a large amount of hand-annotated minute- or even hour-long video data, but annotating such data is very time consuming and can not easily be automated or scaled. To address this problem, this paper proposes an approach fo…

    Submitted 30 September, 2020; v1 submitted 29 January, 2020; originally announced January 2020.

  44. arXiv:1912.00869  [pdf, ps, other]

    cs.CV

    More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

    Authors: Quanfu Fan, Chun-Fu Chen, Hilde Kuehne, Marco Pistoia, David Cox

    Abstract: Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets. This results in a need for large GPU clusters to train and evaluate such architectures. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources.…

    Submitted 2 December, 2019; originally announced December 2019.

    Comments: Accepted at NeurIPS 2019, codes and models are available at https://github.com/IBM/bLVNet-TAM

    Report number: 32

    Journal ref: Advances in Neural Information Processing Systems (NeurIPS 2019)

  45. A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation

    Authors: Hilde Kuehne, Alexander Richard, Juergen Gall

    Abstract: Action recognition has become a rapidly developing research field within the last decade. But with the increasing demand for large scale data, the need of hand annotated data for the training becomes more and more impractical. One way to avoid frame-based human annotation is the use of action order information to learn the respective action classes. In this context, we propose a hierarchical appro…

    Submitted 3 June, 2019; originally announced June 2019.

    Comments: 15 pages, preprint for IEEE TPAMI https://ieeexplore.ieee.org/document/8585084 (open access). arXiv admin note: substantial text overlap with arXiv:1703.08132

    ACM Class: I.4.9

  46. arXiv:1906.01012  [pdf, other]

    cs.CV

    Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data

    Authors: Hilde Kuehne, Ahsan Iqbal, Alexander Richard, Juergen Gall

    Abstract: Action recognition is so far mainly focusing on the problem of classification of hand selected preclipped actions and reaching impressive results in this field. But with the performance even ceiling on current datasets, it also appears that the next steps in the field will have to go beyond this fully supervised classification. One way to overcome those problems is to move towards less restricted…

    Submitted 3 June, 2019; originally announced June 2019.

    Comments: 9 pages

    ACM Class: I.4.9

  47. arXiv:1904.04189  [pdf, other]

    cs.CV

    Unsupervised learning of action classes with continuous temporal embedding

    Authors: Anna Kukleva, Hilde Kuehne, Fadime Sener, Juergen Gall

    Abstract: The task of temporally detecting and segmenting actions in untrimmed videos has seen an increased attention recently. One problem in this context arises from the need to define and label action boundaries to create annotations for training which is very time and cost intensive. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. To…

    Submitted 8 April, 2019; originally announced April 2019.

    Comments: CVPR 2019

  48. arXiv:1805.06875  [pdf, other]

    cs.CV

    NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning

    Authors: Alexander Richard, Hilde Kuehne, Ahsan Iqbal, Juergen Gall

    Abstract: Video learning is an important task in computer vision and has experienced increasing interest over the recent years. Since even a small amount of videos easily comprises several million frames, methods that do not rely on a frame-level annotation are of special importance. In this work, we propose a novel learning algorithm with a Viterbi-based loss that allows for online and incremental learning…

    Submitted 17 May, 2018; originally announced May 2018.

    Comments: CVPR 2018

  49. arXiv:1706.08807  [pdf, ps, other]

    cs.CV

    Recurrent Residual Learning for Action Recognition

    Authors: Ahsan Iqbal, Alexander Richard, Hilde Kuehne, Juergen Gall

    Abstract: Action recognition is a fundamental problem in computer vision with a lot of potential applications such as video surveillance, human computer interaction, and robot learning. Given pre-segmented videos, the task is to recognize actions happening within videos. Historically, hand crafted video features were used to address the task of action recognition. With the success of Deep ConvNets as an ima…

    Submitted 27 June, 2017; originally announced June 2017.

  50. arXiv:1706.00699  [pdf, other]

    cs.CV

    Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints

    Authors: Alexander Richard, Hilde Kuehne, Juergen Gall

    Abstract: Action detection and temporal segmentation of actions in videos are topics of increasing interest. While fully supervised systems have gained much attention lately, full annotation of each action within the video is costly and impractical for large amounts of video data. Thus, weakly supervised action detection and temporal segmentation methods are of great importance. While most works in this are…

    Submitted 17 May, 2018; v1 submitted 2 June, 2017; originally announced June 2017.

    Comments: CVPR 2018