Search | arXiv e-print repository

Efficient Centroid-Linkage Clustering

Authors: MohammadHossein Bateni, Laxman Dhulipala, Willem Fletcher, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

Abstract: We give an efficient algorithm for Centroid-Linkage Hierarchical Agglomerative Clustering (HAC), which computes a $c$-approximate clustering in roughly $n^{1+O(1/c^2)}$ time. We obtain our result by combining a new Centroid-Linkage HAC algorithm with a novel fully dynamic data structure for nearest neighbor search which works under adaptive updates. We also evaluate our algorithm empirically. By… ▽ More We give an efficient algorithm for Centroid-Linkage Hierarchical Agglomerative Clustering (HAC), which computes a $c$-approximate clustering in roughly $n^{1+O(1/c^2)}$ time. We obtain our result by combining a new Centroid-Linkage HAC algorithm with a novel fully dynamic data structure for nearest neighbor search which works under adaptive updates. We also evaluate our algorithm empirically. By leveraging a state-of-the-art nearest-neighbor search library, we obtain a fast and accurate Centroid-Linkage HAC algorithm. Compared to an existing state-of-the-art exact baseline, our implementation maintains the clustering quality while delivering up to a $36\times$ speedup due to performing fewer distance comparisons. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.02929 [pdf, other]

Exploring Data Efficiency in Zero-Shot Learning with Diffusion Models

Authors: Zihan Ye, Shreyank N. Gowda, Xiaobo **, Xiaowei Huang, Haotian Xu, Yaochu **, Kaizhu Huang

Abstract: Zero-Shot Learning (ZSL) aims to enable classifiers to identify unseen classes by enhancing data efficiency at the class level. This is achieved by generating image features from pre-defined semantics of unseen classes. However, most current approaches heavily depend on the number of samples from seen classes, i.e. they do not consider instance-level effectiveness. In this paper, we demonstrate th… ▽ More Zero-Shot Learning (ZSL) aims to enable classifiers to identify unseen classes by enhancing data efficiency at the class level. This is achieved by generating image features from pre-defined semantics of unseen classes. However, most current approaches heavily depend on the number of samples from seen classes, i.e. they do not consider instance-level effectiveness. In this paper, we demonstrate that limited seen examples generally result in deteriorated performance of generative models. To overcome these challenges, we propose ZeroDiff, a Diffusion-based Generative ZSL model. This unified framework incorporates diffusion models to improve data efficiency at both the class and instance levels. Specifically, for instance-level effectiveness, ZeroDiff utilizes a forward diffusion chain to transform limited data into an expanded set of noised data. For class-level effectiveness, we design a two-branch generation structure that consists of a Diffusion-based Feature Generator (DFG) and a Diffusion-based Representation Generator (DRG). DFG focuses on learning and sampling the distribution of cross-entropy-based features, whilst DRG learns the supervised contrastive-based representation to boost the zero-shot capabilities of DFG. Additionally, we employ three discriminators to evaluate generated features from various aspects and introduce a Wasserstein-distance-based mutual learning loss to transfer knowledge among discriminators, thereby enhancing guidance for generation. Demonstrated through extensive experiments on three popular ZSL benchmarks, our ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be released upon acceptance. △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2404.19019 [pdf, other]

Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering

Authors: Laxman Dhulipala, Xiaojun Dong, Kishen N Gowda, Yan Gu

Abstract: Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree $T$, the SLD of $T$ is a binary dendrogram that summarizes the $n-1$ clusterings obtained by contracting the edges of $T$ in order of weight. Existing algorithms for computing the SLD all require $Ω(n\log n)$ work where $n = |T|$. Furthermore,… ▽ More Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree $T$, the SLD of $T$ is a binary dendrogram that summarizes the $n-1$ clusterings obtained by contracting the edges of $T$ in order of weight. Existing algorithms for computing the SLD all require $Ω(n\log n)$ work where $n = |T|$. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallel algorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires $O(n \log h)$ work and $O(\log^2 n \log^2 h)$ depth, where $h$ is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves $O(n\log h)$ work and $O(h \log n)$ depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly-efficient Union-Find algorithm typically used to compute SLDs in practice. △ Less

Submitted 12 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

Comments: To appear at SPAA 2024

arXiv:2404.14730 [pdf, other]

It's Hard to HAC with Average Linkage!

Authors: MohammadHossein Bateni, Laxman Dhulipala, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

Abstract: Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower… ▽ More Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of $n^{3/2-ε}$ on $n$ node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter $4$. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: To appear at ICALP 2024

arXiv:2401.17883 [pdf, other]

Reimagining Reality: A Comprehensive Survey of Video Inpainting Techniques

Authors: Shreyank N Gowda, Yash Thakre, Shashank Narayana Gowda, Xiaobo **

Abstract: This paper offers a comprehensive analysis of recent advancements in video inpainting techniques, a critical subset of computer vision and artificial intelligence. As a process that restores or fills in missing or corrupted portions of video sequences with plausible content, video inpainting has evolved significantly with the advent of deep learning methodologies. Despite the plethora of existing… ▽ More This paper offers a comprehensive analysis of recent advancements in video inpainting techniques, a critical subset of computer vision and artificial intelligence. As a process that restores or fills in missing or corrupted portions of video sequences with plausible content, video inpainting has evolved significantly with the advent of deep learning methodologies. Despite the plethora of existing methods and their swift development, the landscape remains complex, posing challenges to both novices and established researchers. Our study deconstructs major techniques, their underpinning theories, and their effective applications. Moreover, we conduct an exhaustive comparative study, centering on two often-overlooked dimensions: visual quality and computational efficiency. We adopt a human-centric approach to assess visual quality, enlisting a panel of annotators to evaluate the output of different video inpainting techniques. This provides a nuanced qualitative understanding that complements traditional quantitative metrics. Concurrently, we delve into the computational aspects, comparing inference times and memory demands across a standardized hardware setup. This analysis underscores the balance between quality and efficiency: a critical consideration for practical applications where resources may be constrained. By integrating human validation and computational resource comparison, this survey not only clarifies the present landscape of video inpainting techniques but also charts a course for future explorations in this vibrant and evolving field. △ Less

Submitted 31 January, 2024; originally announced January 2024.

arXiv:2401.11406 [pdf, other]

Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts

Authors: Kiyoon Kim, Shreyank N Gowda, Panagiotis Eustratiadis, Antreas Antoniou, Robert B Fisher

Abstract: Despite recent advances in video action recognition achieving strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. One method uses two different datasets collected from different sources and uses one… ▽ More Despite recent advances in video action recognition achieving strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. One method uses two different datasets collected from different sources and uses one for training and validation, and the other for testing. More precisely, we created dataset splits of HMDB-51 or UCF-101 for training, and Kinetics-400 for testing, using the subset of the classes that are overlap** in both train and test datasets. The other proposed method extracts the feature mean of each class from the target evaluation dataset's training data (i.e. class prototype) and estimates test video prediction as a cosine similarity score between each sample to the class prototypes of each target class. This procedure does not alter model weights using the target dataset and it does not require aligning overlap** classes of two different datasets, thus is a very efficient method to test the model robustness to distribution shifts without prior knowledge of the target distribution. We address the robustness problem by adversarial augmentation training - generating augmented views of videos that are "hard" for the classification model by applying gradient ascent on the augmentation parameters - as well as "curriculum" scheduling the strength of the video augmentations. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models - TSM, Video Swin Transformer, and Uniformer. The presented work provides critical insight into model robustness to distribution shifts and presents effective techniques to enhance video action recognition performance in a real-world deployment. △ Less

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2310.06522 [pdf, other]

Watt For What: Rethinking Deep Learning's Energy-Performance Relationship

Authors: Shreyank N Gowda, Xinyue Hao, Gen Li, Laura Sevilla-Lara, Shashank Narayana Gowda

Abstract: Deep learning models have revolutionized various fields, from image recognition to natural language processing, by achieving unprecedented levels of accuracy. However, their increasing energy consumption has raised concerns about their environmental impact, disadvantaging smaller entities in research and exacerbating global energy consumption. In this paper, we explore the trade-off between model… ▽ More Deep learning models have revolutionized various fields, from image recognition to natural language processing, by achieving unprecedented levels of accuracy. However, their increasing energy consumption has raised concerns about their environmental impact, disadvantaging smaller entities in research and exacerbating global energy consumption. In this paper, we explore the trade-off between model accuracy and electricity consumption, proposing a metric that penalizes large consumption of electricity. We conduct a comprehensive study on the electricity consumption of various deep learning models across different GPUs, presenting a detailed analysis of their accuracy-efficiency trade-offs. By evaluating accuracy per unit of electricity consumed, we demonstrate how smaller, more energy-efficient models can significantly expedite research while mitigating environmental concerns. Our results highlight the potential for a more sustainable approach to deep learning, emphasizing the importance of optimizing models for efficiency. This research also contributes to a more equitable research landscape, where smaller entities can compete effectively with larger counterparts. This advocates for the adoption of efficient deep learning practices to reduce electricity consumption, safeguarding the environment for future generations whilst also hel** ensure a fairer competitive landscape. △ Less

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2309.17327 [pdf, other]

Telling Stories for Common Sense Zero-Shot Action Recognition

Authors: Shreyank N Gowda, Laura Sevilla-Lara

Abstract: Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual desc… ▽ More Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classification. Without any target dataset fine-tuning, our method achieves new state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this exciting domain. The data can be found here: https://github.com/kini5gowda/Stories . △ Less

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.01390 [pdf, other]

Bridging the Projection Gap: Overcoming Projection Bias Through Parameterized Distance Learning

Authors: Chong Zhang, Mingyu **, Qinkai Yu, Haochen Xue, Shreyank N Gowda, Xiaobo **

Abstract: Generalized zero-shot learning (GZSL) aims to recognize samples from both seen and unseen classes using only seen class samples for training. However, GZSL methods are prone to bias towards seen classes during inference due to the projection function being learned from seen classes. Most methods focus on learning an accurate projection, but bias in the projection is inevitable. We address this pro… ▽ More Generalized zero-shot learning (GZSL) aims to recognize samples from both seen and unseen classes using only seen class samples for training. However, GZSL methods are prone to bias towards seen classes during inference due to the projection function being learned from seen classes. Most methods focus on learning an accurate projection, but bias in the projection is inevitable. We address this projection bias by proposing to learn a parameterized Mahalanobis distance metric for robust inference. Our key insight is that the distance computation during inference is critical, even with a biased projection. We make two main contributions - (1) We extend the VAEGAN (Variational Autoencoder \& Generative Adversarial Networks) architecture with two branches to separately output the projection of samples from seen and unseen classes, enabling more robust distance learning. (2) We introduce a novel loss function to optimize the Mahalanobis distance representation and reduce projection bias. Extensive experiments on four datasets show that our approach outperforms state-of-the-art GZSL techniques with improvements of up to 3.5 \% on the harmonic mean metric. △ Less

Submitted 2 April, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: 18 pages, 9 figures

arXiv:2308.16041 [pdf, other]

From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications

Authors: Shreyank N Gowda, Dheeraj Pandey, Shashank Narayana Gowda

Abstract: Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads. This paper presents a comprehensive survey of state-of-the-art methods for talking head generation. We systematically categorises them into four main approaches: image-driven, audio-driven, video-driven and others (including neural radiance fields (NeRF), and 3D-based met… ▽ More Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads. This paper presents a comprehensive survey of state-of-the-art methods for talking head generation. We systematically categorises them into four main approaches: image-driven, audio-driven, video-driven and others (including neural radiance fields (NeRF), and 3D-based methods). We provide an in-depth analysis of each method, highlighting their unique contributions, strengths, and limitations. Furthermore, we thoroughly compare publicly available models, evaluating them on key aspects such as inference time and human-rated quality of the generated outputs. Our aim is to provide a clear and concise overview of the current landscape in talking head generation, elucidating the relationships between different approaches and identifying promising directions for future research. This survey will serve as a valuable reference for researchers and practitioners interested in this rapidly evolving field. △ Less

Submitted 30 August, 2023; originally announced August 2023.

arXiv:2306.04822 [pdf, other]

Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Authors: Shreyank N Gowda, Anurag Arnab, Jonathan Huang

Abstract: In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach that is adopted by many state of the art approach… ▽ More In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach that is adopted by many state of the art approaches. Despite standing out for its favorable speed/accuracy tradeoffs among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. This leads to a low accuracy model if naively done. But we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) (2) introducing a compact adapter model connecting frozen spatial representations ((a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by $\sim 50\%$) and memory consumption while maintaining or slightly improving performance by up to 1.79\% compared to the baseline model. Our approach additionally unlocks the capability to utilize larger image transformer models as our spatial transformer and access more frames with the same memory consumption. △ Less

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2304.02846 [pdf, other]

Synthetic Sample Selection for Generalized Zero-Shot Learning

Authors: Shreyank N Gowda

Abstract: Generalized Zero-Shot Learning (GZSL) has emerged as a pivotal research domain in computer vision, owing to its capability to recognize objects that have not been seen during training. Despite the significant progress achieved by generative techniques in converting traditional GZSL to fully supervised learning, they tend to generate a large number of synthetic features that are often redundant, th… ▽ More Generalized Zero-Shot Learning (GZSL) has emerged as a pivotal research domain in computer vision, owing to its capability to recognize objects that have not been seen during training. Despite the significant progress achieved by generative techniques in converting traditional GZSL to fully supervised learning, they tend to generate a large number of synthetic features that are often redundant, thereby increasing training time and decreasing accuracy. To address this issue, this paper proposes a novel approach for synthetic feature selection using reinforcement learning. In particular, we propose a transformer-based selector that is trained through proximal policy optimization (PPO) to select synthetic features based on the validation classification accuracy of the seen classes, which serves as a reward. The proposed method is model-agnostic and data-agnostic, making it applicable to both images and videos and versatile for diverse applications. Our experimental results demonstrate the superiority of our approach over existing feature-generating methods, yielding improved overall performance on multiple benchmarks. △ Less

Submitted 5 April, 2023; originally announced April 2023.

Comments: Paper accepted in CVPRW 2023

arXiv:2301.02200 [pdf, other]

doi 10.1109/ICCRE57112.2023.10155607

Impact, Attention, Influence: Early Assessment of Autonomous Driving Datasets

Authors: Daniel Bogdoll, Jonas Hendl, Felix Schreyer, Nishanth Gowda, Michael Färber, J. Marius Zöllner

Abstract: Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has been conducted in other fields, it rarely revolves… ▽ More Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has been conducted in other fields, it rarely revolves around datasets. Thus, the impact, attention, and influence of datasets on autonomous driving remains a rarely investigated field. In this work, we provide a scientometric analysis for over 200 datasets in AD. We perform a rigorous evaluation of relations between available metadata and citation counts based on linear regression. Subsequently, we propose an Influence Score to assess a dataset already early on without the need for a track-record of citations, which is only available with a certain delay. △ Less

Submitted 31 March, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: Daniel Bogdoll and Jonas Hendl contributed equally. Accepted for publication at ICCRE 2023

arXiv:2210.13395 [pdf, other]

Improved Bi-point Rounding Algorithms and a Golden Barrier for $k$-Median

Authors: Kishen N. Gowda, Thomas Pensyl, Aravind Srinivasan, Khoa Trinh

Abstract: The current best approximation algorithms for $k$-median rely on first obtaining a structured fractional solution known as a bi-point solution, and then rounding it to an integer solution. We improve this second step by unifying and refining previous approaches. We describe a hierarchy of increasingly-complex partitioning schemes for the facilities, along with corresponding sets of algorithms and… ▽ More The current best approximation algorithms for $k$-median rely on first obtaining a structured fractional solution known as a bi-point solution, and then rounding it to an integer solution. We improve this second step by unifying and refining previous approaches. We describe a hierarchy of increasingly-complex partitioning schemes for the facilities, along with corresponding sets of algorithms and factor-revealing non-linear programs. We prove that the third layer of this hierarchy is a $2.613$-approximation, improving upon the current best ratio of $2.675$, while no layer can be proved better than $2.588$ under the proposed analysis. On the negative side, we give a family of bi-point solutions which cannot be approximated better than the square root of the golden ratio, even if allowed to open $k+o(k)$ facilities. This gives a barrier to current approaches for obtaining an approximation better than $2 \sqrtφ \approx 2.544$. Altogether we reduce the approximation gap of bi-point solutions by two thirds. △ Less

Submitted 24 October, 2022; originally announced October 2022.

arXiv:2209.15501 [pdf, other]

A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos

Authors: Anil Batra, Shreyank N Gowda, Frank Keller, Laura Sevilla-Lara

Abstract: Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundary of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). In this paper, we take a closer look a… ▽ More Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundary of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). In this paper, we take a closer look at PSS and propose three fundamental improvements over current methods. The segmentation task is critical, as generating a correct summary requires each step of the procedure to be correctly identified. However, current segmentation metrics often overestimate the segmentation quality because they do not consider the temporal order of segments. In our first contribution, we propose a new segmentation metric that takes into account the order of segments, giving a more reliable measure of the accuracy of a given predicted segmentation. Current PSS methods are typically trained by proposing segments, matching them with the ground truth and computing a loss. However, much like segmentation metrics, existing matching algorithms do not consider the temporal order of the map** between candidate segments and the ground truth. In our second contribution, we propose a matching algorithm that constrains the temporal order of segment map**, and is also differentiable. Lastly, we introduce multi-modal feature training for PSS, which further improves segmentation. We evaluate our approach on two instructional video datasets (YouCook2 and Tasty) and observe an improvement over the state-of-the-art of $\sim7\%$ and $\sim2.5\%$ for procedure segmentation and summarization, respectively. △ Less

Submitted 7 October, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: Accepted at BMVC 2022

arXiv:2208.10095 [pdf, other]

Socially Fair Center-based and Linear Subspace Clustering

Authors: Sruthi Gorantla, Kishen N. Gowda, Amit Deshpande, Anand Louis

Abstract: Center-based clustering (e.g., $k$-means, $k$-medians) and clustering using linear subspaces are two most popular techniques to partition real-world data into smaller clusters. However, when the data consists of sensitive demographic groups, significantly different clustering cost per point for different sensitive groups can lead to fairness-related harms (e.g., different quality-of-service). The… ▽ More Center-based clustering (e.g., $k$-means, $k$-medians) and clustering using linear subspaces are two most popular techniques to partition real-world data into smaller clusters. However, when the data consists of sensitive demographic groups, significantly different clustering cost per point for different sensitive groups can lead to fairness-related harms (e.g., different quality-of-service). The goal of socially fair clustering is to minimize the maximum cost of clustering per point over all groups. In this work, we propose a unified framework to solve socially fair center-based clustering and linear subspace clustering, and give practical, efficient approximation algorithms for these problems. We do extensive experiments to show that on multiple benchmark datasets our algorithms either closely match or outperform state-of-the-art baselines. △ Less

Submitted 22 August, 2022; originally announced August 2022.

Comments: 17 pages

arXiv:2206.04790 [pdf, other]

Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition

Authors: Shreyank N Gowda, Marcus Rohrbach, Frank Keller, Laura Sevilla-Lara

Abstract: We address the problem of data augmentation for video action recognition. Standard augmentation strategies in video are hand-designed and sample the space of possible augmented data points either at random, without knowing which augmented points will be better, or through heuristics. We propose to learn what makes a good video for action recognition and select only high-quality samples for augment… ▽ More We address the problem of data augmentation for video action recognition. Standard augmentation strategies in video are hand-designed and sample the space of possible augmented data points either at random, without knowing which augmented points will be better, or through heuristics. We propose to learn what makes a good video for action recognition and select only high-quality samples for augmentation. In particular, we choose video compositing of a foreground and a background video as the data augmentation process, which results in diverse and realistic new samples. We learn which pairs of videos to augment without having to actually composite them. This reduces the space of possible augmentations, which has two advantages: it saves computational cost and increases the accuracy of the final trained classifier, as the augmented pairs are of higher quality than average. We present experimental results on the entire spectrum of training settings: few-shot, semi-supervised and fully supervised. We observe consistent improvements across all of them over prior work and baselines on Kinetics, UCF101, HMDB51, and achieve a new state-of-the-art on settings with limited data. We see improvements of up to 8.6% in the semi-supervised setting. △ Less

Submitted 23 July, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: Accepted to ECCV-2022

arXiv:2201.10394 [pdf, other]

Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition

Authors: Kiyoon Kim, Shreyank N Gowda, Oisin Mac Aodha, Laura Sevilla-Lara

Abstract: We address the problem of capturing temporal information for video classification in 2D networks, without increasing their computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g. by including filters in the temporal dimension to turn them into 3D networks, or using optical flow, etc.), which increases computation cost. Instead, we propose a novel sampling s… ▽ More We address the problem of capturing temporal information for video classification in 2D networks, without increasing their computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g. by including filters in the temporal dimension to turn them into 3D networks, or using optical flow, etc.), which increases computation cost. Instead, we propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes. We observe that without bells and whistles, the proposed sampling strategy improves performance on multiple architectures (e.g. TSN, TRN, TSM, and MVFNet) and datasets (CATER, Something-Something-V1 and V2), up to 24% over the baseline of using the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing. Given the generality of the results and the flexibility of the approach, we hope this can be widely useful to the video understanding community. Code is available on our website: https://github.com/kiyoon/channel_sampling. △ Less

Submitted 10 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: BMVC 2022

arXiv:2107.13029 [pdf, other]

A New Split for Evaluating True Zero-Shot Action Recognition

Authors: Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach

Abstract: Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets(e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, where classes largely overlap… ▽ More Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets(e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, where classes largely overlap with classes in the zero-shot evaluation datasets. As a result, classes which are supposed to be unseen, are present during supervised pre-training, invalidating the condition of the zero-shot setting. A similar concern was previously noted several years ago for image based zero-shot recognition but has not been considered by the zero-shot action recognition community. In this paper, we propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes. We benchmark several recent approaches on the proposed True Zero-Shot(TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation. In our extensive analysis, we find that our TruZesplits are significantly harder than comparable random splits as nothing is leaking from pre-training, i.e. unseen performance is consistently lower,up to 8.9% for zero-shot action recognition. In an additional evaluation we also find that similar issues exist in the splits used in few-shot action recognition, here we see differences of up to 17.1%. We publish oursplits1and hope that our benchmark analysis will change how the field is evaluating zero- and few-shot action recognition moving forward. △ Less

Submitted 13 September, 2021; v1 submitted 27 July, 2021; originally announced July 2021.

Comments: Accepted to GCPR 2021

arXiv:2107.00443 [pdf, other]

Test Framework for a Virtual Competition Testbed

Authors: Liam Wellacott, Emilyann Nault, Ioannis Skottis, Alexandre Colle, Shreyank N Gowda, Pierre Nicolay, Emily Rolley-Parnell

Abstract: Virtual environments have been utilised in robotics research as a tool to assess systems before deploying them in the field. The COVID-19 pandemic has brought about additional motivation for the development of virtual benchmarks in order to aid in safe and productive development. In-person robotics competitions have also halted, thus limiting the scope of opportunities for students and researchers… ▽ More Virtual environments have been utilised in robotics research as a tool to assess systems before deploying them in the field. The COVID-19 pandemic has brought about additional motivation for the development of virtual benchmarks in order to aid in safe and productive development. In-person robotics competitions have also halted, thus limiting the scope of opportunities for students and researchers. We implemented the structure of a service robotics competition into an extendable and adaptable virtual scoring environment. The competition challenges the state of the art in home service robotics by presenting realistic household tasks for robots to complete. The virtual environment provides a foundation for competition teams to assess their systems when accessing the physical environment is not possible. We believe that utilising virtual environments as a means of assessment will lead to other benefits such as increased access and generalisation. △ Less

Submitted 1 July, 2021; originally announced July 2021.

arXiv:2101.07042 [pdf, other]

CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition

Authors: Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach

Abstract: Zero-shot action recognition is the task of recognizingaction classes without visual examples, only with a seman-tic embedding which relates unseen to seen classes. Theproblem can be seen as learning a function which general-izes well to instances of unseen classes without losing dis-crimination between classes. Neural networks can modelthe complex boundaries between visual classes, which ex-plain… ▽ More Zero-shot action recognition is the task of recognizingaction classes without visual examples, only with a seman-tic embedding which relates unseen to seen classes. Theproblem can be seen as learning a function which general-izes well to instances of unseen classes without losing dis-crimination between classes. Neural networks can modelthe complex boundaries between visual classes, which ex-plains their success as supervised models. However, inzero-shot learning, these highly specialized class bound-aries may not transfer well from seen to unseen classes.In this paper we propose a centroid-based representation,which clusters visual and semantic representation, consid-ers all training samples at once, and in this way generaliz-ing well to instances from unseen classes. We optimize theclustering using Reinforcement Learning which we show iscritical for our approach to work. We call the proposedmethod CLASTER and observe that it consistently outper-forms the state-of-the-art in all standard datasets, includ-ing UCF101, HMDB51 and Olympic Sports; both in thestandard zero-shot evaluation and the generalized zero-shotlearning. Further, we show that our model performs com-petitively in the image domain as well, outperforming thestate-of-the-art in many settings. △ Less

Submitted 23 July, 2022; v1 submitted 18 January, 2021; originally announced January 2021.

Comments: Accepted to ECCV-22

arXiv:2012.10671 [pdf, other]

SMART Frame Selection for Action Recognition

Authors: Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

Abstract: Action recognition is computationally expensive. In this paper, we address the problem of frame selection to improve the accuracy of action recognition. In particular, we show that selecting good frames helps in action recognition performance even in the trimmed videos domain. Recent work has successfully leveraged frame selection for long, untrimmed videos, where much of the content is not releva… ▽ More Action recognition is computationally expensive. In this paper, we address the problem of frame selection to improve the accuracy of action recognition. In particular, we show that selecting good frames helps in action recognition performance even in the trimmed videos domain. Recent work has successfully leveraged frame selection for long, untrimmed videos, where much of the content is not relevant, and easy to discard. In this work, however, we focus on the more standard short, trimmed action recognition problem. We argue that good frame selection can not only reduce the computational cost of action recognition but also increase the accuracy by getting rid of frames that are hard to classify. In contrast to previous work, we propose a method that instead of selecting frames by considering one at a time, considers them jointly. This results in a more efficient selection, where good frames are more effectively distributed over the video, like snapshots that tell a story. We call the proposed frame selection SMART and we test it in combination with different backbone architectures and on multiple benchmarks (Kinetics, Something-something, UCF101). We show that the SMART frame selection consistently improves the accuracy compared to other frame selection strategies while reducing the computational cost by a factor of 4 to 10 times. Additionally, we show that when the primary goal is recognition performance, our selection strategy can improve over recent state-of-the-art models and frame selection strategies on various benchmarks (UCF101, HMDB51, FCVID, and ActivityNet). △ Less

Submitted 19 December, 2020; originally announced December 2020.

Comments: To be published in AAAI-21

arXiv:2009.13949 [pdf, other]

Improved FPT Algorithms for Deletion to Forest-like Structures

Authors: Kishen N. Gowda, Aditya Lonkar, Fahad Panolan, Vraj Patel, Saket Saurabh

Abstract: The Feedback Vertex Set problem is undoubtedly one of the most well-studied problems in Parameterized Complexity. In this problem, given an undirected graph $G$ and a non-negative integer $k$, the objective is to test whether there exists a subset $S\subseteq V(G)$ of size at most $k$ such that $G-S$ is a forest. After a long line of improvement, recently, Li and Nederlof [SODA, 2020] designed a r… ▽ More The Feedback Vertex Set problem is undoubtedly one of the most well-studied problems in Parameterized Complexity. In this problem, given an undirected graph $G$ and a non-negative integer $k$, the objective is to test whether there exists a subset $S\subseteq V(G)$ of size at most $k$ such that $G-S$ is a forest. After a long line of improvement, recently, Li and Nederlof [SODA, 2020] designed a randomized algorithm for the problem running in time $\mathcal{O}^{\star}(2.7^k)$. In the Parameterized Complexity literature, several problems around Feedback Vertex Set have been studied. Some of these include Independent Feedback Vertex Set (where the set $S$ should be an independent set in $G$), Almost Forest Deletion and Pseudoforest Deletion. In Pseudoforest Deletion, each connected component in $G-S$ has at most one cycle in it. However, in Almost Forest Deletion, the input is a graph $G$ and non-negative integers $k,\ell \in \mathbb{N}$, and the objective is to test whether there exists a vertex subset $S$ of size at most $k$, such that $G-S$ is $\ell$ edges away from a forest. In this paper, using the methodology of Li and Nederlof [SODA, 2020], we obtain the current fastest algorithms for all these problems. In particular we obtain following randomized algorithms. 1) Independent Feedback Vertex Set can be solved in time $\mathcal{O}^{\star}(2.7^k)$. 2) Pseudo Forest Deletion can be solved in time $\mathcal{O}^{\star}(2.85^k)$. 3) Almost Forest Deletion can be solved in $\mathcal{O}^{\star}(\min\{2.85^k \cdot 8.54^\ell,2.7^k \cdot 36.61^\ell,3^k \cdot 1.78^\ell\})$. △ Less

Submitted 26 September, 2020; originally announced September 2020.

Comments: ISAAC 2020, 36 pages. arXiv admin note: text overlap with arXiv:1906.12298, arXiv:1103.0534 by other authors

arXiv:2005.13039 [pdf, other]

ALBA : Reinforcement Learning for Video Object Segmentation

Authors: Shreyank N Gowda, Panagiotis Eustratiadis, Timothy Hospedales, Laura Sevilla-Lara

Abstract: We consider the challenging problem of zero-shot video object segmentation (VOS). That is, segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grou** problem by exploiting object proposals and making a joint inference about grou** over both space and time. We propose a network architecture for tractably perfo… ▽ More We consider the challenging problem of zero-shot video object segmentation (VOS). That is, segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grou** problem by exploiting object proposals and making a joint inference about grou** over both space and time. We propose a network architecture for tractably performing proposal selection and joint grou**. Crucially, we then show how to train this network with reinforcement learning so that it learns to perform the optimal non-myopic sequence of grou** decisions to segment the whole video. Unlike standard supervised techniques, this also enables us to directly optimize for the non-differentiable overlap-based metrics used to evaluate VOS. We show that the proposed method, which we call ALBA outperforms the previous stateof-the-art on three benchmarks: DAVIS 2017 [2], FBMS [20] and Youtube-VOS [27]. △ Less

Submitted 14 August, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

arXiv:2005.03176 [pdf, ps, other]

doi 10.1007/978-3-030-48966-3_21

A Parameterized Perspective on Attacking and Defending Elections

Authors: Kishen N. Gowda, Neeldhara Misra, Vraj Patel

Abstract: We consider the problem of protecting and manipulating elections by recounting and changing ballots, respectively. Our setting involves a plurality-based election held across multiple districts, and the problem formulations are based on the model proposed recently by~[Elkind et al, IJCAI 2019]. It turns out that both of the manipulation and protection problems are NP-complete even in fairly simple… ▽ More We consider the problem of protecting and manipulating elections by recounting and changing ballots, respectively. Our setting involves a plurality-based election held across multiple districts, and the problem formulations are based on the model proposed recently by~[Elkind et al, IJCAI 2019]. It turns out that both of the manipulation and protection problems are NP-complete even in fairly simple settings. We study these problems from a parameterized perspective with the goal of establishing a more detailed complexity landscape. The parameters we consider include the number of voters, and the budgets of the attacker and the defender. While we observe fixed-parameter tractability when parameterizing by number of voters, our main contribution is a demonstration of parameterized hardness when working with the budgets of the attacker and the defender. △ Less

Submitted 6 May, 2020; originally announced May 2020.

arXiv:2003.05005 [pdf, other]

Using an ensemble color space model to tackle adversarial examples

Authors: Shreyank N Gowda, Chun Yuan

Abstract: Minute pixel changes in an image drastically change the prediction that the deep learning model makes. One of the most significant problems that could arise due to this, for instance, is autonomous driving. Many methods have been proposed to combat this with varying amounts of success. We propose a 3 step method for defending such attacks. First, we denoise the image using statistical methods. Sec… ▽ More Minute pixel changes in an image drastically change the prediction that the deep learning model makes. One of the most significant problems that could arise due to this, for instance, is autonomous driving. Many methods have been proposed to combat this with varying amounts of success. We propose a 3 step method for defending such attacks. First, we denoise the image using statistical methods. Second, we show that adopting multiple color spaces in the same model can help us to fight these adversarial attacks further as each color space detects certain features explicit to itself. Finally, the feature maps generated are enlarged and sent back as an input to obtain even smaller features. We show that the proposed model does not need to be trained to defend an particular type of attack and is inherently more robust to black-box, white-box, and grey-box adversarial attack techniques. In particular, the model is 56.12 percent more robust than compared models in case of white box attacks when the models are not subject to adversarial example training. △ Less

Submitted 10 March, 2020; originally announced March 2020.

arXiv:2002.02413 [pdf, other]

doi 10.1007/978-3-030-73973-7_30

StegColNet: Steganalysis based on an ensemble colorspace approach

Authors: Shreyank N Gowda, Chun Yuan

Abstract: Image steganography refers to the process of hiding information inside images. Steganalysis is the process of detecting a steganographic image. We introduce a steganalysis approach that uses an ensemble color space model to obtain a weighted concatenated feature activation map. The concatenated map helps to obtain certain features explicit to each color space. We use a levy-flight grey wolf optimi… ▽ More Image steganography refers to the process of hiding information inside images. Steganalysis is the process of detecting a steganographic image. We introduce a steganalysis approach that uses an ensemble color space model to obtain a weighted concatenated feature activation map. The concatenated map helps to obtain certain features explicit to each color space. We use a levy-flight grey wolf optimization strategy to reduce the number of features selected in the map. We then use these features to classify the image into one of two classes: whether the given image has secret information stored or not. Extensive experiments have been done on a large scale dataset extracted from the Bossbase dataset. Also, we show that the model can be transferred to different datasets and perform extensive experiments on a mixture of datasets. Our results show that the proposed approach outperforms the recent state of the art deep learning steganalytical approaches by 2.32 percent on average for 0.2 bits per channel (bpc) and 1.87 percent on average for 0.4 bpc. △ Less

Submitted 16 October, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

arXiv:1906.07421 [pdf]

Using colorization as a tool for automatic makeup suggestion

Authors: Shreyank Narayana Gowda

Abstract: Colorization is the method of converting an image in grayscale to a fully color image. There are multiple methods to do the same. Old school methods used machine learning algorithms and optimization techniques to suggest possible colors to use. With advances in the field of deep learning, colorization results have improved consistently with improvements in deep learning architectures. The latest d… ▽ More Colorization is the method of converting an image in grayscale to a fully color image. There are multiple methods to do the same. Old school methods used machine learning algorithms and optimization techniques to suggest possible colors to use. With advances in the field of deep learning, colorization results have improved consistently with improvements in deep learning architectures. The latest development in the field of deep learning is the emergence of generative adversarial networks (GANs) which is used to generate information and not just predict or classify. As part of this report, 2 architectures of recent papers are reproduced along with a novel architecture being suggested for general colorization. Following this, we propose the use of colorization by generating makeup suggestions automatically on a face. To do this, a dataset consisting of 1000 images has been created. When an image of a person without makeup is sent to the model, the model first converts the image to grayscale and then passes it through the suggested GAN model. The output is a generated makeup suggestion. To develop this model, we need to tweak the general colorization model to deal only with faces of people. △ Less

Submitted 18 June, 2019; originally announced June 2019.

arXiv:1902.00267 [pdf, ps, other]

ColorNet: Investigating the importance of color spaces for image classification

Authors: Shreyank N Gowda, Chun Yuan

Abstract: Image classification is a fundamental application in computer vision. Recently, deeper networks and highly connected networks have shown state of the art performance for image classification tasks. Most datasets these days consist of a finite number of color images. These color images are taken as input in the form of RGB images and classification is done without modifying them. We explore the imp… ▽ More Image classification is a fundamental application in computer vision. Recently, deeper networks and highly connected networks have shown state of the art performance for image classification tasks. Most datasets these days consist of a finite number of color images. These color images are taken as input in the form of RGB images and classification is done without modifying them. We explore the importance of color spaces and show that color spaces (essentially transformations of original RGB images) can significantly affect classification accuracy. Further, we show that certain classes of images are better represented in particular color spaces and for a dataset with a highly varying number of classes such as CIFAR and Imagenet, using a model that considers multiple color spaces within the same model gives excellent levels of accuracy. Also, we show that such a model, where the input is preprocessed into multiple color spaces simultaneously, needs far fewer parameters to obtain high accuracy for classification. For example, our model with 1.75M parameters significantly outperforms DenseNet 100-12 that has 12M parameters and gives results comparable to Densenet-BC-190-40 that has 25.6M parameters for classification of four competitive image classification datasets namely: CIFAR-10, CIFAR-100, SVHN and Imagenet. Our model essentially takes an RGB image as input, simultaneously converts the image into 7 different color spaces and uses these as inputs to individual densenets. We use small and wide densenets to reduce computation overhead and number of hyperparameters required. We obtain significant improvement on current state of the art results on these datasets as well. △ Less

Submitted 1 February, 2019; originally announced February 2019.

Journal ref: Asian Conference on Computer Vision 2018

Showing 1–29 of 29 results for author: Gowda, N