-
Clockwork Diffusion: Efficient Generation With Model-Step Distillation
Authors:
Amirhossein Habibian,
Amir Ghodrati,
Noor Fathima,
Guillaume Sautiere,
Risheek Garrepalli,
Fatih Porikli,
Jens Petersen
Abstract:
This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change.
Submitted 20 February, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
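The reuse idea in the Clockwork Diffusion abstract above can be illustrated with a toy loop. This is only a sketch under stated assumptions: the functions `low_res_fn` and `high_res_fn`, the `clock` parameter, and the fixed "recompute every `clock`-th step" rule are hypothetical stand-ins, not the schedule or network described in the paper.

```python
def denoise(x, num_steps, clock, low_res_fn, high_res_fn):
    """Toy denoising loop that recomputes the low-res branch only periodically.

    low_res_fn(x, step)          -> hypothetical expensive low-res features
    high_res_fn(x, cached, step) -> hypothetical cheap update using cached features
    Returns the final value and how many times the low-res branch was run.
    """
    cached = None
    low_res_calls = 0
    for step in range(num_steps):
        # Assumed rule: full low-res pass on every `clock`-th step, reuse otherwise.
        if cached is None or step % clock == 0:
            cached = low_res_fn(x, step)
            low_res_calls += 1
        x = high_res_fn(x, cached, step)
    return x, low_res_calls


# Toy stand-ins for the two parts of the denoiser (assumptions for illustration).
toy_low = lambda x, step: 0.5 * x
toy_high = lambda x, cached, step: x - 0.1 * cached

_, calls_reuse = denoise(1.0, num_steps=8, clock=2, low_res_fn=toy_low, high_res_fn=toy_high)
_, calls_full = denoise(1.0, num_steps=8, clock=1, low_res_fn=toy_low, high_res_fn=toy_high)
```

With `clock=2` the low-res branch runs on steps 0, 2, 4, and 6 only, so half of the expensive calls are skipped in this toy setting; the actual savings reported in the abstract depend on the real architecture.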
-
Skip-Attention: Improving Vision Transformers by Paying Less Attention
Authors:
Shashanka Venkataramanan,
Amir Ghodrati,
Yuki M. Asano,
Fatih Porikli,
Amirhossein Habibian
Abstract:
This work aims to improve the efficiency of vision transformers (ViT). While ViTs use computationally expensive self-attention operations in every layer, we identify that these operations are highly correlated across layers -- a key redundancy that causes unnecessary computations. Based on this observation, we propose SkipAt, a method to reuse self-attention computation from preceding layers to approximate attention at one or more subsequent layers. To ensure that reusing self-attention blocks across layers does not degrade the performance, we introduce a simple parametric function, which outperforms the baseline transformer's performance while running computationally faster. We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS. We achieve improved throughput at the same-or-higher accuracy levels in all these tasks.
Submitted 17 January, 2023; v1 submitted 5 January, 2023;
originally announced January 2023.
-
SALISA: Saliency-based Input Sampling for Efficient Video Object Detection
Authors:
Babak Ehteshami Bejnordi,
Amirhossein Habibian,
Fatih Porikli,
Amir Ghodrati
Abstract:
High-resolution images are widely adopted for high-performance object detection in videos. However, processing high-resolution inputs comes with high computation costs, and naive down-sampling of the input to reduce the computation costs quickly degrades the detection performance. In this paper, we propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection that allows for heavy down-sampling of unimportant background regions while preserving the fine-grained details of a high-resolution image. The resulting image is spatially smaller, leading to reduced computational costs while enabling a performance comparable to a high-resolution input. To achieve this, we propose a differentiable resampling module based on a thin plate spline spatial transformer network (TPS-STN). This module is regularized by a novel loss to provide an explicit supervision signal to learn to "magnify" salient regions. We report state-of-the-art results in the low compute regime on the ImageNet-VID and UA-DETRAC video object detection datasets. We demonstrate that on both datasets, the mAP of an EfficientDet-D1 (EfficientDet-D2) is on par with EfficientDet-D2 (EfficientDet-D3) at a much lower computational cost. We also show that SALISA significantly improves the detection of small objects. In particular, SALISA with an EfficientDet-D1 detector improves the detection of small objects by $77\%$, and remarkably also outperforms the EfficientDet-D3 baseline.
Submitted 5 April, 2022;
originally announced April 2022.
-
FrameExit: Conditional Early Exiting for Efficient Video Recognition
Authors:
Amir Ghodrati,
Babak Ehteshami Bejnordi,
Amirhossein Habibian
Abstract:
In this paper, we propose a conditional early exiting framework for efficient video recognition. While existing works focus on selecting a subset of salient frames to reduce the computation costs, we propose to use a simple sampling strategy combined with conditional early exiting to enable efficient recognition. Our model automatically learns to process fewer frames for simpler videos and more frames for complex ones. To achieve this, we employ a cascade of gating modules to automatically determine the earliest point in processing where an inference is sufficiently reliable. We generate on-the-fly supervision signals to the gates to provide a dynamic trade-off between accuracy and computational cost. Our proposed model outperforms competing methods on three large-scale video benchmarks. In particular, on ActivityNet1.3 and mini-kinetics, we outperform the state-of-the-art efficient video recognition methods with 1.3$\times$ and 2.1$\times$ fewer GFLOPs, respectively. Additionally, our method sets a new state of the art for efficient video understanding on the HVU benchmark.
Submitted 27 April, 2021;
originally announced April 2021.
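The early-exit behaviour described in the FrameExit abstract above can be sketched with a toy loop. The abstract describes learned gating modules; the fixed confidence `threshold` and the running average used here are swapped-in simplifications for illustration, and the function name is hypothetical.

```python
def early_exit_predict(frame_scores, threshold):
    """Process frames in order and stop once the averaged prediction is confident.

    frame_scores: non-empty list of per-frame class-probability lists.
    threshold: assumed stand-in for a learned gate; exit when the top averaged
               probability reaches it.
    Returns (predicted_class, frames_used).
    """
    running = [0.0] * len(frame_scores[0])
    for used, scores in enumerate(frame_scores, start=1):
        # Accumulate per-class scores and average over the frames seen so far.
        running = [r + s for r, s in zip(running, scores)]
        avg = [r / used for r in running]
        best = max(range(len(avg)), key=avg.__getitem__)
        if avg[best] >= threshold:
            return best, used  # gate fires: stop early
    return best, used  # no gate fired: all frames were used


easy_video = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
hard_video = [[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]]
easy_result = early_exit_predict(easy_video, threshold=0.7)
hard_result = early_exit_predict(hard_video, threshold=0.7)
```

In this toy setting the confident video exits after one frame, while the ambiguous one consumes all three, mirroring the "fewer frames for simpler videos" behaviour the abstract describes.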
-
Well-indumatched Trees and Graphs of Bounded Girth
Authors:
S. Akbari,
T. Ekim,
A. H. Ghodrati,
S. Zare
Abstract:
A graph G is called well-indumatched if all of its maximal induced matchings have the same size. In this paper we characterize all well-indumatched trees. We provide a linear time algorithm to decide if a tree is well-indumatched or not. Then, we characterize minimal well-indumatched graphs of girth at least 9 and show subsequently that for an odd integer g greater than or equal to 9 and different from 11, there is no well-indumatched graph of girth g. On the other hand, there are infinitely many well-indumatched unicyclic graphs of girth k, where k is in {3, 5, 7} or k is an even integer greater than 2. We also show that, although the recognition of well-indumatched graphs is known to be co-NP-complete in general, one can recognize in polynomial time well-indumatched graphs where the size of maximal induced matchings is fixed.
Submitted 16 December, 2019; v1 submitted 7 March, 2019;
originally announced March 2019.
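The definition in the abstract above (all maximal induced matchings have the same size) can be checked by brute force on small graphs. The sketch below is exponential in the number of edges and is not the linear-time tree algorithm from the paper; the function names are hypothetical.

```python
from itertools import combinations


def is_induced_matching(edges, adj):
    """True if no two chosen edges share a vertex or are joined by an edge."""
    for (a, b), (c, d) in combinations(edges, 2):
        if {a, b} & {c, d}:
            return False
        if any(v in adj[u] for u in (a, b) for v in (c, d)):
            return False
    return True


def maximal_induced_matching_sizes(edge_list):
    """Return the set of sizes of all maximal induced matchings (brute force)."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sizes = set()
    for r in range(len(edge_list) + 1):
        for subset in combinations(edge_list, r):
            if not is_induced_matching(subset, adj):
                continue
            rest = [e for e in edge_list if e not in subset]
            # Maximal: no remaining edge can be added while staying induced.
            if all(not is_induced_matching(subset + (e,), adj) for e in rest):
                sizes.add(r)
    return sizes


def is_well_indumatched(edge_list):
    return len(maximal_induced_matching_sizes(edge_list)) <= 1


path4 = [(0, 1), (1, 2), (2, 3)]          # path on 4 vertices
path5 = [(0, 1), (1, 2), (2, 3), (3, 4)]  # path on 5 vertices
```

On the 4-vertex path every maximal induced matching is a single edge, so the graph is well-indumatched. On the 5-vertex path, `{(1, 2)}` is maximal with size 1 while `{(0, 1), (3, 4)}` has size 2, so it is not.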
-
Video Time: Properties, Encoders and Evaluation
Authors:
Amir Ghodrati,
Efstratios Gavves,
Cees G. M. Snoek
Abstract:
Time-aware encoding of frame sequences in a video is a fundamental problem in video understanding. While many have attempted to model time in videos, an explicit study on quantifying video time is missing. To fill this lacuna, we aim to evaluate video time explicitly. We describe three properties of video time, namely a) temporal asymmetry, b) temporal continuity and c) temporal causality. Based on each property, we formulate a task able to quantify it. This allows assessing the effectiveness of modern video encoders, like C3D and LSTM, in their ability to model time. Our analysis provides insights about existing encoders while also leading us to propose a new video time encoder, which is better suited for the video time recognition tasks than C3D and LSTM. We believe the proposed meta-analysis can provide a reasonable baseline to assess video time encoders on equal grounds on a set of temporal-aware tasks.
Submitted 18 July, 2018;
originally announced July 2018.
-
Actor and Action Video Segmentation from a Sentence
Authors:
Kirill Gavrilyuk,
Amir Ghodrati,
Zhenyang Li,
Cees G. M. Snoek
Abstract:
This paper strives for pixel-level segmentation of actors and their actions in video content. Different from existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This allows us to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage for traditional actor and action segmentation compared to the state-of-the-art.
Submitted 20 March, 2018;
originally announced March 2018.
-
Chromatic Number and Dichromatic Polynomial of Digraphs
Authors:
Saeed Akbari,
Amir Hossein Ghodrati,
Afrouz Jabalameli,
Morteza Saghafian
Abstract:
Let $G$ be a graph of order $n$. It is well-known that $\alpha(G)\geq \sum_{i=1}^n \frac{1}{1+d_i}$, where $\alpha(G)$ is the independence number of $G$ and $d_1,\ldots,d_n$ is the degree sequence of $G$. We extend this result to digraphs by showing that if $D$ is a digraph with $n$ vertices, then $\alpha(D)\geq \sum_{i=1}^n \left( \frac{1}{1+d_i^+} + \frac{1}{1+d_i^-} - \frac{1}{1+d_i}\right)$, where $\alpha(D)$ is the maximum size of an acyclic vertex set of $D$. Golowich proved that for any digraph $D$, $\chi(D)\leq \lceil \frac{4k}{5} \rceil+2$, where $k=\max(\Delta^+(D),\Delta^-(D))$. We give a short and simple proof for this result. Next, we investigate the chromatic number of tournaments and determine the unique tournament such that for every integer $k>1$, the number of proper $k$-colorings of that tournament is maximum among all strongly connected tournaments with the same number of vertices. Also, we find the chromatic polynomial of the strongly connected tournament with the minimum number of cycles.
Submitted 16 November, 2017;
originally announced November 2017.
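The undirected inequality quoted at the start of the abstract above, that the independence number is at least the sum of $\frac{1}{1+d_i}$ over the degree sequence, can be checked numerically on a small graph. The sketch below covers only that well-known undirected bound, not the digraph extension from the paper; the function names are hypothetical and the independence number is found by brute force.

```python
from fractions import Fraction
from itertools import combinations


def degree_bound(n, edges):
    """Sum of 1 / (1 + d_i) over all vertices, as an exact fraction."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(Fraction(1, 1 + d) for d in deg)


def independence_number(n, edges):
    """Largest set of pairwise non-adjacent vertices (brute force, small n only)."""
    edge_set = {frozenset(e) for e in edges}
    for r in range(n, 0, -1):
        for subset in combinations(range(n), r):
            if all(frozenset(pair) not in edge_set for pair in combinations(subset, 2)):
                return r
    return 0


# 5-cycle: every vertex has degree 2.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
bound = degree_bound(5, cycle5)
alpha = independence_number(5, cycle5)
```

For the 5-cycle the bound evaluates to 5/3 while the independence number is 2, consistent with the inequality.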
-
DeepProposals: Hunting Objects and Actions by Cascading Deep Convolutional Layers
Authors:
Amir Ghodrati,
Ali Diba,
Marco Pedersoli,
Tinne Tuytelaars,
Luc Van Gool
Abstract:
In this paper, a new method for generating object and action proposals in images and videos is proposed. It builds on activations of different convolutional layers of a pretrained CNN, combining the localization accuracy of the early layers with the high informativeness (and hence recall) of the later layers. To this end, we build an inverse cascade that, going backward from the later to the earlier convolutional layers of the CNN, selects the most promising locations and refines them in a coarse-to-fine manner. The method is efficient, because i) it re-uses the same features extracted for detection, ii) it aggregates features using integral images, and iii) it avoids a dense evaluation of the proposals thanks to the use of the inverse coarse-to-fine cascade. The method is also accurate. We show that our DeepProposals outperform most of the previously proposed object proposal and action proposal approaches and, when plugged into a CNN-based object detector, produce state-of-the-art detection performance.
Submitted 15 June, 2016;
originally announced June 2016.
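The abstract above mentions aggregating features with integral images. As a minimal sketch of that general technique (a summed-area table over a single 2D feature channel, not the paper's implementation), the sum over any box can be read off with four lookups once the table is built. The function names are hypothetical.

```python
def integral_image(grid):
    """Build a summed-area table with one extra row and column of zeros."""
    h, w = len(grid), len(grid[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += grid[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of grid values over rows top..bottom-1 and columns left..right-1."""
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


features = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
sat = integral_image(features)
whole = box_sum(sat, 0, 0, 3, 3)         # entire grid
lower_right = box_sum(sat, 1, 1, 3, 3)   # 5 + 6 + 8 + 9
```

Because each box query costs a constant number of operations, scoring many candidate windows over the same feature map stays cheap, which is the efficiency argument the abstract makes.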
-
Online Action Detection
Authors:
Roeland De Geest,
Efstratios Gavves,
Amir Ghodrati,
Zhenyang Li,
Cees Snoek,
Tinne Tuytelaars
Abstract:
In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.
Submitted 30 August, 2016; v1 submitted 21 April, 2016;
originally announced April 2016.
-
Rank Pooling for Action Recognition
Authors:
Basura Fernando,
Efstratios Gavves,
Jose Oramas,
Amir Ghodrati,
Tinne Tuytelaars
Abstract:
We propose a function-based temporal pooling method that captures the latent structure of the video sequence data, e.g. how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation. As a specific example, we learn a pooling function via ranking machines. By learning to rank the frame-level features of a video in chronological order, we obtain a new representation that captures the video-wide temporal dynamics of a video, suitable for action recognition. Other than ranking functions, we explore different parametric models that could also explain the temporal changes in videos. The proposed functional pooling methods, and rank pooling in particular, are easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions. We evaluate our method on various benchmarks for generic action, fine-grained action and gesture recognition. Results show that rank pooling brings an absolute improvement of 7-10 over the average pooling baseline. At the same time, rank pooling is compatible with and complementary to several appearance and local motion based methods and features, such as improved trajectories and deep learning features.
Submitted 15 May, 2016; v1 submitted 6 December, 2015;
originally announced December 2015.
-
Towards Automatic Image Editing: Learning to See another You
Authors:
Amir Ghodrati,
Xu Jia,
Marco Pedersoli,
Tinne Tuytelaars
Abstract:
Learning the distribution of images in order to generate new samples is a challenging task due to the high dimensionality of the data and the highly non-linear relations that are involved. Nevertheless, some promising results have been reported in the literature recently, building on deep network architectures. In this work, we zoom in on a specific type of image generation: given an image and knowing the category of objects it belongs to (e.g. faces), our goal is to generate a similar and plausible image, but with some altered attributes. This is particularly challenging, as the model needs to learn to disentangle the effect of each attribute and to apply a desired attribute change to a given input image, while keeping the other attributes and overall object appearance intact. To this end, we learn a convolutional network, where the desired attribute information is encoded then merged with the encoded image at feature map level. We show promising results, both qualitatively as well as quantitatively, in the context of a retrieval experiment, on two face datasets (MultiPie and CAS-PEAL-R1).
Submitted 26 November, 2015;
originally announced November 2015.
-
DeepProposal: Hunting Objects by Cascading Deep Convolutional Layers
Authors:
Amir Ghodrati,
Ali Diba,
Marco Pedersoli,
Tinne Tuytelaars,
Luc Van Gool
Abstract:
In this paper we evaluate the quality of the activation layers of a convolutional neural network (CNN) for the generation of object proposals. We generate hypotheses in a sliding-window fashion over different activation layers and show that the final convolutional layers can find the object of interest with high recall but poor localization due to the coarseness of the feature maps. Instead, the first layers of the network can better localize the object of interest but with a reduced recall. Based on this observation we design a method for proposing object locations that is based on CNN features and that combines the best of both worlds. We build an inverse cascade that, going from the final to the initial convolutional layers of the CNN, selects the most promising object locations and refines their boxes in a coarse-to-fine manner. The method is efficient, because i) it uses the same features extracted for detection, ii) it aggregates features using integral images, and iii) it avoids a dense evaluation of the proposals due to the inverse coarse-to-fine cascade. The method is also accurate; it outperforms most of the previously proposed object proposal approaches and, when plugged into a CNN-based detector, produces state-of-the-art detection performance.
Submitted 15 October, 2015;
originally announced October 2015.