-
MoPEFT: A Mixture-of-PEFTs for the Segment Anything Model
Authors:
Rajat Sahay,
Andreas Savakis
Abstract:
The emergence of foundation models, such as the Segment Anything Model (SAM), has sparked interest in Parameter-Efficient Fine-Tuning (PEFT) methods that tailor these large models to application domains outside their training data. However, different PEFT techniques modify the representation of a model differently, making it a non-trivial task to select the most appropriate method for the domain of interest. We propose a new framework, Mixture-of-PEFTs methods (MoPEFT), that is inspired by traditional Mixture-of-Experts (MoE) methodologies and is utilized for fine-tuning SAM. Our MoPEFT framework incorporates three different PEFT techniques as submodules and dynamically learns to activate the ones that are best suited for a given data-task setup. We test our method on the Segment Anything Model and show that MoPEFT consistently outperforms other fine-tuning methods on the MESS benchmark.
Submitted 30 April, 2024;
originally announced May 2024.
-
LRP-QViT: Mixed-Precision Vision Transformer Quantization via Layer-wise Relevance Propagation
Authors:
Navin Ranjan,
Andreas Savakis
Abstract:
Vision transformers (ViTs) have demonstrated remarkable performance across various visual tasks. However, ViT models suffer from substantial computational and memory requirements, making it challenging to deploy them on resource-constrained platforms. Quantization is a popular approach for reducing model size, but most studies mainly focus on equal bit-width quantization for the entire network, resulting in sub-optimal solutions. While there are a few works on mixed precision quantization (MPQ) for ViTs, they typically rely on search space-based methods or employ mixed precision arbitrarily. In this paper, we introduce LRP-QViT, an explainability-based method for assigning mixed-precision bit allocations to different layers based on their importance during classification. Specifically, to measure the contribution score of each layer in predicting the target class, we employ the Layer-wise Relevance Propagation (LRP) method. LRP assigns local relevance at the output layer and propagates it through all layers, distributing the relevance until it reaches the input layers. These relevance scores serve as indicators for computing the layer contribution score. Additionally, we have introduced a clipped channel-wise quantization aimed at eliminating outliers from post-LayerNorm activations to alleviate severe inter-channel variations. To validate and assess our approach, we employ LRP-QViT across ViT, DeiT, and Swin transformer models on various datasets. Our experimental findings demonstrate that both our fixed-bit and mixed-bit post-training quantization methods surpass existing models in the context of 4-bit and 6-bit quantization.
Submitted 20 January, 2024;
originally announced January 2024.
-
Unknown Sample Discovery for Source Free Open Set Domain Adaptation
Authors:
Chowdhury Sadman Jahan,
Andreas Savakis
Abstract:
Open Set Domain Adaptation (OSDA) aims to adapt a model trained on a source domain to a target domain that undergoes distribution shift and contains samples from novel classes outside the source domain. Source-free OSDA (SF-OSDA) techniques eliminate the need to access source domain samples, but current SF-OSDA methods utilize only the known classes in the target domain for adaptation, and require access to the entire target domain even during inference after adaptation, to make the distinction between known and unknown samples. In this paper, we introduce Unknown Sample Discovery (USD) as an SF-OSDA method that utilizes a temporally ensembled teacher model to conduct known-unknown target sample separation and adapts the student model to the target domain over all classes using co-training and temporal consistency between the teacher and the student. USD promotes Jensen-Shannon distance (JSD) as an effective measure for known-unknown sample separation. Our teacher-student framework significantly reduces error accumulation resulting from imperfect known-unknown sample separation, while curriculum guidance helps to reliably learn the distinction between target known and target unknown subspaces. USD appends the target model with an unknown class node, thus readily classifying a target sample into any of the known or unknown classes in subsequent post-adaptation inference stages. Empirical results show that USD is superior to existing SF-OSDA methods and is competitive with current OSDA models that utilize both source and target domains during adaptation.
Submitted 5 December, 2023;
originally announced December 2023.
-
Curriculum Guided Domain Adaptation in the Dark
Authors:
Chowdhury Sadman Jahan,
Andreas Savakis
Abstract:
Addressing the rising concerns of privacy and security, domain adaptation in the dark aims to adapt a black-box source trained model to an unlabeled target domain without access to any source data or source model parameters. The need for domain adaptation of black-box predictors becomes even more pronounced to protect intellectual property as deep learning based solutions are becoming increasingly commercialized. Current methods distill noisy predictions on the target data obtained from the source model to the target model, and/or separate clean/noisy target samples before adapting using traditional noisy label learning algorithms. However, these methods do not utilize the easy-to-hard learning nature of the clean/noisy data splits. Also, none of the existing methods are end-to-end, and require a separate fine-tuning stage and an initial warmup stage. In this work, we present Curriculum Adaptation for Black-Box (CABB) which provides a curriculum guided adaptation approach to gradually train the target model, first on target data with high confidence (clean) labels, and later on target data with noisy labels. CABB utilizes Jensen-Shannon divergence as a better criterion for clean-noisy sample separation, compared to the traditional criterion of cross entropy loss. Our method utilizes co-training of a dual-branch network to suppress error accumulation resulting from confirmation bias. The proposed approach is end-to-end trainable and does not require any extra finetuning stage, unlike existing methods. Empirical results on standard domain adaptation datasets show that CABB outperforms existing state-of-the-art black-box DA models and is comparable to white-box domain adaptation models.
Submitted 2 August, 2023;
originally announced August 2023.
-
Continual Domain Adaptation on Aerial Images under Gradually Degrading Weather
Authors:
Chowdhury Sadman Jahan,
Andreas Savakis
Abstract:
Domain adaptation (DA) strives to mitigate the domain gap between the source domain where a model is trained, and the target domain where the model is deployed. When a deep learning model is deployed on an aerial platform, it may face gradually degrading weather conditions during operation, leading to widening domain gaps between the training data and the encountered evaluation data. We synthesize two such gradually worsening weather conditions on real images from two existing aerial imagery datasets, generating a total of four benchmark datasets. Under the continual, or test-time adaptation setting, we evaluate three DA models on our datasets: a baseline standard DA model and two continual DA models. In such a setting, the models can access only one small portion, or one batch, of the target data at a time, and adaptation takes place continually over only one epoch of the data. The combination of the constraints of continual adaptation and gradually deteriorating weather conditions provides a practical DA scenario for aerial deployment. Among the evaluated models, we consider both convolutional and transformer architectures for comparison. We discover stability issues during adaptation for existing buffer-fed continual DA methods, and offer gradient normalization as a simple solution to curb training instability.
Submitted 14 August, 2023; v1 submitted 1 August, 2023;
originally announced August 2023.
-
DeepRM: Deep Recurrent Matching for 6D Pose Refinement
Authors:
Alexander Avery,
Andreas Savakis
Abstract:
Precise 6D pose estimation of rigid objects from RGB images is a critical but challenging task in robotics, augmented reality and human-computer interaction. To address this problem, we propose DeepRM, a novel recurrent network architecture for 6D pose refinement. DeepRM leverages initial coarse pose estimates to render synthetic images of target objects. The rendered images are then matched with the observed images to predict a rigid transform for updating the previous pose estimate. This process is repeated to incrementally refine the estimate at each iteration. The DeepRM architecture incorporates LSTM units to propagate information through each refinement step, significantly improving overall performance. In contrast to current 2-stage Perspective-n-Point based solutions, DeepRM is trained end-to-end, and uses a scalable backbone that can be tuned via a single parameter for accuracy and efficiency. During training, a multi-scale optical flow head is added to predict the optical flow between the observed and synthetic images. Optical flow prediction stabilizes the training process, and enforces the learning of features that are relevant to the task of pose estimation. Our results demonstrate that DeepRM achieves state-of-the-art performance on two widely accepted challenging datasets.
Submitted 16 June, 2023; v1 submitted 28 May, 2022;
originally announced May 2022.
-
BAPose: Bottom-Up Pose Estimation with Disentangled Waterfall Representations
Authors:
Bruno Artacho,
Andreas Savakis
Abstract:
We propose BAPose, a novel bottom-up approach that achieves state-of-the-art results for multi-person pose estimation. Our end-to-end trainable framework leverages a disentangled multi-scale waterfall architecture and incorporates adaptive convolutions to infer keypoints more precisely in crowded scenes with occlusions. The multi-scale representations, obtained by the disentangled waterfall module in BAPose, leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on the challenging COCO and CrowdPose datasets demonstrate that BAPose is an efficient and robust framework for multi-person pose estimation, achieving significant improvements on state-of-the-art accuracy.
Submitted 20 December, 2021;
originally announced December 2021.
-
Extreme Face Inpainting with Sketch-Guided Conditional GAN
Authors:
Nilesh Pandey,
Andreas Savakis
Abstract:
Recovering badly damaged face images is a useful yet challenging task, especially in extreme cases where the masked or damaged region is very large. One of the major challenges is the ability of the system to generalize on faces outside the training dataset. We propose to tackle this extreme inpainting task with a conditional Generative Adversarial Network (GAN) that utilizes structural information, such as edges, as a prior condition. Edge information can be obtained from the partially masked image and a structurally similar image or a hand drawing. In our proposed conditional GAN, we pass the conditional input in every layer of the encoder while maintaining consistency in the distributions between the learned weights and the incoming conditional input. We demonstrate the effectiveness of our method with badly damaged face examples.
Submitted 12 May, 2021;
originally announced May 2021.
-
Grassmann Iterative Linear Discriminant Analysis with Proxy Matrix Optimization
Authors:
Navya Nagananda,
Breton Minnehan,
Andreas Savakis
Abstract:
Linear Discriminant Analysis (LDA) is commonly used for dimensionality reduction in pattern recognition and statistics. It is a supervised method that aims to find the most discriminant space of reduced dimension that can be further used for classification. In this work, we present a Grassmann Iterative LDA method (GILDA) that is based on Proxy Matrix Optimization (PMO). PMO makes use of automatic differentiation and stochastic gradient descent (SGD) on the Grassmann manifold to arrive at the optimal projection matrix. Our results show that GILDA outperforms the prevailing manifold optimization method.
Submitted 16 April, 2021;
originally announced April 2021.
-
SiamReID: Confuser Aware Siamese Tracker with Re-identification Feature
Authors:
Abu Md Niamul Taufique,
Andreas Savakis,
Michael Braun,
Daniel Kubacki,
Ethan Dell,
Lei Qian,
Sean M. O'Rourke
Abstract:
Siamese deep-network trackers have received significant attention in recent years due to their real-time speed and state-of-the-art performance. However, Siamese trackers suffer from similar-looking confusers, which are prevalent in aerial imagery and create challenging conditions due to prolonged occlusions where the tracked object re-appears under different pose and illumination. Our work proposes SiamReID, a novel re-identification framework for Siamese trackers that incorporates confuser rejection during prolonged occlusions and is well-suited for aerial tracking. The re-identification feature is trained using both triplet loss and a class balanced loss. Our approach achieves state-of-the-art performance in the UAVDT single object tracking benchmark.
Submitted 15 April, 2021; v1 submitted 8 April, 2021;
originally announced April 2021.
-
Benchmarking Deep Trackers on Aerial Videos
Authors:
Abu Md Niamul Taufique,
Breton Minnehan,
Andreas Savakis
Abstract:
In recent years, deep learning-based visual object trackers have achieved state-of-the-art performance on several visual object tracking benchmarks. However, most tracking benchmarks are focused on ground level videos, whereas aerial tracking presents a new set of challenges. In this paper, we compare ten trackers based on deep learning techniques on four aerial datasets. We choose top performing trackers utilizing different approaches, specifically tracking by detection, discriminative correlation filters, Siamese networks and reinforcement learning. In our experiments, we use a subset of the OTB2015 dataset with aerial style videos; the UAV123 dataset without synthetic sequences; the UAV20L dataset, which contains 20 long sequences; and the DTB70 dataset as our benchmark datasets. We compare the advantages and disadvantages of different trackers in different tracking situations encountered in aerial data. Our findings indicate that the trackers perform significantly worse on aerial datasets compared to standard ground level videos. We attribute this effect to smaller target size, camera motion, significant camera rotation with respect to the target, out of view movement, and clutter in the form of occlusions or similar-looking distractors near the tracked object.
Submitted 23 March, 2021;
originally announced March 2021.
-
Visualization of Deep Transfer Learning In SAR Imagery
Authors:
Abu Md Niamul Taufique,
Navya Nagananda,
Andreas Savakis
Abstract:
Synthetic Aperture Radar (SAR) imagery has diverse applications in land and marine surveillance. Unlike electro-optical (EO) systems, these systems are not affected by weather conditions and can be used in the day and night times. With the growing importance of SAR imagery, it would be desirable if models trained on widely available EO datasets can also be used for SAR images. In this work, we consider transfer learning to leverage deep features from a network trained on an EO ships dataset and generate predictions on SAR imagery. Furthermore, by exploring the network activations in the form of class-activation maps (CAMs), we visualize the transfer learning process to SAR imagery and gain insight on how a deep network interprets a new modality.
Submitted 19 March, 2021;
originally announced March 2021.
-
Automatic Quantification of Facial Asymmetry using Facial Landmarks
Authors:
Abu Md Niamul Taufique,
Andreas Savakis,
Jonathan Leckenby
Abstract:
One-sided facial paralysis causes uneven movements of facial muscles on the sides of the face. Physicians currently assess facial asymmetry in a subjective manner based on their clinical experience. This paper proposes a novel method to provide an objective and quantitative asymmetry score for frontal faces. Our metric has the potential to help physicians for diagnosis as well as monitoring the rehabilitation of patients with one-sided facial paralysis. A deep learning based landmark detection technique is used to estimate style invariant facial landmark points and dense optical flow is used to generate motion maps from a short sequence of frames. Six face regions are considered corresponding to the left and right parts of the forehead, eyes, and mouth. Motion is computed and compared between the left and the right parts of each region of interest to estimate the symmetry score. For testing, asymmetric sequences are synthetically generated from a facial expression dataset. A score equation is developed to quantify symmetry in both symmetric and asymmetric face sequences.
Submitted 19 March, 2021;
originally announced March 2021.
-
ConDA: Continual Unsupervised Domain Adaptation
Authors:
Abu Md Niamul Taufique,
Chowdhury Sadman Jahan,
Andreas Savakis
Abstract:
Domain Adaptation (DA) techniques are important for overcoming the domain shift between the source domain used for training and the target domain where testing takes place. However, current DA methods assume that the entire target domain is available during adaptation, which may not hold in practice. This paper considers a more realistic scenario, where target data become available in smaller batches and adaptation on the entire target domain is not feasible. In our work, we introduce a new, data-constrained DA paradigm where unlabeled target samples are received in batches and adaptation is performed continually. We propose a novel source-free method for continual unsupervised domain adaptation that utilizes a buffer for selective replay of previously seen samples. In our continual DA framework, we selectively mix samples from incoming batches with data stored in a buffer using buffer management strategies and use the combination to incrementally update our model. We evaluate the classification performance of the continual DA approach with state-of-the-art DA methods based on the entire target domain. Our results on three popular DA datasets demonstrate that our method outperforms many existing state-of-the-art DA methods with access to the entire target domain during adaptation.
Submitted 7 April, 2021; v1 submitted 19 March, 2021;
originally announced March 2021.
-
OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation
Authors:
Bruno Artacho,
Andreas Savakis
Abstract:
We propose OmniPose, a single-pass, end-to-end trainable framework, that achieves state-of-the-art results for multi-person pose estimation. Using a novel waterfall module, the OmniPose architecture leverages multi-scale feature representations that increase the effectiveness of backbone feature extractors, without the need for post-processing. OmniPose incorporates contextual information across scales and joint localization with Gaussian heatmap modulation at the multi-scale feature extractor to estimate human pose with state-of-the-art accuracy. The multi-scale representations, obtained by the improved waterfall module in OmniPose, leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on multiple datasets demonstrate that OmniPose, with an improved HRNet backbone and waterfall module, is a robust and efficient architecture for multi-person pose estimation that achieves state-of-the-art results.
Submitted 18 March, 2021;
originally announced March 2021.
-
LABNet: Local Graph Aggregation Network with Class Balanced Loss for Vehicle Re-Identification
Authors:
Abu Md Niamul Taufique,
Andreas Savakis
Abstract:
Vehicle re-identification is an important computer vision task where the objective is to identify a specific vehicle among a set of vehicles seen at various viewpoints. Recent methods based on deep learning utilize a global average pooling layer after the backbone feature extractor; however, this ignores any spatial reasoning on the feature map. In this paper, we propose local graph aggregation on the backbone feature map, to learn associations of local information and hence improve feature learning as well as reduce the effects of partial occlusion and background clutter. Our local graph aggregation network considers spatial regions of the feature map as nodes and builds a local neighborhood graph that performs local feature aggregation before the global average pooling layer. We further utilize a batch normalization layer to improve the system effectiveness. Additionally, we introduce a class balanced loss to compensate for the imbalance in the sample distributions found in the most widely used vehicle re-identification datasets. Finally, we evaluate our method on three popular benchmarks and show that our approach outperforms many state-of-the-art methods.
Submitted 30 January, 2021; v1 submitted 29 November, 2020;
originally announced November 2020.
-
UniPose: Unified Human Pose Estimation in Single Images and Videos
Authors:
Bruno Artacho,
Andreas Savakis
Abstract:
We propose UniPose, a unified framework for human pose estimation, based on our "Waterfall" Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. Current pose estimation methods utilizing standard CNN architectures heavily rely on statistical postprocessing or predefined anchor poses for joint localization. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPose-LSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single person pose detection for both single images and videos.
Submitted 22 January, 2020;
originally announced January 2020.
-
Waterfall Atrous Spatial Pooling Architecture for Efficient Semantic Segmentation
Authors:
Bruno Artacho,
Andreas Savakis
Abstract:
We propose a new efficient architecture for semantic segmentation, based on a "Waterfall" Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multiscale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation obtaining state-of-the-art results with significant reduction in the number of parameters for the Pascal VOC dataset and the Cityscapes dataset.
Submitted 6 December, 2019;
originally announced December 2019.
-
Poly-GAN: Multi-Conditioned GAN for Fashion Synthesis
Authors:
Nilesh Pandey,
Andreas Savakis
Abstract:
We present Poly-GAN, a novel conditional GAN architecture that is motivated by Fashion Synthesis, an application where garments are automatically placed on images of human models at an arbitrary pose. Poly-GAN allows conditioning on multiple inputs and is suitable for many tasks, including image alignment, image stitching, and inpainting. Existing methods have a similar pipeline where three different networks are used to first align garments with the human pose, then perform stitching of the aligned garment and finally refine the results. Poly-GAN is the first instance where a common architecture is used to perform all three tasks. Our novel architecture enforces the conditions at all layers of the encoder and utilizes skip connections from the coarse layers of the encoder to the respective layers of the decoder. Poly-GAN is able to perform a spatial transformation of the garment based on the RGB skeleton of the model at an arbitrary pose. Additionally, Poly-GAN can perform image stitching, regardless of the garment orientation, and inpainting on the garment mask when it contains irregular holes. Our system achieves state-of-the-art quantitative results on Structural Similarity Index metric and Inception Score metric using the DeepFashion dataset.
Submitted 4 September, 2019;
originally announced September 2019.
-
Cascaded Projection: End-to-End Network Compression and Acceleration
Authors:
Breton Minnehan,
Andreas Savakis
Abstract:
We propose a data-driven approach for deep convolutional neural network compression that achieves high accuracy with high throughput and low memory requirements. Current network compression methods either find a low-rank factorization of the features that requires more memory, or select only a subset of features by pruning entire filter channels. We propose the Cascaded Projection (CaP) compression method that projects the output and input filter channels of successive layers to a unified low dimensional space based on a low-rank projection. We optimize the projection to minimize classification loss and the difference between the next layer's features in the compressed and uncompressed networks. To solve this non-convex optimization problem we propose a new optimization method of a proxy matrix using backpropagation and Stochastic Gradient Descent (SGD) with geometric constraints. Our cascaded projection approach leads to improvements in all critical areas of network compression: high accuracy, low memory consumption, low parameter count and high processing speed. The proposed CaP method demonstrates state-of-the-art results compressing VGG16 and ResNet networks with over 4x reduction in the number of computations and excellent performance in top-5 accuracy on the ImageNet dataset before and after fine-tuning.
Submitted 12 March, 2019;
originally announced March 2019.
-
Semantically Invariant Text-to-Image Generation
Authors:
Shagan Sah,
Dheeraj Peri,
Ameya Shringi,
Chi Zhang,
Miguel Dominguez,
Andreas Savakis,
Ray Ptucha
Abstract:
Image captioning has demonstrated models that are capable of generating plausible text given input images or videos. Further, recent work in image generation has shown significant improvements in image quality when text is used as a prior. Our work ties these concepts together by creating an architecture that can enable bidirectional generation of images and text. We call this network Multi-Modal Vector Representation (MMVR). Along with MMVR, we propose two improvements to text-conditioned image generation. First, an n-gram metric-based cost function is introduced that generalizes the caption with respect to the image. Second, multiple semantically similar sentences are shown to help in generating better images. Qualitative and quantitative evaluations demonstrate that MMVR improves upon existing text-conditioned image generation results by over 20%, while integrating visual and text modalities.
Submitted 26 September, 2018;
originally announced September 2018.
-
DEFRAG: Deep Euclidean Feature Representations through Adaptation on the Grassmann Manifold
Authors:
Breton Minnehan,
Andreas Savakis
Abstract:
We propose a novel technique for training deep networks with the objective of obtaining feature representations that exist in a Euclidean space and exhibit strong clustering behavior. Our desired feature representations have three traits: they can be compared using a standard Euclidean distance metric, samples from the same class are tightly clustered, and samples from different classes are well separated. However, most deep networks do not enforce such feature representations. The DEFRAG training technique consists of two steps. First, good feature clustering behavior is encouraged through an auxiliary loss function based on the Silhouette clustering metric. Then, the feature space is retracted onto a Grassmann manifold to ensure that the L_2 norm forms a similarity metric. The DEFRAG technique achieves state-of-the-art results on standard classification datasets using a relatively small network architecture with significantly fewer parameters than many standard networks.
Submitted 20 June, 2018;
originally announced June 2018.
-
Anomaly Detection in Video Using Predictive Convolutional Long Short-Term Memory Networks
Authors:
Jefferson Ryan Medel,
Andreas Savakis
Abstract:
Automating the detection of anomalous events within long video sequences is challenging due to the ambiguity of how such events are defined. We approach the problem by learning generative models that can identify anomalies in videos using limited supervision. We propose end-to-end trainable composite Convolutional Long Short-Term Memory (Conv-LSTM) networks that are able to predict the evolution of a video sequence from a small number of input frames. Regularity scores are derived from the reconstruction errors of a set of predictions, with abnormal video sequences yielding lower regularity scores as they diverge further from the actual sequence over time. The models utilize a composite structure, and we examine the effects of conditioning on learning more meaningful representations. The best model is chosen based on the reconstruction and prediction accuracy. The Conv-LSTM models are evaluated both qualitatively and quantitatively, demonstrating competitive results on anomaly detection datasets. Conv-LSTM units are shown to be an effective tool for modeling and predicting video sequences.
Submitted 15 December, 2016; v1 submitted 1 December, 2016;
originally announced December 2016.