Search | arXiv e-print repository

Unsupervised Restoration of Weather-affected Images using Deep Gaussian Process-based CycleGAN

Authors: Rajeev Yasarla, Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Existing approaches for restoring weather-degraded images follow a fully-supervised paradigm and they require paired data for training. However, collecting paired data for weather degradations is extremely challenging, and existing methods end up training on synthetic data. To overcome this issue, we describe an approach for supervising deep networks that are based on CycleGAN, thereby enabling th… ▽ More Existing approaches for restoring weather-degraded images follow a fully-supervised paradigm and they require paired data for training. However, collecting paired data for weather degradations is extremely challenging, and existing methods end up training on synthetic data. To overcome this issue, we describe an approach for supervising deep networks that are based on CycleGAN, thereby enabling the use of unlabeled real-world data for training. Specifically, we introduce new losses for training CycleGAN that lead to more effective training, resulting in high-quality reconstructions. These new losses are obtained by jointly modeling the latent space embeddings of predicted clean images and original clean images through Deep Gaussian Processes. This enables the CycleGAN architecture to transfer the knowledge from one domain (weather-degraded) to another (clean) more effectively. We demonstrate that the proposed method can be effectively applied to different restoration tasks like de-raining, de-hazing and de-snowing and it outperforms other unsupervised techniques (that leverage weather-based characteristics) by a considerable margin. △ Less

Submitted 22 April, 2022; originally announced April 2022.

Comments: Accepted at ICPR 2022

arXiv:2109.14651 [pdf, other]

Uncertainty-aware Mean Teacher for Source-free Unsupervised Domain Adaptive 3D Object Detection

Authors: Deepti Hegde, Vishwanath Sindagi, Velat Kilic, A. Brinton Cooper, Mark Foster, Vishal Patel

Abstract: Pseudo-label based self training approaches are a popular method for source-free unsupervised domain adaptation. However, their efficacy depends on the quality of the labels generated by the source trained model. These labels may be incorrect with high confidence, rendering thresholding methods ineffective. In order to avoid reinforcing errors caused by label noise, we propose an uncertainty-aware… ▽ More Pseudo-label based self training approaches are a popular method for source-free unsupervised domain adaptation. However, their efficacy depends on the quality of the labels generated by the source trained model. These labels may be incorrect with high confidence, rendering thresholding methods ineffective. In order to avoid reinforcing errors caused by label noise, we propose an uncertainty-aware mean teacher framework which implicitly filters incorrect pseudo-labels during training. Leveraging model uncertainty allows the mean teacher network to perform implicit filtering by down-weighing losses corresponding uncertain pseudo-labels. Effectively, we perform automatic soft-sampling of pseudo-labeled data while aligning predictions from the student and teacher networks. We demonstrate our method on several domain adaptation scenarios, from cross-dataset to cross-weather conditions, and achieve state-of-the-art performance in these cases, on the KITTI lidar target dataset. △ Less

Submitted 29 September, 2021; originally announced September 2021.

arXiv:2107.07004 [pdf, other]

Lidar Light Scattering Augmentation (LISA): Physics-based Simulation of Adverse Weather Conditions for 3D Object Detection

Authors: Velat Kilic, Deepti Hegde, Vishwanath Sindagi, A. Brinton Cooper, Mark A. Foster, Vishal M. Patel

Abstract: Lidar-based object detectors are critical parts of the 3D perception pipeline in autonomous navigation systems such as self-driving cars. However, they are known to be sensitive to adverse weather conditions such as rain, snow and fog due to reduced signal-to-noise ratio (SNR) and signal-to-background ratio (SBR). As a result, lidar-based object detectors trained on data captured in normal weather… ▽ More Lidar-based object detectors are critical parts of the 3D perception pipeline in autonomous navigation systems such as self-driving cars. However, they are known to be sensitive to adverse weather conditions such as rain, snow and fog due to reduced signal-to-noise ratio (SNR) and signal-to-background ratio (SBR). As a result, lidar-based object detectors trained on data captured in normal weather tend to perform poorly in such scenarios. However, collecting and labelling sufficient training data in a diverse range of adverse weather conditions is laborious and prohibitively expensive. To address this issue, we propose a physics-based approach to simulate lidar point clouds of scenes in adverse weather conditions. These augmented datasets can then be used to train lidar-based detectors to improve their all-weather reliability. Specifically, we introduce a hybrid Monte-Carlo based approach that treats (i) the effects of large particles by placing them randomly and comparing their back reflected power against the target, and (ii) attenuation effects on average through calculation of scattering efficiencies from the Mie theory and particle size distributions. Retraining networks with this augmented data improves mean average precision evaluated on real world rainy scenes and we observe greater improvement in performance with our model relative to existing models from the literature. Furthermore, we evaluate recent state-of-the-art detectors on the simulated weather conditions and present an in-depth analysis of their performance. △ Less

Submitted 14 July, 2021; originally announced July 2021.

arXiv:2105.13502 [pdf, other]

Unsupervised Domain Adaptation of Object Detectors: A Survey

Authors: Poojan Oza, Vishwanath A. Sindagi, Vibashan VS, Vishal M. Patel

Abstract: Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. Due to this, model performance drops drastically when evaluated on label-scarce datasets having visually dist… ▽ More Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. Due to this, model performance drops drastically when evaluated on label-scarce datasets having visually distinct images, termed as domain adaptation problem. There is a plethora of works to adapt classification and segmentation models to label-scarce target datasets through unsupervised domain adaptation. Considering that detection is a fundamental task in computer vision, many recent works have focused on develo** novel domain adaptive detection techniques. Here, we describe in detail the domain adaptation problem for detection and present an extensive survey of the various methods. Furthermore, we highlight strategies proposed and the associated shortcomings. Subsequently, we identify multiple aspects of the problem that are most promising for future research. We believe that this survey shall be valuable to the pattern recognition experts working in the fields of computer vision, biometrics, medical imaging, and autonomous navigation by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research. △ Less

Submitted 4 July, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

arXiv:2103.04224 [pdf, other]

MeGA-CDA: Memory Guided Attention for Category-Aware Unsupervised Domain Adaptive Object Detection

Authors: Vibashan VS, Vikram Gupta, Poojan Oza, Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Existing approaches for unsupervised domain adaptive object detection perform feature alignment via adversarial training. While these methods achieve reasonable improvements in performance, they typically perform category-agnostic domain alignment, thereby resulting in negative transfer of features. To overcome this issue, in this work, we attempt to incorporate category information into the domai… ▽ More Existing approaches for unsupervised domain adaptive object detection perform feature alignment via adversarial training. While these methods achieve reasonable improvements in performance, they typically perform category-agnostic domain alignment, thereby resulting in negative transfer of features. To overcome this issue, in this work, we attempt to incorporate category information into the domain adaptation process by proposing Memory Guided Attention for Category-Aware Domain Adaptation (MeGA-CDA). The proposed method consists of employing category-wise discriminators to ensure category-aware feature alignment for learning domain-invariant discriminative features. However, since the category information is not available for the target samples, we propose to generate memory-guided category-specific attention maps which are then used to route the features appropriately to the corresponding category discriminator. The proposed method is evaluated on several benchmark datasets and is shown to outperform existing approaches. △ Less

Submitted 3 April, 2021; v1 submitted 6 March, 2021; originally announced March 2021.

Comments: Accepted to CVPR 2021

arXiv:2010.01663 [pdf, other]

KiU-Net: Overcomplete Convolutional Architectures for Biomedical Image and Volumetric Segmentation

Authors: Jeya Maria Jose Valanarasu, Vishwanath A. Sindagi, Ilker Hacihaliloglu, Vishal M. Patel

Abstract: Most methods for medical image segmentation use U-Net or its variants as they have been successful in most of the applications. After a detailed analysis of these "traditional" encoder-decoder based approaches, we observed that they perform poorly in detecting smaller structures and are unable to segment boundary regions precisely. This issue can be attributed to the increase in receptive field si… ▽ More Most methods for medical image segmentation use U-Net or its variants as they have been successful in most of the applications. After a detailed analysis of these "traditional" encoder-decoder based approaches, we observed that they perform poorly in detecting smaller structures and are unable to segment boundary regions precisely. This issue can be attributed to the increase in receptive field size as we go deeper into the encoder. The extra focus on learning high level features causes the U-Net based approaches to learn less information about low-level features which are crucial for detecting small structures. To overcome this issue, we propose using an overcomplete convolutional architecture where we project our input image into a higher dimension such that we constrain the receptive field from increasing in the deep layers of the network. We design a new architecture for image segmentation- KiU-Net which has two branches: (1) an overcomplete convolutional network Kite-Net which learns to capture fine details and accurate edges of the input, and (2) U-Net which learns high level features. Furthermore, we also propose KiU-Net 3D which is a 3D convolutional architecture for volumetric segmentation. We perform a detailed study of KiU-Net by performing experiments on five different datasets covering various image modalities like ultrasound (US), magnetic resonance imaging (MRI), computed tomography (CT), microscopic and fundus images. The proposed method achieves a better performance as compared to all the recent methods with an additional benefit of fewer parameters and faster convergence. Additionally, we also demonstrate that the extensions of KiU-Net based on residual blocks and dense blocks result in further performance improvements. The implementation of KiU-Net can be found here: https://github.com/jeya-maria-jose/KiU-Net-pytorch △ Less

Submitted 14 October, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

Comments: Journal Extension of KiU-Net (MICCAI-2020)

arXiv:2009.13075 [pdf, other]

Semi-Supervised Image Deraining using Gaussian Processes

Authors: Rajeev Yasarla, V. A. Sindagi, V. M. Patel

Abstract: Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data an… ▽ More Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data and hence, generalize poorly to real-world images. The use of real-world data in training image deraining networks is relatively less explored in the literature. We propose a Gaussian Process-based semi-supervised learning framework which enables the network in learning to derain using synthetic dataset while generalizing better using unlabeled real-world images. More specifically, we model the latent space vectors of unlabeled data using Gaussian Processes, which is then used to compute pseudo-ground-truth for supervising the network on unlabeled data. Through extensive experiments and ablations on several challenging datasets (such as Rain800, Rain200L and DDN-SIRR), we show that the proposed method is able to effectively leverage unlabeled data thereby resulting in significantly better performance as compared to labeled-only training. Additionally, we demonstrate that using unlabeled real-world images in the proposed GP-based framework results △ Less

Submitted 25 September, 2020; originally announced September 2020.

Comments: arXiv admin note: substantial text overlap with 2006.05580

arXiv:2009.06420 [pdf, other]

Completely Self-Supervised Crowd Counting via Distribution Matching

Authors: Deepak Babu Sam, Abhinav Agarwalla, Jimmy Joseph, Vishwanath A. Sindagi, R. Venkatesh Babu, Vishal M. Patel

Abstract: Dense crowd counting is a challenging task that demands millions of head annotations for training models. Though existing self-supervised approaches could learn good representations, they require some labeled data to map these features to the end task of density estimation. We mitigate this issue with the proposed paradigm of complete self-supervision, which does not need even a single labeled ima… ▽ More Dense crowd counting is a challenging task that demands millions of head annotations for training models. Though existing self-supervised approaches could learn good representations, they require some labeled data to map these features to the end task of density estimation. We mitigate this issue with the proposed paradigm of complete self-supervision, which does not need even a single labeled image. The only input required to train, apart from a large set of unlabeled crowd images, is the approximate upper limit of the crowd count for the given dataset. Our method dwells on the idea that natural crowds follow a power law distribution, which could be leveraged to yield error signals for backpropagation. A density regressor is first pretrained with self-supervision and then the distribution of predictions is matched to the prior by optimizing Sinkhorn distance between the two. Experiments show that this results in effective learning of crowd features and delivers significant counting performance. Furthermore, we establish the superiority of our method in less data setting as well. The code and models for our approach is available at https://github.com/val-iisc/css-ccnn. △ Less

Submitted 14 September, 2020; originally announced September 2020.

arXiv:2007.03195 [pdf, other]

Learning to Count in the Crowd from Limited Labeled Data

Authors: Vishwanath A. Sindagi, Rajeev Yasarla, Deepak Sam Babu, R. Venkatesh Babu, Vishal M. Patel

Abstract: Recent crowd counting approaches have achieved excellent performance. However, they are essentially based on fully supervised paradigm and require large number of annotated samples. Obtaining annotations is an expensive and labour-intensive process. In this work, we focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples while leveraging a… ▽ More Recent crowd counting approaches have achieved excellent performance. However, they are essentially based on fully supervised paradigm and require large number of annotated samples. Obtaining annotations is an expensive and labour-intensive process. In this work, we focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples while leveraging a large pool of unlabeled data. Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data, which is then used as supervision for training the network. The proposed method is shown to be effective under the reduced data (semi-supervised) settings for several datasets like ShanghaiTech, UCF-QNRF, WorldExpo, UCSD, etc. Furthermore, we demonstrate that the proposed method can be leveraged to enable the network in learning to count from synthetic dataset while being able to generalize better to real-world datasets (synthetic-to-real transfer). △ Less

Submitted 8 July, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

Comments: Accepted at ECCV 2020

arXiv:2006.05580 [pdf, other]

Syn2Real Transfer Learning for Image Deraining using Gaussian Processes

Authors: Rajeev Yasarla, Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data an… ▽ More Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data and hence, generalize poorly to real-world images. The use of real-world data in training image deraining networks is relatively less explored in the literature. We propose a Gaussian Process-based semi-supervised learning framework which enables the network in learning to derain using synthetic dataset while generalizing better using unlabeled real-world images. Through extensive experiments and ablations on several challenging datasets (such as Rain800, Rain200H and DDN-SIRR), we show that the proposed method, when trained on limited labeled data, achieves on-par performance with fully-labeled training. Additionally, we demonstrate that using unlabeled real-world images in the proposed GP-based framework results in superior performance as compared to existing methods. Code is available at: https://github.com/rajeevyasarla/Syn2Real △ Less

Submitted 9 June, 2020; originally announced June 2020.

Comments: Accepted at CVPR 2020

arXiv:2006.04878 [pdf, other]

KiU-Net: Towards Accurate Segmentation of Biomedical Images using Over-complete Representations

Authors: Jeya Maria Jose, Vishwanath Sindagi, Ilker Hacihaliloglu, Vishal M. Patel

Abstract: Due to its excellent performance, U-Net is the most widely used backbone architecture for biomedical image segmentation in the recent years. However, in our studies, we observe that there is a considerable performance drop in the case of detecting smaller anatomical landmarks with blurred noisy boundaries. We analyze this issue in detail, and address it by proposing an over-complete architecture (… ▽ More Due to its excellent performance, U-Net is the most widely used backbone architecture for biomedical image segmentation in the recent years. However, in our studies, we observe that there is a considerable performance drop in the case of detecting smaller anatomical landmarks with blurred noisy boundaries. We analyze this issue in detail, and address it by proposing an over-complete architecture (Ki-Net) which involves projecting the data onto higher dimensions (in the spatial sense). This network, when augmented with U-Net, results in significant improvements in the case of segmenting small anatomical landmarks and blurred noisy boundaries while obtaining better overall performance. Furthermore, the proposed network has additional benefits like faster convergence and fewer number of parameters. We evaluate the proposed method on the task of brain anatomy segmentation from 2D Ultrasound (US) of preterm neonates, and achieve an improvement of around 4% in terms of the DICE accuracy and Jaccard index as compared to the standard-U-Net, while outperforming the recent best methods by 2%. Code: https://github.com/jeya-maria-jose/KiU-Net-pytorch . △ Less

Submitted 8 July, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: Accepted at MICCAI 2020

arXiv:2004.03597 [pdf, other]

JHU-CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method

Authors: Vishwanath A. Sindagi, Rajeev Yasarla, Vishal M. Patel

Abstract: Due to its variety of applications in the real-world, the task of single image-based crowd counting has received a lot of interest in the recent years. Recently, several approaches have been proposed to address various problems encountered in crowd counting. These approaches are essentially based on convolutional neural networks that require large amounts of data to train the network parameters. C… ▽ More Due to its variety of applications in the real-world, the task of single image-based crowd counting has received a lot of interest in the recent years. Recently, several approaches have been proposed to address various problems encountered in crowd counting. These approaches are essentially based on convolutional neural networks that require large amounts of data to train the network parameters. Considering this, we introduce a new large scale unconstrained crowd counting dataset (JHU-CROWD++) that contains "4,372" images with "1.51 million" annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions. Specifically, the dataset includes several images with weather-based degradations and illumination variations, making it a very challenging dataset. Additionally, the dataset consists of a rich set of annotations at both image-level and head-level. Several recent methods are evaluated and compared on this dataset. The dataset can be downloaded from http://www.crowd-counting.com . Furthermore, we propose a novel crowd counting network that progressively generates crowd density maps via residual error estimation. The proposed method uses VGG16 as the backbone network and employs density map generated by the final layer as a coarse prediction to refine and generate finer density maps in a progressive fashion using residual learning. Additionally, the residual learning is guided by an uncertainty-based confidence weighting mechanism that permits the flow of only high-confidence residuals in the refinement path. The proposed Confidence Guided Deep Residual Counting Network (CG-DRCN) is evaluated on recent complex datasets, and it achieves significant improvements in errors. △ Less

Submitted 2 November, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Comments: Accepted at T-PAMI 2020. The dataset can be downloaded from http://www.crowd-counting.com. arXiv admin note: substantial text overlap with arXiv:1910.12384

arXiv:1912.00070 [pdf, other]

Prior-based Domain Adaptive Object Detection for Hazy and Rainy Conditions

Authors: Vishwanath A. Sindagi, Poojan Oza, Rajeev Yasarla, Vishal M. Patel

Abstract: Adverse weather conditions such as haze and rain corrupt the quality of captured images, which cause detection networks trained on clean images to perform poorly on these images. To address this issue, we propose an unsupervised prior-based domain adversarial object detection framework for adapting the detectors to hazy and rainy conditions. In particular, we use weather-specific prior knowledge o… ▽ More Adverse weather conditions such as haze and rain corrupt the quality of captured images, which cause detection networks trained on clean images to perform poorly on these images. To address this issue, we propose an unsupervised prior-based domain adversarial object detection framework for adapting the detectors to hazy and rainy conditions. In particular, we use weather-specific prior knowledge obtained using the principles of image formation to define a novel prior-adversarial loss. The prior-adversarial loss used to train the adaptation process aims to reduce the weather-specific information in the features, thereby mitigating the effects of weather on the detection performance. Additionally, we introduce a set of residual feature recovery blocks in the object detection pipeline to de-distort the feature space, resulting in further improvements. Evaluations performed on various datasets (Foggy-Cityscapes, Rainy-Cityscapes, RTTS and UFDD) for rainy and hazy conditions demonstrates the effectiveness of the proposed approach. △ Less

Submitted 15 July, 2020; v1 submitted 29 November, 2019; originally announced December 2019.

Comments: Accepted at ECCV 2020

arXiv:1910.12384 [pdf, other]

Pushing the Frontiers of Unconstrained Crowd Counting: New Dataset and Benchmark Method

Authors: Vishwanath A. Sindagi, Rajeev Yasarla, Vishal M. Patel

Abstract: In this work, we propose a novel crowd counting network that progressively generates crowd density maps via residual error estimation. The proposed method uses VGG16 as the backbone network and employs density map generated by the final layer as a coarse prediction to refine and generate finer density maps in a progressive fashion using residual learning. Additionally, the residual learning is gui… ▽ More In this work, we propose a novel crowd counting network that progressively generates crowd density maps via residual error estimation. The proposed method uses VGG16 as the backbone network and employs density map generated by the final layer as a coarse prediction to refine and generate finer density maps in a progressive fashion using residual learning. Additionally, the residual learning is guided by an uncertainty-based confidence weighting mechanism that permits the flow of only high-confidence residuals in the refinement path. The proposed Confidence Guided Deep Residual Counting Network (CG-DRCN) is evaluated on recent complex datasets, and it achieves significant improvements in errors. Furthermore, we introduce a new large scale unconstrained crowd counting dataset (JHU-CROWD) that is ~2.8 larger than the most recent crowd counting datasets in terms of the number of images. It contains 4,250 images with 1.11 million annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions. Specifically, the dataset includes several images with weather-based degradations and illumination variations in addition to many distractor images, making it a very challenging dataset. Additionally, the dataset consists of rich annotations at both image-level and head-level. Several recent methods are evaluated and compared on this dataset. △ Less

Submitted 27 October, 2019; originally announced October 2019.

Comments: Accepted at ICCV 2019

arXiv:1908.10937 [pdf, other]

Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting

Authors: Vishwanath A Sindagi, Vishal M. Patel

Abstract: Crowd counting presents enormous challenges in the form of large variation in scales within images and across the dataset. These issues are further exacerbated in highly congested scenes. Approaches based on straightforward fusion of multi-scale features from a deep network seem to be obvious solutions to this problem. However, these fusion approaches do not yield significant improvements in the c… ▽ More Crowd counting presents enormous challenges in the form of large variation in scales within images and across the dataset. These issues are further exacerbated in highly congested scenes. Approaches based on straightforward fusion of multi-scale features from a deep network seem to be obvious solutions to this problem. However, these fusion approaches do not yield significant improvements in the case of crowd counting in congested scenes. This is usually due to their limited abilities in effectively combining the multi-scale features for problems like crowd counting. To overcome this, we focus on how to efficiently leverage information present in different layers of the network. Specifically, we present a network that involves: (i) a multi-level bottom-top and top-bottom fusion (MBTTBF) method to combine information from shallower to deeper layers and vice versa at multiple levels, (ii) scale complementary feature extraction blocks (SCFB) involving cross-scale residual functions to explicitly enable flow of complementary features from adjacent conv layers along the fusion paths. Furthermore, in order to increase the effectiveness of the multi-scale fusion, we employ a principled way of generating scale-aware ground-truth density maps for training. Experiments conducted on three datasets that contain highly congested scenes (ShanghaiTech, UCF_CC_50, and UCF-QNRF) demonstrate that the proposed method is able to outperform several recent methods in all the datasets. △ Less

Submitted 28 August, 2019; originally announced August 2019.

Comments: Accepted at ICCV 2019

arXiv:1907.10255 [pdf, other]

doi 10.1109/TIP.2019.2928634

HA-CCN: Hierarchical Attention-based Crowd Counting Network

Authors: Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Single image-based crowd counting has recently witnessed increased focus, but many leading methods are far from optimal, especially in highly congested scenes. In this paper, we present Hierarchical Attention-based Crowd Counting Network (HA-CCN) that employs attention mechanisms at various levels to selectively enhance the features of the network. The proposed method, which is based on the VGG16… ▽ More Single image-based crowd counting has recently witnessed increased focus, but many leading methods are far from optimal, especially in highly congested scenes. In this paper, we present Hierarchical Attention-based Crowd Counting Network (HA-CCN) that employs attention mechanisms at various levels to selectively enhance the features of the network. The proposed method, which is based on the VGG16 network, consists of a spatial attention module (SAM) and a set of global attention modules (GAM). SAM enhances low-level features in the network by infusing spatial segmentation information, whereas the GAM focuses on enhancing channel-wise information in the higher level layers. The proposed method is a single-step training framework, simple to implement and achieves state-of-the-art results on different datasets. Furthermore, we extend the proposed counting network by introducing a novel set-up to adapt the network to different scenes and datasets via weak supervision using image-level labels. This new set up reduces the burden of acquiring labour intensive point-wise annotations for new datasets while improving the cross-dataset performance. △ Less

Submitted 24 July, 2019; originally announced July 2019.

Comments: Accepted for publication at IEEE Transactions on Image Processing (TIP) 2019

arXiv:1907.01193 [pdf, other]

Inverse Attention Guided Deep Crowd Counting Network

Authors: Vishwanath A. Sindagi, Vishal M. Patel

Abstract: In this paper, we address the challenging problem of crowd counting in congested scenes. Specifically, we present Inverse Attention Guided Deep Crowd Counting Network (IA-DCCN) that efficiently infuses segmentation information through an inverse attention mechanism into the counting network, resulting in significant improvements. The proposed method, which is based on VGG-16, is a single-step trai… ▽ More In this paper, we address the challenging problem of crowd counting in congested scenes. Specifically, we present Inverse Attention Guided Deep Crowd Counting Network (IA-DCCN) that efficiently infuses segmentation information through an inverse attention mechanism into the counting network, resulting in significant improvements. The proposed method, which is based on VGG-16, is a single-step training framework and is simple to implement. The use of segmentation information results in minimal computational overhead and does not require any additional annotations. We demonstrate the significance of segmentation guided inverse attention through a detailed analysis and ablation study. Furthermore, the proposed method is evaluated on three challenging crowd counting datasets and is shown to achieve significant improvements over several recent methods. △ Less

Submitted 21 July, 2019; v1 submitted 2 July, 2019; originally announced July 2019.

Comments: Accepted at 16th IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS) 2019

arXiv:1904.01649 [pdf, other]

MVX-Net: Multimodal VoxelNet for 3D Object Detection

Authors: Vishwanath A. Sindagi, Yin Zhou, Oncel Tuzel

Abstract: Many recent works on 3D object detection have focused on designing neural network architectures that can consume point cloud data. While these approaches demonstrate encouraging performance, they are typically based on a single modality and are unable to leverage information from other modalities, such as a camera. Although a few approaches fuse data from different modalities, these methods either… ▽ More Many recent works on 3D object detection have focused on designing neural network architectures that can consume point cloud data. While these approaches demonstrate encouraging performance, they are typically based on a single modality and are unable to leverage information from other modalities, such as a camera. Although a few approaches fuse data from different modalities, these methods either use a complicated pipeline to process the modalities sequentially, or perform late-fusion and are unable to learn interaction between different modalities at early stages. In this work, we present PointFusion and VoxelFusion: two simple yet effective early-fusion approaches to combine the RGB and point cloud modalities, by leveraging the recently introduced VoxelNet architecture. Evaluation on the KITTI dataset demonstrates significant improvements in performance over approaches which only use point cloud data. Furthermore, the proposed method provides results competitive with the state-of-the-art multimodal algorithms, achieving top-2 ranking in five of the six bird's eye view and 3D detection categories on the KITTI benchmark, by using a simple single stage network. △ Less

Submitted 2 April, 2019; originally announced April 2019.

Comments: 7 pages

Journal ref: International Conference on Robotics and Automation (ICRA), 2019

arXiv:1901.05375 [pdf, other]

DAFE-FD: Density Aware Feature Enrichment for Face Detection

Authors: Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Recent research on face detection, which is focused primarily on improving accuracy of detecting smaller faces, attempt to develop new anchor design strategies to facilitate increased overlap between anchor boxes and ground truth faces of smaller sizes. In this work, we approach the problem of small face detection with the motivation of enriching the feature maps using a density map estimation mod… ▽ More Recent research on face detection, which is focused primarily on improving accuracy of detecting smaller faces, attempt to develop new anchor design strategies to facilitate increased overlap between anchor boxes and ground truth faces of smaller sizes. In this work, we approach the problem of small face detection with the motivation of enriching the feature maps using a density map estimation module. This module, inspired by recent crowd counting/density estimation techniques, performs the task of estimating the per pixel density of people/faces present in the image. Output of this module is employed to accentuate the feature maps from the backbone network using a feature enrichment module before being used for detecting smaller faces. The proposed approach can be used to complement recent anchor-design based novel methods to further improve their results. Experiments conducted on different datasets such as WIDER, FDDB and Pascal-Faces demonstrate the effectiveness of the proposed approach. △ Less

Submitted 16 January, 2019; originally announced January 2019.

arXiv:1804.10275 [pdf, other]

Pushing the Limits of Unconstrained Face Detection: a Challenge Dataset and Baseline Results

Authors: Hajime Nada, Vishwanath A. Sindagi, He Zhang, Vishal M. Patel

Abstract: Face detection has witnessed immense progress in the last few years, with new milestones being surpassed every year. While many challenges such as large variations in scale, pose, appearance are successfully addressed, there still exist several issues which are not specifically captured by existing methods or datasets. In this work, we identify the next set of challenges that requires attention fr… ▽ More Face detection has witnessed immense progress in the last few years, with new milestones being surpassed every year. While many challenges such as large variations in scale, pose, appearance are successfully addressed, there still exist several issues which are not specifically captured by existing methods or datasets. In this work, we identify the next set of challenges that requires attention from the research community and collect a new dataset of face images that involve these issues such as weather-based degradations, motion blur, focus blur and several others. We demonstrate that there is a considerable gap in the performance of state-of-the-art detectors and real-world requirements. Hence, in an attempt to fuel further research in unconstrained face detection, we present a new annotated Unconstrained Face Detection Dataset (UFDD) with several challenges and benchmark recent methods. Additionally, we provide an in-depth analysis of the results and failure cases of these methods. The dataset as well as baseline results will be made publicly available in due time. The UFDD dataset as well as baseline results are available at: www.ufdd.info/ △ Less

Submitted 8 August, 2018; v1 submitted 26 April, 2018; originally announced April 2018.

Comments: Accepted in BTAS'2018

arXiv:1710.10182 [pdf, other]

High-Quality Facial Photo-Sketch Synthesis Using Multi-Adversarial Networks

Authors: Lidan Wang, Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Synthesizing face sketches from real photos and its inverse have many applications. However, photo/sketch synthesis remains a challenging problem due to the fact that photo and sketch have different characteristics. In this work, we consider this task as an image-to-image translation problem and explore the recently popular generative models (GANs) to generate high-quality realistic photos from sk… ▽ More Synthesizing face sketches from real photos and its inverse have many applications. However, photo/sketch synthesis remains a challenging problem due to the fact that photo and sketch have different characteristics. In this work, we consider this task as an image-to-image translation problem and explore the recently popular generative models (GANs) to generate high-quality realistic photos from sketches and sketches from photos. Recent GAN-based methods have shown promising results on image-to-image translation problems and photo-to-sketch synthesis in particular, however, they are known to have limited abilities in generating high-resolution realistic images. To this end, we propose a novel synthesis framework called Photo-Sketch Synthesis using Multi-Adversarial Networks, (PS2-MAN) that iteratively generates low resolution to high resolution images in an adversarial way. The hidden layers of the generator are supervised to first generate lower resolution images followed by implicit refinement in the network to generate higher resolution images. Furthermore, since photo-sketch synthesis is a coupled/paired translation problem, we leverage the pair information using CycleGAN framework. Both Image Quality Assessment (IQA) and Photo-Sketch Matching experiments are conducted to demonstrate the superior performance of our framework in comparison to existing state-of-the-art solutions. Code available at: https://github.com/lidan1/PhotoSketchMAN. △ Less

Submitted 2 March, 2018; v1 submitted 27 October, 2017; originally announced October 2017.

Comments: Accepted by 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018)(Oral)

arXiv:1710.00962 [pdf, other]

doi 10.1109/ICPR.2018.8545081

GP-GAN: Gender Preserving GAN for Synthesizing Faces from Landmarks

Authors: Xing Di, Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Facial landmarks constitute the most compressed representation of faces and are known to preserve information such as pose, gender and facial structure present in the faces. Several works exist that attempt to perform high-level face-related analysis tasks based on landmarks. In contrast, in this work, an attempt is made to tackle the inverse problem of synthesizing faces from their respective lan… ▽ More Facial landmarks constitute the most compressed representation of faces and are known to preserve information such as pose, gender and facial structure present in the faces. Several works exist that attempt to perform high-level face-related analysis tasks based on landmarks. In contrast, in this work, an attempt is made to tackle the inverse problem of synthesizing faces from their respective landmarks. The primary aim of this work is to demonstrate that information preserved by landmarks (gender in particular) can be further accentuated by leveraging generative models to synthesize corresponding faces. Though the problem is particularly challenging due to its ill-posed nature, we believe that successful synthesis will enable several applications such as boosting performance of high-level face related tasks using landmark points and performing dataset augmentation. To this end, a novel face-synthesis method known as Gender Preserving Generative Adversarial Network (GP-GAN) that is guided by adversarial loss, perceptual loss and a gender preserving loss is presented. Further, we propose a novel generator sub-network UDeNet for GP-GAN that leverages advantages of U-Net and DenseNet architectures. Extensive experiments and comparison with recent methods are performed to verify the effectiveness of the proposed method. △ Less

Submitted 25 April, 2018; v1 submitted 2 October, 2017; originally announced October 2017.

Comments: 6 pages, 5 figures, this paper is accepted as 2018 24th International Conference on Pattern Recognition (ICPR2018)

arXiv:1708.00953 [pdf, other]

Generating High-Quality Crowd Density Maps using Contextual Pyramid CNNs

Authors: Vishwanath A. Sindagi, Vishal M. Patel

Abstract: We present a novel method called Contextual Pyramid CNN (CP-CNN) for generating high-quality crowd density and count estimation by explicitly incorporating global and local contextual information of crowd images. The proposed CP-CNN consists of four modules: Global Context Estimator (GCE), Local Context Estimator (LCE), Density Map Estimator (DME) and a Fusion-CNN (F-CNN). GCE is a VGG-16 based CN… ▽ More We present a novel method called Contextual Pyramid CNN (CP-CNN) for generating high-quality crowd density and count estimation by explicitly incorporating global and local contextual information of crowd images. The proposed CP-CNN consists of four modules: Global Context Estimator (GCE), Local Context Estimator (LCE), Density Map Estimator (DME) and a Fusion-CNN (F-CNN). GCE is a VGG-16 based CNN that encodes global context and it is trained to classify input images into different density classes, whereas LCE is another CNN that encodes local context information and it is trained to perform patch-wise classification of input images into different density classes. DME is a multi-column architecture-based CNN that aims to generate high-dimensional feature maps from the input image which are fused with the contextual information estimated by GCE and LCE using F-CNN. To generate high resolution and high-quality density maps, F-CNN uses a set of convolutional and fractionally-strided convolutional layers and it is trained along with the DME in an end-to-end fashion using a combination of adversarial loss and pixel-level Euclidean loss. Extensive experiments on highly challenging datasets show that the proposed method achieves significant improvements over the state-of-the-art methods. △ Less

Submitted 2 August, 2017; originally announced August 2017.

Comments: Accepted at ICCV 2017

arXiv:1708.00581 [pdf, other]

Joint Transmission Map Estimation and Dehazing using Deep Networks

Authors: He Zhang, Vishwanath Sindagi, Vishal M. Patel

Abstract: Single image haze removal is an extremely challenging problem due to its inherent ill-posed nature. Several prior-based and learning-based methods have been proposed in the literature to solve this problem and they have achieved superior results. However, most of the existing methods assume constant atmospheric light model and tend to follow a two-step procedure involving prior-based methods for e… ▽ More Single image haze removal is an extremely challenging problem due to its inherent ill-posed nature. Several prior-based and learning-based methods have been proposed in the literature to solve this problem and they have achieved superior results. However, most of the existing methods assume constant atmospheric light model and tend to follow a two-step procedure involving prior-based methods for estimating transmission map followed by calculation of dehazed image using the closed form solution. In this paper, we relax the constant atmospheric light assumption and propose a novel unified single image dehazing network that jointly estimates the transmission map and performs dehazing. In other words, our new approach provides an end-to-end learning framework, where the inherent transmission map and dehazed result are learned directly from the loss function. Extensive experiments on synthetic and real datasets with challenging hazy images demonstrate that the proposed method achieves significant improvements over the state-of-the-art methods. △ Less

Submitted 20 April, 2019; v1 submitted 1 August, 2017; originally announced August 2017.

Comments: This paper has been accepted in IEEE-TCSVT

arXiv:1707.09605 [pdf, other]

CNN-based Cascaded Multi-task Learning of High-level Prior and Density Estimation for Crowd Counting

Authors: Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Estimating crowd count in densely crowded scenes is an extremely challenging task due to non-uniform scale variations. In this paper, we propose a novel end-to-end cascaded network of CNNs to jointly learn crowd count classification and density map estimation. Classifying crowd count into various groups is tantamount to coarsely estimating the total count in the image thereby incorporating a high-… ▽ More Estimating crowd count in densely crowded scenes is an extremely challenging task due to non-uniform scale variations. In this paper, we propose a novel end-to-end cascaded network of CNNs to jointly learn crowd count classification and density map estimation. Classifying crowd count into various groups is tantamount to coarsely estimating the total count in the image thereby incorporating a high-level prior into the density estimation network. This enables the layers in the network to learn globally relevant discriminative features which aid in estimating highly refined density maps with lower count error. The joint training is performed in an end-to-end fashion. Extensive experiments on highly challenging publicly available datasets show that the proposed method achieves lower count error and better quality density maps as compared to the recent state-of-the-art methods. △ Less

Submitted 16 August, 2017; v1 submitted 30 July, 2017; originally announced July 2017.

Comments: Accepted at AVSS 2017 (14th International Conference on Advanced Video and Signal Based Surveillance)

arXiv:1707.01202 [pdf, other]

doi 10.1016/j.patrec.2017.07.007

A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation

Authors: Vishwanath A. Sindagi, Vishal M. Patel

Abstract: Estimating count and density maps from crowd images has a wide range of applications such as video surveillance, traffic monitoring, public safety and urban planning. In addition, techniques developed for crowd counting can be applied to related tasks in other fields of study such as cell microscopy, vehicle counting and environmental survey. The task of crowd counting and density map estimation i… ▽ More Estimating count and density maps from crowd images has a wide range of applications such as video surveillance, traffic monitoring, public safety and urban planning. In addition, techniques developed for crowd counting can be applied to related tasks in other fields of study such as cell microscopy, vehicle counting and environmental survey. The task of crowd counting and density map estimation is riddled with many challenges such as occlusions, non-uniform density, intra-scene and inter-scene variations in scale and perspective. Nevertheless, over the last few years, crowd count analysis has evolved from earlier methods that are often limited to small variations in crowd density and scales to the current state-of-the-art methods that have developed the ability to perform successfully on a wide range of scenarios. The success of crowd counting methods in the recent years can be largely attributed to deep learning and publications of challenging datasets. In this paper, we provide a comprehensive survey of recent Convolutional Neural Network (CNN) based approaches that have demonstrated significant improvements over earlier methods that rely largely on hand-crafted representations. First, we briefly review the pioneering methods that use hand-crafted representations and then we delve in detail into the deep learning-based approaches and recently published datasets. Furthermore, we discuss the merits and drawbacks of existing CNN-based approaches and identify promising avenues of research in this rapidly evolving field. △ Less

Submitted 4 July, 2017; originally announced July 2017.

Comments: 16 pages, 17 figures

arXiv:1701.05957 [pdf, other]

Image De-raining Using a Conditional Generative Adversarial Network

Authors: He Zhang, Vishwanath Sindagi, Vishal M. Patel

Abstract: Severe weather conditions such as rain and snow adversely affect the visual quality of images captured under such conditions thus rendering them useless for further usage and sharing. In addition, such degraded images drastically affect performance of vision systems. Hence, it is important to solve the problem of single image de-raining/de-snowing. However, this is a difficult problem to solve due… ▽ More Severe weather conditions such as rain and snow adversely affect the visual quality of images captured under such conditions thus rendering them useless for further usage and sharing. In addition, such degraded images drastically affect performance of vision systems. Hence, it is important to solve the problem of single image de-raining/de-snowing. However, this is a difficult problem to solve due to its inherent ill-posed nature. Existing approaches attempt to introduce prior information to convert it into a well-posed problem. In this paper, we investigate a new point of view in addressing the single image de-raining problem. Instead of focusing only on deciding what is a good prior or a good framework to achieve good quantitative and qualitative performance, we also ensure that the de-rained image itself does not degrade the performance of a given computer vision algorithm such as detection and classification. In other words, the de-rained result should be indistinguishable from its corresponding clear image to a given discriminator. This criterion can be directly incorporated into the optimization framework by using the recently introduced conditional generative adversarial networks (GANs). To minimize artifacts introduced by GANs and ensure better visual quality, a new refined loss function is introduced. Based on this, we propose a novel single image de-raining method called Image De-raining Conditional General Adversarial Network (ID-CGAN), which considers quantitative, visual and also discriminative performance into the objective function. Experiments evaluated on synthetic images and real images show that the proposed method outperforms many recent state-of-the-art single image de-raining methods in terms of quantitative and visual performance. △ Less

Submitted 2 June, 2019; v1 submitted 20 January, 2017; originally announced January 2017.

Showing 1–27 of 27 results for author: Sindagi, V