-
DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions
Authors:
Sanket Kalwar,
Mihir Ungarala,
Shruti Jain,
Aaron Monis,
Krishna Reddy Konda,
Sourav Garg,
K Madhava Krishna
Abstract:
Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in founda…
▽ More
Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in foundation models. Our proposed $\nabla$HFC image processing block excels particularly in adverse weather conditions, where conventional methods often fall short. Furthermore, we investigate the advantages of jointly training visual and latent prompts, demonstrating that this combined approach significantly enhances performance in out-of-distribution scenarios. Our differentiable visual prompts leverage parallel and series architectures to generate prompts, effectively improving object segmentation tasks in adverse conditions. Through a comprehensive series of experiments and evaluations, we provide empirical evidence to support the efficacy of our approach. Project page at https://diffprompter.github.io.
△ Less
Submitted 26 March, 2024; v1 submitted 6 October, 2023;
originally announced October 2023.
-
GDIP: Gated Differentiable Image Processing for Object-Detection in Adverse Conditions
Authors:
Sanket Kalwar,
Dhruv Patel,
Aakash Aanegola,
Krishna Reddy Konda,
Sourav Garg,
K Madhava Krishna
Abstract:
Detecting objects under adverse weather and lighting conditions is crucial for the safe and continuous operation of an autonomous vehicle, and remains an unsolved problem. We present a Gated Differentiable Image Processing (GDIP) block, a domain-agnostic network architecture, which can be plugged into existing object detection networks (e.g., Yolo) and trained end-to-end with adverse condition ima…
▽ More
Detecting objects under adverse weather and lighting conditions is crucial for the safe and continuous operation of an autonomous vehicle, and remains an unsolved problem. We present a Gated Differentiable Image Processing (GDIP) block, a domain-agnostic network architecture, which can be plugged into existing object detection networks (e.g., Yolo) and trained end-to-end with adverse condition images such as those captured under fog and low lighting. Our proposed GDIP block learns to enhance images directly through the downstream object detection loss. This is achieved by learning parameters of multiple image pre-processing (IP) techniques that operate concurrently, with their outputs combined using weights learned through a novel gating mechanism. We further improve GDIP through a multi-stage guidance procedure for progressive image enhancement. Finally, trading off accuracy for speed, we propose a variant of GDIP that can be used as a regularizer for training Yolo, which eliminates the need for GDIP-based image enhancement during inference, resulting in higher throughput and plausible real-world deployment. We demonstrate significant improvement in detection performance over several state-of-the-art methods through quantitative and qualitative studies on synthetic datasets such as PascalVOC, and real-world foggy (RTTS) and low-lighting (ExDark) datasets.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Contextual road lane and symbol generation for autonomous driving
Authors:
Ajay Soni,
Pratik Padamwar,
Krishna Reddy Konda
Abstract:
In this paper we present a novel approach for lane detection and segmentation using generative models. Traditionally discriminative models have been employed to classify pixels semantically on a road. We model the probability distribution of lanes and road symbols by training a generative adversarial network. Based on the learned probability distribution, context-aware lanes and road signs are gen…
▽ More
In this paper we present a novel approach for lane detection and segmentation using generative models. Traditionally discriminative models have been employed to classify pixels semantically on a road. We model the probability distribution of lanes and road symbols by training a generative adversarial network. Based on the learned probability distribution, context-aware lanes and road signs are generated for a given image which are further quantized for nearest class label. Proposed method has been tested on BDD100K and Baidu ApolloScape datasets and performs better than state of the art and exhibits robustness to adverse conditions by generating lanes in faded out and occluded scenarios.
△ Less
Submitted 18 January, 2022;
originally announced January 2022.
-
Only sparsity based loss function for learning representations
Authors:
Vivek Bakaraju,
Kishore Reddy Konda
Abstract:
We study the emergence of sparse representations in neural networks. We show that in unsupervised models with regularization, the emergence of sparsity is the result of the input data samples being distributed along highly non-linear or discontinuous manifold. We also derive a similar argument for discriminatively trained networks and present experiments to support this hypothesis. Based on our st…
▽ More
We study the emergence of sparse representations in neural networks. We show that in unsupervised models with regularization, the emergence of sparsity is the result of the input data samples being distributed along highly non-linear or discontinuous manifold. We also derive a similar argument for discriminatively trained networks and present experiments to support this hypothesis. Based on our study of sparsity, we introduce a new loss function which can be used as regularization term for models like autoencoders and MLPs. Further, the same loss function can also be used as a cost function for an unsupervised single-layered neural network model for learning efficient representations.
△ Less
Submitted 7 March, 2019;
originally announced March 2019.
-
Building effective deep neural network architectures one feature at a time
Authors:
Martin Mundt,
Tobias Weis,
Kishore Konda,
Visvanathan Ramesh
Abstract:
Successful training of convolutional neural networks is often associated with sufficiently deep architectures composed of high amounts of features. These networks typically rely on a variety of regularization and pruning techniques to converge to less redundant states. We introduce a novel bottom-up approach to expand representations in fixed-depth architectures. These architectures start from jus…
▽ More
Successful training of convolutional neural networks is often associated with sufficiently deep architectures composed of high amounts of features. These networks typically rely on a variety of regularization and pruning techniques to converge to less redundant states. We introduce a novel bottom-up approach to expand representations in fixed-depth architectures. These architectures start from just a single feature per layer and greedily increase width of individual layers to attain effective representational capacities needed for a specific task. While network growth can rely on a family of metrics, we propose a computationally efficient version based on feature time evolution and demonstrate its potency in determining feature importance and a networks' effective capacity. We demonstrate how automatically expanded architectures converge to similar topologies that benefit from lesser amount of parameters or improved accuracy and exhibit systematic correspondence in representational complexity with the specified task. In contrast to conventional design patterns with a typical monotonic increase in the amount of features with increased depth, we observe that CNNs perform better when there is more learnable parameters in intermediate, with falloffs to earlier and later layers.
△ Less
Submitted 19 October, 2017; v1 submitted 18 May, 2017;
originally announced May 2017.
-
How far can we go without convolution: Improving fully-connected networks
Authors:
Zhouhan Lin,
Roland Memisevic,
Kishore Konda
Abstract:
We propose ways to improve the performance of fully connected networks. We found that two approaches in particular have a strong effect on performance: linear bottleneck layers and unsupervised pre-training using autoencoders without hidden unit biases. We show how both approaches can be related to improving gradient flow and reducing sparsity in the network. We show that a fully connected network…
▽ More
We propose ways to improve the performance of fully connected networks. We found that two approaches in particular have a strong effect on performance: linear bottleneck layers and unsupervised pre-training using autoencoders without hidden unit biases. We show how both approaches can be related to improving gradient flow and reducing sparsity in the network. We show that a fully connected network can yield approximately 70% classification accuracy on the permutation-invariant CIFAR-10 task, which is much higher than the current state-of-the-art. By adding deformations to the training data, the fully connected network achieves 78% accuracy, which is just 10% short of a decent convolutional network.
△ Less
Submitted 9 November, 2015;
originally announced November 2015.
-
Dropout as data augmentation
Authors:
Xavier Bouthillier,
Kishore Konda,
Pascal Vincent,
Roland Memisevic
Abstract:
Dropout is typically interpreted as bagging a large number of models sharing parameters. We show that using dropout in a network can also be interpreted as a kind of data augmentation in the input space without domain knowledge. We present an approach to projecting the dropout noise within a network back into the input space, thereby generating augmented versions of the training data, and we show…
▽ More
Dropout is typically interpreted as bagging a large number of models sharing parameters. We show that using dropout in a network can also be interpreted as a kind of data augmentation in the input space without domain knowledge. We present an approach to projecting the dropout noise within a network back into the input space, thereby generating augmented versions of the training data, and we show that training a deterministic network on the augmented samples yields similar results. Finally, we propose a new dropout noise scheme based on our observations and show that it improves dropout results without adding significant computational cost.
△ Less
Submitted 7 January, 2016; v1 submitted 29 June, 2015;
originally announced June 2015.
-
EmoNets: Multimodal deep learning approaches for emotion recognition in video
Authors:
Samira Ebrahimi Kahou,
Xavier Bouthillier,
Pascal Lamblin,
Caglar Gulcehre,
Vincent Michalski,
Kishore Konda,
Sébastien Jean,
Pierre Froumenty,
Yann Dauphin,
Nicolas Boulanger-Lewandowski,
Raul Chandias Ferrari,
Mehdi Mirza,
David Warde-Farley,
Aaron Courville,
Pascal Vincent,
Roland Memisevic,
Christopher Pal,
Yoshua Bengio
Abstract:
The task of the emotion recognition in the wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple…
▽ More
The task of the emotion recognition in the wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based "bag-of-mouths" model, which extracts visual features around the mouth region and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67% on the 2014 dataset.
△ Less
Submitted 29 March, 2015; v1 submitted 5 March, 2015;
originally announced March 2015.
-
Zero-bias autoencoders and the benefits of co-adapting features
Authors:
Kishore Konda,
Roland Memisevic,
David Krueger
Abstract:
Regularized training of an autoencoder typically results in hidden unit biases that take on large negative values. We show that negative biases are a natural result of using a hidden layer whose responsibility is to both represent the input data and act as a selection mechanism that ensures sparsity of the representation. We then show that negative biases impede the learning of data distributions…
▽ More
Regularized training of an autoencoder typically results in hidden unit biases that take on large negative values. We show that negative biases are a natural result of using a hidden layer whose responsibility is to both represent the input data and act as a selection mechanism that ensures sparsity of the representation. We then show that negative biases impede the learning of data distributions whose intrinsic dimensionality is high. We also propose a new activation function that decouples the two roles of the hidden layer and that allows us to learn representations on data with very high intrinsic dimensionality, where standard autoencoders typically fail. Since the decoupled activation function acts like an implicit regularizer, the model can be trained by minimizing the reconstruction error of training data, without requiring any additional regularization.
△ Less
Submitted 8 April, 2015; v1 submitted 13 February, 2014;
originally announced February 2014.
-
Modeling sequential data using higher-order relational features and predictive training
Authors:
Vincent Michalski,
Roland Memisevic,
Kishore Konda
Abstract:
Bi-linear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By minimizing reconstruction error of one frame, given the previous frame, these models learn "map** units" that encode the transformations inherent in a sequence, and thereby learn to encode motion. In this work we extend bi-linear models by introducing "higher…
▽ More
Bi-linear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By minimizing reconstruction error of one frame, given the previous frame, these models learn "map** units" that encode the transformations inherent in a sequence, and thereby learn to encode motion. In this work we extend bi-linear models by introducing "higher-order map** units" that allow us to encode transformations between frames and transformations between transformations.
We show that this makes it possible to encode temporal structure that is more complex and longer-range than the structure captured within standard bi-linear models. We also show that a natural way to train the model is by replacing the commonly used reconstruction objective with a prediction objective which forces the model to correctly predict the evolution of the input multiple steps into the future. Learning can be achieved by back-propagating the multi-step prediction through time. We test the model on various temporal prediction tasks, and show that higher-order map**s and predictive training both yield a significant improvement over bi-linear models in terms of prediction accuracy.
△ Less
Submitted 10 February, 2014;
originally announced February 2014.
-
Unsupervised learning of depth and motion
Authors:
Kishore Konda,
Roland Memisevic
Abstract:
We present a model for the joint estimation of disparity and motion. The model is based on learning about the interrelations between images from multiple cameras, multiple frames in a video, or the combination of both. We show that learning depth and motion cues, as well as their combinations, from data is possible within a single type of architecture and a single type of learning algorithm, by us…
▽ More
We present a model for the joint estimation of disparity and motion. The model is based on learning about the interrelations between images from multiple cameras, multiple frames in a video, or the combination of both. We show that learning depth and motion cues, as well as their combinations, from data is possible within a single type of architecture and a single type of learning algorithm, by using biologically inspired "complex cell" like units, which encode correlations between the pixels across image pairs. Our experimental results show that the learning of depth and motion makes it possible to achieve state-of-the-art performance in 3-D activity analysis, and to outperform existing hand-engineered 3-D motion features by a very large margin.
△ Less
Submitted 16 December, 2013; v1 submitted 12 December, 2013;
originally announced December 2013.
-
Learning to encode motion using spatio-temporal synchrony
Authors:
Kishore Reddy Konda,
Roland Memisevic,
Vincent Michalski
Abstract:
We consider the task of learning to extract motion from videos. To this end, we show that the detection of spatial transformations can be viewed as the detection of synchrony between the image sequence and a sequence of features undergoing the motion we wish to detect. We show that learning about synchrony is possible using very fast, local learning rules, by introducing multiplicative "gating" in…
▽ More
We consider the task of learning to extract motion from videos. To this end, we show that the detection of spatial transformations can be viewed as the detection of synchrony between the image sequence and a sequence of features undergoing the motion we wish to detect. We show that learning about synchrony is possible using very fast, local learning rules, by introducing multiplicative "gating" interactions between hidden units across frames. This makes it possible to achieve competitive performance in a wide variety of motion estimation tasks, using a small fraction of the time required to learn features, and to outperform hand-crafted spatio-temporal features by a large margin. We also show how learning about synchrony can be viewed as performing greedy parameter estimation in the well-known motion energy model.
△ Less
Submitted 10 February, 2014; v1 submitted 13 June, 2013;
originally announced June 2013.