Search | arXiv e-print repository

Merlin: A Vision Language Foundation Model for 3D Computed Tomography

Authors: Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, Christian Bluethgen, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Andrew Johnston , et al. (6 additional authors not shown)

Abstract: Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision la… ▽ More Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin - a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically-relevant evaluations, we assess the efficacy of various network architectures and training strategies to depict that Merlin has favorable performance to existing task-specific baselines. We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 18 pages, 7 figures

arXiv:2406.01042 [pdf, other]

Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

Authors: Fang Li, Hao Zhang, Narendra Ahuja

Abstract: Gaussian Splatting (GS) has significantly elevated scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP and even utilize sparse point clouds generated by COLMAP for initialization, which l… ▽ More Gaussian Splatting (GS) has significantly elevated scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP and even utilize sparse point clouds generated by COLMAP for initialization, which lack accuracy as well are time-consuming. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements, or extreme camera conditions e.g. small translations combined with large rotations. Some studies simultaneously optimize the estimation of camera parameters and scenes, supervised by additional information like depth, optical flow, etc. obtained from off-the-shelf models. Using this unverified information as ground truth can reduce robustness and accuracy, which does frequently occur for long monocular videos (with e.g. > hundreds of frames). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It includes the extraction of 2D point features that robustly represent 3D structure, and their use for subsequent joint optimization of camera parameters and 3D structure towards overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experimental results on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: GitHub Page: https://github.com/fangli333/SC-4DGS

arXiv:2405.18560 [pdf, other]

Potential Field Based Deep Metric Learning

Authors: Shubhang Bhatnagar, Narendra Ahuja

Abstract: Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuplets. We present a novel, compositional DML model, inspired by electrostatic fields in physics that, instead of in tuples, represents the influence of each example (embedding) by a continuous potentia… ▽ More Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuplets. We present a novel, compositional DML model, inspired by electrostatic fields in physics that, instead of in tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks- Cars-196, CUB-200-2011, and SOP datasets where it outperforms state-of-the-art baselines. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.14017 [pdf, other]

MagicPose4D: Crafting Articulated Models with Appearance and Motion Control

Authors: Hao Zhang, Di Chang, Fang Li, Mohammad Soleymani, Narendra Ahuja

Abstract: With the success of 2D and 3D visual generative models, there is growing interest in generating 4D content. Existing methods primarily rely on text prompts to produce 4D content, but they often fall short of accurately defining complex or rare motions. To address this limitation, we propose MagicPose4D, a novel framework for refined control over both appearance and motion in 4D generation. Unlike… ▽ More With the success of 2D and 3D visual generative models, there is growing interest in generating 4D content. Existing methods primarily rely on text prompts to produce 4D content, but they often fall short of accurately defining complex or rare motions. To address this limitation, we propose MagicPose4D, a novel framework for refined control over both appearance and motion in 4D generation. Unlike traditional methods, MagicPose4D accepts monocular videos as motion prompts, enabling precise and customizable motion generation. MagicPose4D comprises two key modules: i) Dual-Phase 4D Reconstruction Module} which operates in two phases. The first phase focuses on capturing the model's shape using accurate 2D supervision and less accurate but geometrically informative 3D pseudo-supervision without imposing skeleton constraints. The second phase refines the model using more accurate pseudo-3D supervision, obtained in the first phase and introduces kinematic chain-based skeleton constraints to ensure physical plausibility. Additionally, we propose a Global-local Chamfer loss that aligns the overall distribution of predicted mesh vertices with the supervision while maintaining part-level alignment without extra annotations. ii) Cross-category Motion Transfer Module} leverages the predictions from the 4D reconstruction module and uses a kinematic-chain-based skeleton to achieve cross-category motion transfer. It ensures smooth transitions between frames through dynamic rigidity, facilitating robust generalization without additional training. Through extensive experiments, we demonstrate that MagicPose4D significantly improves the accuracy and consistency of 4D content generation, outperforming existing methods in various benchmarks. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: Project Page: https://boese0601.github.io/magicpose4d

arXiv:2405.12607 [pdf, other]

S3O: A Dual-Phase Approach for Reconstructing Dynamic Shape and Skeleton of Articulated Objects from Single Monocular Video

Authors: Hao Zhang, Fang Li, Samyak Rawlekar, Narendra Ahuja

Abstract: Reconstructing dynamic articulated objects from a singular monocular video is challenging, requiring joint estimation of shape, motion, and camera parameters from limited views. Current methods typically demand extensive computational resources and training time, and require additional human annotations such as predefined parametric models, camera poses, and key points, limiting their generalizabi… ▽ More Reconstructing dynamic articulated objects from a singular monocular video is challenging, requiring joint estimation of shape, motion, and camera parameters from limited views. Current methods typically demand extensive computational resources and training time, and require additional human annotations such as predefined parametric models, camera poses, and key points, limiting their generalizability. We propose Synergistic Shape and Skeleton Optimization (S3O), a novel two-phase method that forgoes these prerequisites and efficiently learns parametric models including visible shapes and underlying skeletons. Conventional strategies typically learn all parameters simultaneously, leading to interdependencies where a single incorrect prediction can result in significant errors. In contrast, S3O adopts a phased approach: it first focuses on learning coarse parametric models, then progresses to motion learning and detail addition. This method substantially lowers computational complexity and enhances robustness in reconstruction from limited viewpoints, all without requiring additional annotations. To address the current inadequacies in 3D reconstruction from monocular video benchmarks, we collected the PlanetZoo dataset. Our experimental evaluations on standard benchmarks and the PlanetZoo dataset affirm that S3O provides more accurate 3D reconstruction, and plausible skeletons, and reduces the training time by approximately 60% compared to the state-of-the-art, thus advancing the state of the art in dynamic object reconstruction. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: Accepted by ICML 2024

arXiv:2404.16193 [pdf, other]

Improving Multi-label Recognition using Class Co-Occurrence Probabilities

Authors: Samyak Rawlekar, Shubhang Bhatnagar, Vishnuvardhan Pogunulu Srinivasulu, Narendra Ahuja

Abstract: Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-images datasets for the task. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such c… ▽ More Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-images datasets for the task. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between a pair of classes. We propose a framework to extend the independent classifiers by incorporating the co-occurrence information for object pairs to improve the performance of independent classifiers. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes, by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2403.14977 [pdf, other]

Piecewise-Linear Manifolds for Deep Metric Learning

Authors: Shubhang Bhatnagar, Narendra Ahuja

Abstract: Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear… ▽ More Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, hel** improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: Accepted at CPAL 2024 (Oral)

arXiv:2401.08809 [pdf, other]

Learning Implicit Representation for Reconstructing Articulated Objects

Authors: Hao Zhang, Fang Li, Samyak Rawlekar, Narendra Ahuja

Abstract: 3D Reconstruction of moving articulated objects without additional information about object structure is a challenging problem. Current methods overcome such challenges by employing category-specific skeletal models. Consequently, they do not generalize well to articulated objects in the wild. We treat an articulated object as an unknown, semi-rigid skeletal structure surrounded by nonrigid materi… ▽ More 3D Reconstruction of moving articulated objects without additional information about object structure is a challenging problem. Current methods overcome such challenges by employing category-specific skeletal models. Consequently, they do not generalize well to articulated objects in the wild. We treat an articulated object as an unknown, semi-rigid skeletal structure surrounded by nonrigid material (e.g., skin). Our method simultaneously estimates the visible (explicit) representation (3D shapes, colors, camera parameters) and the implicit skeletal representation, from motion cues in the object video without 3D supervision. Our implicit representation consists of four parts. (1) Skeleton, which specifies how semi-rigid parts are connected. (2) \textcolor{black}{Skinning Weights}, which associates each surface vertex with semi-rigid parts with probability. (3) Rigidity Coefficients, specifying the articulation of the local surface. (4) Time-Varying Transformations, which specify the skeletal motion and surface deformation parameters. We introduce an algorithm that uses physical constraints as regularization terms and iteratively estimates both implicit and explicit representations. Our method is category-agnostic, thus eliminating the need for category-specific skeletons, we show that our method outperforms state-of-the-art across standard video datasets. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted by ICLR 2024. Code: https://github.com/haoz19/LIMR

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.05538 [pdf, other]

CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Authors: Hao Zhang, Fang Li, Lu Qi, Ming-Hsuan Yang, Narendra Ahuja

Abstract: Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, necessitating segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We… ▽ More Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, necessitating segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) tailored to specific tasks. However, these methods cater to singular tasks, demand training from scratch, and we demonstrate certain deficiencies in CA-M2F, which affect performance. We propose the Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that can integrate with existing methods, thereby embedding structural constraints and achieving performance gain, including the unseen, specifically OOD, ZS3, and domain adaptation (DA) tasks. There are two schemes for CSL to integrate with existing methods (1) by distilling knowledge from a base teacher network, enforcing constraints across training and inference phrases, or (2) by leveraging established models to obtain per-pixel distributions without retraining, appending constraints during the inference phase. We propose soft assignment and mask split methodologies that enhance OOD object segmentation. Empirical evaluations demonstrate CSL's prowess in boosting the performance of existing algorithms spanning OOD segmentation, ZS3, and DA segmentation, consistently transcending the state-of-art across all three tasks. △ Less

Submitted 8 February, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024

arXiv:2310.16383 [pdf, other]

Open-NeRF: Towards Open Vocabulary NeRF Decomposition

Authors: Hao Zhang, Fang Li, Narendra Ahuja

Abstract: In this paper, we address the challenge of decomposing Neural Radiance Fields (NeRF) into objects from an open vocabulary, a critical task for object manipulation in 3D reconstruction and view synthesis. Current techniques for NeRF decomposition involve a trade-off between the flexibility of processing open-vocabulary queries and the accuracy of 3D segmentation. We present, Open-vocabulary Embedde… ▽ More In this paper, we address the challenge of decomposing Neural Radiance Fields (NeRF) into objects from an open vocabulary, a critical task for object manipulation in 3D reconstruction and view synthesis. Current techniques for NeRF decomposition involve a trade-off between the flexibility of processing open-vocabulary queries and the accuracy of 3D segmentation. We present, Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF), that leverage large-scale, off-the-shelf, segmentation models like the Segment Anything Model (SAM) and introduce an integrate-and-distill paradigm with hierarchical embeddings to achieve both the flexibility of open-vocabulary querying and 3D segmentation accuracy. Open-NeRF first utilizes large-scale foundation models to generate hierarchical 2D mask proposals from varying viewpoints. These proposals are then aligned via tracking approaches and integrated within the 3D space and subsequently distilled into the 3D field. This process ensures consistent recognition and granularity of objects from different viewpoints, even in challenging scenarios involving occlusion and indistinct features. Our experimental results show that the proposed Open-NeRF outperforms state-of-the-art methods such as LERF \cite{lerf} and FFD \cite{ffd} in open-vocabulary scenarios. Open-NeRF offers a promising solution to NeRF decomposition, guided by open-vocabulary queries, enabling novel applications in robotics and vision-language interaction in open-world 3D scenes. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Accepted by WACV 2024

arXiv:2309.07430 [pdf, other]

doi 10.1038/s41591-024-02855-5

Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization

Authors: Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerova, Nidhi Rohatgi, Poonam Hosamani, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, John Pauly, Akshay S. Chaudhari

Abstract: Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP), their effectiveness on a diverse range of clinical summarization tasks remains unproven. In this study, we apply adaptation methods to eight LLMs,… ▽ More Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP), their effectiveness on a diverse range of clinical summarization tasks remains unproven. In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Quantitative assessments with syntactic, semantic, and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care. △ Less

Submitted 11 April, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: 27 pages, 19 figures

Journal ref: Nature Medicine, 2024

arXiv:2308.04643 [pdf, other]

doi 10.1109/IROS55552.2023.10342147

Long-Distance Gesture Recognition using Dynamic Neural Networks

Authors: Shubhang Bhatnagar, Sharath Gopal, Narendra Ahuja, Liu Ren

Abstract: Gestures form an important medium of communication between humans and machines. An overwhelming majority of existing gesture recognition methods are tailored to a scenario where humans and machines are located very close to each other. This short-distance assumption does not hold true for several types of interactions, for example gesture-based interactions with a floor cleaning robot or with a dr… ▽ More Gestures form an important medium of communication between humans and machines. An overwhelming majority of existing gesture recognition methods are tailored to a scenario where humans and machines are located very close to each other. This short-distance assumption does not hold true for several types of interactions, for example gesture-based interactions with a floor cleaning robot or with a drone. Methods made for short-distance recognition are unable to perform well on long-distance recognition due to gestures occupying only a small portion of the input data. Their performance is especially worse in resource constrained settings where they are not able to effectively focus their limited compute on the gesturing subject. We propose a novel, accurate and efficient method for the recognition of gestures from longer distances. It uses a dynamic neural network to select features from gesture-containing spatial regions of the input sensor data for further processing. This helps the network focus on features important for gesture recognition while discarding background features early on, thus making it more compute efficient compared to other techniques. We demonstrate the performance of our method on the LD-ConGR long-distance dataset where it outperforms previous state-of-the-art methods on recognition accuracy and compute efficiency. △ Less

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

Journal ref: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 2023, pp. 1307-1312

arXiv:2305.17295 [pdf, other]

Rate-Distortion Theory in Coding for Machines and its Application

Authors: Alon Harell, Yalda Foroutan, Nilesh Ahuja, Parual Datta, Bhavya Kanzariya, V. Srinivasa Somayaulu, Omesh Tickoo, Anderson de Andrade, Ivan V. Bajic

Abstract: Recent years have seen a tremendous growth in both the capability and popularity of automatic machine analysis of images and video. As a result, a growing need for efficient compression methods optimized for machine vision, rather than human vision, has emerged. To meet this growing demand, several methods have been developed for image and video coding for machines. Unfortunately, while there is a… ▽ More Recent years have seen a tremendous growth in both the capability and popularity of automatic machine analysis of images and video. As a result, a growing need for efficient compression methods optimized for machine vision, rather than human vision, has emerged. To meet this growing demand, several methods have been developed for image and video coding for machines. Unfortunately, while there is a substantial body of knowledge regarding rate-distortion theory for human vision, the same cannot be said of machine analysis. In this paper, we extend the current rate-distortion theory for machines, providing insight into important design considerations of machine-vision codecs. We then utilize this newfound understanding to improve several methods for learnable image coding for machines. Our proposed methods achieve state-of-the-art rate-distortion performance on several computer vision tasks such as classification, instance segmentation, and object detection. △ Less

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2302.10718 [pdf, other]

Effects of Architectures on Continual Semantic Segmentation

Authors: Tobias Kalb, Niket Ahuja, **gxing Zhou, Jürgen Beyerer

Abstract: Research in the field of Continual Semantic Segmentation is mainly investigating novel learning algorithms to overcome catastrophic forgetting of neural networks. Most recent publications have focused on improving learning algorithms without distinguishing effects caused by the choice of neural architecture.Therefore, we study how the choice of neural network architecture affects catastrophic forg… ▽ More Research in the field of Continual Semantic Segmentation is mainly investigating novel learning algorithms to overcome catastrophic forgetting of neural networks. Most recent publications have focused on improving learning algorithms without distinguishing effects caused by the choice of neural architecture.Therefore, we study how the choice of neural network architecture affects catastrophic forgetting in class- and domain-incremental semantic segmentation. Specifically, we compare the well-researched CNNs to recently proposed Transformers and Hybrid architectures, as well as the impact of the choice of novel normalization layers and different decoder heads. We find that traditional CNNs like ResNet have high plasticity but low stability, while transformer architectures are much more stable. When the inductive biases of CNN architectures are combined with transformers in hybrid architectures, it leads to higher plasticity and stability. The stability of these models can be explained by their ability to learn general features that are robust against distribution shifts. Experiments with different normalization layers show that Continual Normalization achieves the best trade-off in terms of adaptability and stability of the model. In the class-incremental setting, the choice of the normalization layer has much less impact. Our experiments suggest that the right choice of architecture can significantly reduce forgetting even with naive fine-tuning and confirm that for real-world applications, the architecture is an important factor in designing a continual learning model. △ Less

Submitted 21 February, 2023; originally announced February 2023.

Comments: Currently under Review

arXiv:2211.12650 [pdf, other]

FRE: A Fast Method For Anomaly Detection And Segmentation

Authors: Ibrahima Ndiour, Nilesh Ahuja, Utku Genc, Omesh Tickoo

Abstract: This paper presents a fast and principled approach for solving the visual anomaly detection and segmentation problem. In this setup, we have access to only anomaly-free training data and want to detect and identify anomalies of an arbitrary nature on test data. We propose the application of linear statistical dimensionality reduction techniques on the intermediate features produced by a pretrained… ▽ More This paper presents a fast and principled approach for solving the visual anomaly detection and segmentation problem. In this setup, we have access to only anomaly-free training data and want to detect and identify anomalies of an arbitrary nature on test data. We propose the application of linear statistical dimensionality reduction techniques on the intermediate features produced by a pretrained DNN on the training data, in order to capture the low-dimensional subspace truly spanned by said features. We show that the \emph{feature reconstruction error} (FRE), which is the $\ell_2$-norm of the difference between the original feature in the high-dimensional space and the pre-image of its low-dimensional reduced embedding, is extremely effective for anomaly detection. Further, using the same feature reconstruction error concept on intermediate convolutional layers, we derive FRE maps that provide pixel-level spatial localization of the anomalies in the image (i.e. segmentation). Experiments using standard anomaly detection datasets and DNN architectures demonstrate that our method matches or exceeds best-in-class quality performance, but at a fraction of the computational and memory cost required by the state of the art. It can be trained and run very efficiently, even on a traditional CPU. △ Less

Submitted 22 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2203.10422

arXiv:2210.16472 [pdf, other]

Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian

Abstract: There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (… ▽ More There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (ii) predicting the 3D visual motion of a sounding source using its separated audio. Towards this end, we present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation. At the heart of ASMP is a 2.5D scene graph capturing various objects in the video and their pseudo-3D spatial proximities. This graph is constructed by registering together 2.5D monocular depth predictions from the 2D video frames and associating the 2.5D scene regions with the outputs of an object detector applied on those frames. The ASMP task is then mathematically modeled as the joint problem of: (i) recursively segmenting the 2.5D scene graph into several sub-graphs, each associated with a constituent sound in the input audio mixture (which is then separated) and (ii) predicting the 3D motions of the corresponding sound sources from the separated audio. To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods. △ Less

Submitted 28 October, 2022; originally announced October 2022.

Comments: Accepted at NeurIPS 2022

arXiv:2210.06540 [pdf]

Blockchain for Unmanned Underwater Drones: Research Issues, Challenges, Trends and Future Directions

Authors: Neelu Jyoti Ahuja, Adarsh Kumar, Monika Thapliyal, Sarthika Dutt, Tanesh Kumar, Diego Augusto De Jesus Pacheco, Charalambos Konstantinou, Kim-Kwang Raymond Choo

Abstract: Underwater drones have found a place in oceanography, oceanic research, bathymetric surveys, military, surveillance, monitoring, undersea exploration, mining, commercial diving, photography and several other activities. Drones housed with several sensors and complex propulsion systems help oceanographic scientists and undersea explorers to map the seabed, study waves, view dead zones, analyze fish… ▽ More Underwater drones have found a place in oceanography, oceanic research, bathymetric surveys, military, surveillance, monitoring, undersea exploration, mining, commercial diving, photography and several other activities. Drones housed with several sensors and complex propulsion systems help oceanographic scientists and undersea explorers to map the seabed, study waves, view dead zones, analyze fish counts, predict tidal wave behaviors, aid in finding shipwrecks, building windfarms, examine oil platforms located in deep seas and inspect nuclear reactors in the ship vessels. While drones can be explicitly programmed for specific missions, data security and privacy are crucial issues of serious concern. Blockchain has emerged as a key enabling technology, amongst other disruptive technological enablers, to address security, data sharing, storage, process tracking, collaboration and resource management. This study presents a comprehensive review on the utilization of Blockchain in different underwater applications, discussing use cases and detailing benefits. Potential challenges of underwater applications addressed by Blockchain have been detailed. This work identifies knowledge gaps between theoretical research and real-time Blockchain integration in realistic underwater drone applications. The key limitations for effective integration of Blockchain in real-time integration in UUD applications, along with directions for future research have been presented. △ Less

Submitted 12 October, 2022; originally announced October 2022.

arXiv:2208.11596 [pdf, other]

A Low-Complexity Approach to Rate-Distortion Optimized Variable Bit-Rate Compression for Split DNN Computing

Authors: Parual Datta, Nilesh Ahuja, V. Srinivasa Somayazulu, Omesh Tickoo

Abstract: Split computing has emerged as a recent paradigm for implementation of DNN-based AI workloads, wherein a DNN model is split into two parts, one of which is executed on a mobile/client device and the other on an edge-server (or cloud). Data compression is applied to the intermediate tensor from the DNN that needs to be transmitted, addressing the challenge of optimizing the rate-accuracy-complexity… ▽ More Split computing has emerged as a recent paradigm for implementation of DNN-based AI workloads, wherein a DNN model is split into two parts, one of which is executed on a mobile/client device and the other on an edge-server (or cloud). Data compression is applied to the intermediate tensor from the DNN that needs to be transmitted, addressing the challenge of optimizing the rate-accuracy-complexity trade-off. Existing split-computing approaches adopt ML-based data compression, but require that the parameters of either the entire DNN model, or a significant portion of it, be retrained for different compression levels. This incurs a high computational and storage burden: training a full DNN model from scratch is computationally demanding, maintaining multiple copies of the DNN parameters increases storage requirements, and switching the full set of weights during inference increases memory bandwidth. In this paper, we present an approach that addresses all these challenges. It involves the systematic design and training of bottleneck units - simple, low-cost neural networks - that can be inserted at the point of split. Our approach is remarkably lightweight, both during training and inference, highly effective and achieves excellent rate-distortion performance at a small fraction of the compute and storage overhead compared to existing methods. △ Less

Submitted 24 August, 2022; originally announced August 2022.

Comments: ICPR 2022

arXiv:2206.07988 [pdf, other]

PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality

Authors: Prashant Kodali, Tanmay Sachan, Akshay Goindani, Anmol Goel, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru

Abstract: Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine generated code-mixed text is an open problem. In our submission to HinglishEval, a shared-task coll… ▽ More Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine generated code-mixed text is an open problem. In our submission to HinglishEval, a shared-task collocated with INLG2022, we attempt to build models factors that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality. △ Less

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2203.10422 [pdf, other]

Subspace Modeling for Fast Out-Of-Distribution and Anomaly Detection

Authors: Ibrahima J. Ndiour, Nilesh A. Ahuja, Omesh Tickoo

Abstract: This paper presents a fast, principled approach for detecting anomalous and out-of-distribution (OOD) samples in deep neural networks (DNN). We propose the application of linear statistical dimensionality reduction techniques on the semantic features produced by a DNN, in order to capture the low-dimensional subspace truly spanned by said features. We show that the "feature reconstruction error" (… ▽ More This paper presents a fast, principled approach for detecting anomalous and out-of-distribution (OOD) samples in deep neural networks (DNN). We propose the application of linear statistical dimensionality reduction techniques on the semantic features produced by a DNN, in order to capture the low-dimensional subspace truly spanned by said features. We show that the "feature reconstruction error" (FRE), which is the $\ell_2$-norm of the difference between the original feature in the high-dimensional space and the pre-image of its low-dimensional reduced embedding, is highly effective for OOD and anomaly detection. To generalize to intermediate features produced at any given layer, we extend the methodology by applying nonlinear kernel-based methods. Experiments using standard image datasets and DNN architectures demonstrate that our method meets or exceeds best-in-class quality performance, but at a fraction of the computational and memory cost required by the state of the art. It can be trained and run very efficiently, even on a traditional CPU. △ Less

Submitted 19 March, 2022; originally announced March 2022.

Comments: arXiv admin note: text overlap with arXiv:2012.04250

arXiv:2202.08341 [pdf, other]

Anomalib: A Deep Learning Library for Anomaly Detection

Authors: Samet Akcay, Dick Ameln, Ashwin Vaidya, Barath Lakshmanan, Nilesh Ahuja, Utku Genc

Abstract: This paper introduces anomalib, a novel library for unsupervised anomaly detection and localization. With reproducibility and modularity in mind, this open-source library provides algorithms from the literature and a set of tools to design custom anomaly detection algorithms via a plug-and-play approach. Anomalib comprises state-of-the-art anomaly detection algorithms that achieve top performance… ▽ More This paper introduces anomalib, a novel library for unsupervised anomaly detection and localization. With reproducibility and modularity in mind, this open-source library provides algorithms from the literature and a set of tools to design custom anomaly detection algorithms via a plug-and-play approach. Anomalib comprises state-of-the-art anomaly detection algorithms that achieve top performance on the benchmarks and that can be used off-the-shelf. In addition, the library provides components to design custom algorithms that could be tailored towards specific needs. Additional tools, including experiment trackers, visualizers, and hyper-parameter optimizers, make it simple to design and implement anomaly detection models. The library also supports OpenVINO model optimization and quantization for real-time deployment. Overall, anomalib is an extensive library for the design, implementation, and deployment of unsupervised anomaly detection models from data to the edge. △ Less

Submitted 16 February, 2022; originally announced February 2022.

arXiv:2201.06741 [pdf]

HashSet -- A Dataset For Hashtag Segmentation

Authors: Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru

Abstract: Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways -- transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used… ▽ More Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways -- transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used for the hashtag segmentation task -- STAN, BOUN -- are small in size and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and also account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising of: a) 1.9k manually annotated dataset; b) 3.3M loosely supervised dataset. HashSet dataset is sampled from a different set of tweets when compared to existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We show that the performance of SOTA models for Hashtag Segmentation drops substantially on proposed dataset, indicating that the proposed dataset provides an alternate set of hashtags to train and assess models. △ Less

Submitted 17 January, 2022; originally announced January 2022.

arXiv:2111.10971 [pdf, other]

Tracking Grow-Finish Pigs Across Large Pens Using Multiple Cameras

Authors: Aniket Shirke, Aziz Saifuddin, Achleshwar Luthra, Jiangong Li, Tawni Williams, Xiaodan Hu, Aneesh Kotnana, Okan Kocabalkanli, Narendra Ahuja, Angela Green-Miller, Isabella Condotta, Ryan N. Dilger, Matthew Caesar

Abstract: Increasing demand for meat products combined with farm labor shortages has resulted in a need to develop new real-time solutions to monitor animals effectively. Significant progress has been made in continuously locating individual pigs using tracking-by-detection methods. However, these methods fail for oblong pens because a single fixed camera does not cover the entire floor at adequate resoluti… ▽ More Increasing demand for meat products combined with farm labor shortages has resulted in a need to develop new real-time solutions to monitor animals effectively. Significant progress has been made in continuously locating individual pigs using tracking-by-detection methods. However, these methods fail for oblong pens because a single fixed camera does not cover the entire floor at adequate resolution. We address this problem by using multiple cameras, placed such that the visual fields of adjacent cameras overlap, and together they span the entire floor. Avoiding breaks in tracking requires inter-camera handover when a pig crosses from one camera's view into that of an adjacent camera. We identify the adjacent camera and the shared pig location on the floor at the handover time using inter-view homography. Our experiments involve two grow-finish pens, housing 16-17 pigs each, and three RGB cameras. Our algorithm first detects pigs using a deep learning-based object detection model (YOLO) and creates their local tracking IDs using a multi-object tracking algorithm (DeepSORT). We then use inter-camera shared locations to match multiple views and generate a global ID for each pig that holds throughout tracking. To evaluate our approach, we provide five two-minutes long video sequences with fully annotated global identities. We track pigs in a single camera view with a Multi-Object Tracking Accuracy and Precision of 65.0% and 54.3% respectively and achieve a Camera Handover Accuracy of 74.0%. We open-source our code and annotated dataset at https://github.com/AIFARMS/multi-camera-pig-tracking △ Less

Submitted 21 November, 2021; originally announced November 2021.

Comments: 6 pages, 4 figures, Accepted at the CVPR 2021 CV4Animals workshop

arXiv:2110.03446 [pdf, other]

A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian

Abstract: Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to solve this task typically estimate a latent prior characterizing this stochasticity, however do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) be… ▽ More Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to solve this task typically estimate a latent prior characterizing this stochasticity, however do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high. Towards this end, we introduce Neural Uncertainty Quantifier (NUQ) - a stochastic quantification of the model's predictive uncertainty, and use it to weigh the MSE loss. We propose a hierarchical, variational framework to derive NUQ in a principled manner using a deep, Bayesian graphical model. Our experiments on four benchmark stochastic video prediction datasets show that our proposed framework trains more effectively compared to the state-of-the-art models (especially when the training sets are small), while demonstrating better video generation quality and diversity against several evaluation metrics. △ Less

Submitted 5 October, 2021; originally announced October 2021.

Comments: Accepted at ICCV 2021 (Oral)

arXiv:2109.11955 [pdf, other]

Visual Scene Graphs for Audio Source Separation

Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

Abstract: State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinc… ▽ More State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention. These embeddings are used for conditioning an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds. In this paper, we also introduce an "in the wild'' video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and provides a challenging, natural, and daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method against recent prior approaches. △ Less

Submitted 24 September, 2021; originally announced September 2021.

Comments: Accepted at ICCV 2021

arXiv:2109.09166 [pdf, other]

Unsupervised 3D Pose Estimation for Hierarchical Dance Video Recognition

Authors: Xiaodan Hu, Narendra Ahuja

Abstract: Dance experts often view dance as a hierarchy of information, spanning low-level (raw images, image sequences), mid-levels (human poses and bodypart movements), and high-level (dance genre). We propose a Hierarchical Dance Video Recognition framework (HDVR). HDVR estimates 2D pose sequences, tracks dancers, and then simultaneously estimates corresponding 3D poses and 3D-to-2D imaging parameters, w… ▽ More Dance experts often view dance as a hierarchy of information, spanning low-level (raw images, image sequences), mid-levels (human poses and bodypart movements), and high-level (dance genre). We propose a Hierarchical Dance Video Recognition framework (HDVR). HDVR estimates 2D pose sequences, tracks dancers, and then simultaneously estimates corresponding 3D poses and 3D-to-2D imaging parameters, without requiring ground truth for 3D poses. Unlike most methods that work on a single person, our tracking works on multiple dancers, under occlusions. From the estimated 3D pose sequence, HDVR extracts body part movements, and therefrom dance genre. The resulting hierarchical dance representation is explainable to experts. To overcome noise and interframe correspondence ambiguities, we enforce spatial and temporal motion smoothness and photometric continuity over time. We use an LSTM network to extract 3D movement subsequences from which we recognize the dance genre. For experiments, we have identified 154 movement types, of 16 body parts, and assembled a new University of Illinois Dance (UID) Dataset, containing 1143 video clips of 9 genres covering 30 hours, annotated with movement and genre labels. Our experimental results demonstrate that our algorithms outperform the state-of-the-art 3D pose estimation methods, which also enhances our dance recognition performance. △ Less

Submitted 19 September, 2021; originally announced September 2021.

Comments: To appear in ICCV2021

arXiv:2109.06873 [pdf, other]

Robust Contrastive Active Learning with Feature-guided Query Strategies

Authors: Ranganath Krishnan, Nilesh Ahuja, Alok Sinha, Mahesh Subedar, Omesh Tickoo, Ravi Iyer

Abstract: We introduce supervised contrastive active learning (SCAL) and propose efficient query strategies in active learning based on the feature similarity (featuresim) and principal component analysis based feature-reconstruction error (fre) to select informative data samples with diverse feature representations. We demonstrate our proposed method achieves state-of-the-art accuracy, model calibration an… ▽ More We introduce supervised contrastive active learning (SCAL) and propose efficient query strategies in active learning based on the feature similarity (featuresim) and principal component analysis based feature-reconstruction error (fre) to select informative data samples with diverse feature representations. We demonstrate our proposed method achieves state-of-the-art accuracy, model calibration and reduces sampling bias in an active learning setup for balanced and imbalanced datasets on image classification tasks. We also evaluate robustness of model to distributional shift derived from different query strategies in active learning setting. Using extensive experiments, we show that our proposed approach outperforms high performing compute-intensive methods by a big margin resulting in 9.9% lower mean corruption error, 7.2% lower expected calibration error under dataset shift and 8.9% higher AUROC for out-of-distribution detection. △ Less

Submitted 14 August, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: 20 pages with appendix. arXiv admin note: text overlap with arXiv:2109.06321

arXiv:2109.06321 [pdf, other]

Mitigating Sampling Bias and Improving Robustness in Active Learning

Authors: Ranganath Krishnan, Alok Sinha, Nilesh Ahuja, Mahesh Subedar, Omesh Tickoo, Ravi Iyer

Abstract: This paper presents simple and efficient methods to mitigate sampling bias in active learning while achieving state-of-the-art accuracy and model robustness. We introduce supervised contrastive active learning by leveraging the contrastive loss for active learning under a supervised setting. We propose an unbiased query strategy that selects informative data samples of diverse feature representati… ▽ More This paper presents simple and efficient methods to mitigate sampling bias in active learning while achieving state-of-the-art accuracy and model robustness. We introduce supervised contrastive active learning by leveraging the contrastive loss for active learning under a supervised setting. We propose an unbiased query strategy that selects informative data samples of diverse feature representations with our methods: supervised contrastive active learning (SCAL) and deep feature modeling (DFM). We empirically demonstrate our proposed methods reduce sampling bias, achieve state-of-the-art accuracy and model calibration in an active learning setup with the query computation 26x faster than Bayesian active learning by disagreement and 11x faster than CoreSet. The proposed SCAL method outperforms by a big margin in robustness to dataset shift and out-of-distribution. △ Less

Submitted 13 September, 2021; originally announced September 2021.

Comments: Human in the Loop Learning workshop at International Conference on Machine Learning (ICML 2021)

arXiv:2105.03270 [pdf, other]

Energy-Based Anomaly Detection and Localization

Authors: Ergin Utku Genc, Nilesh Ahuja, Ibrahima J Ndiour, Omesh Tickoo

Abstract: This brief sketches initial progress towards a unified energy-based solution for the semi-supervised visual anomaly detection and localization problem. In this setup, we have access to only anomaly-free training data and want to detect and identify anomalies of an arbitrary nature on test data. We employ the density estimates from the energy-based model (EBM) as normalcy scores that can be used to… ▽ More This brief sketches initial progress towards a unified energy-based solution for the semi-supervised visual anomaly detection and localization problem. In this setup, we have access to only anomaly-free training data and want to detect and identify anomalies of an arbitrary nature on test data. We employ the density estimates from the energy-based model (EBM) as normalcy scores that can be used to discriminate normal images from anomalous ones. Further, we back-propagate the gradients of the energy score with respect to the image in order to generate a gradient map that provides pixel-level spatial localization of the anomalies in the image. In addition to the spatial localization, we show that simple processing of the gradient map can also provide alternative normalcy scores that either match or surpass the detection performance obtained with the energy value. To quantitatively validate the performance of the proposed method, we conduct experiments on the MVTec industrial dataset. Though still preliminary, our results are very promising and reveal the potential of EBMs for simultaneously detecting and localizing unforeseen anomalies in images. △ Less

Submitted 7 May, 2021; originally announced May 2021.

Comments: 9 pages, 3 figures, as submitted to EBM ICLR 2021 workshop

arXiv:2012.04250 [pdf, other]

Out-Of-Distribution Detection With Subspace Techniques And Probabilistic Modeling Of Features

Authors: Ibrahima Ndiour, Nilesh Ahuja, Omesh Tickoo

Abstract: This paper presents a principled approach for detecting out-of-distribution (OOD) samples in deep neural networks (DNN). Modeling probability distributions on deep features has recently emerged as an effective, yet computationally cheap method to detect OOD samples in DNN. However, the features produced by a DNN at any given layer do not fully occupy the corresponding high-dimensional feature spac… ▽ More This paper presents a principled approach for detecting out-of-distribution (OOD) samples in deep neural networks (DNN). Modeling probability distributions on deep features has recently emerged as an effective, yet computationally cheap method to detect OOD samples in DNN. However, the features produced by a DNN at any given layer do not fully occupy the corresponding high-dimensional feature space. We apply linear statistical dimensionality reduction techniques and nonlinear manifold-learning techniques on the high-dimensional features in order to capture the true subspace spanned by the features. We hypothesize that such lower-dimensional feature embeddings can mitigate the curse of dimensionality, and enhance any feature-based method for more efficient and effective performance. In the context of uncertainty estimation and OOD, we show that the log-likelihood score obtained from the distributions learnt on this lower-dimensional subspace is more discriminative for OOD detection. We also show that the feature reconstruction error, which is the $L_2$-norm of the difference between the original feature and the pre-image of its embedding, is highly effective for OOD detection and in some cases superior to the log-likelihood scores. The benefits of our approach are demonstrated on image features by detecting OOD images, using popular DNN architectures on commonly used image datasets such as CIFAR10, CIFAR100, and SVHN. △ Less

Submitted 8 December, 2020; originally announced December 2020.

arXiv:2007.12130 [pdf, other]

Sound2Sight: Generating Visual Dynamics from Sound and Context

Authors: Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja

Abstract: Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on… ▽ More Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework, that is trained to learn a per frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures that are consistent with the provided audio and the past context. Moreover, to improve the quality and coherence of the generated frames, we propose a multimodal discriminator that differentiates between a synthesized and a real audio-visual clip. We empirically evaluate our approach, vis-á-vis closely-related prior methods, on two new datasets viz. (i) Multimodal Stochastic Moving MNIST with a Surprise Obstacle, (ii) Youtube Paintings; as well as on the existing Audio-Set Drums dataset. Our extensive experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality, while also producing diverse video content. △ Less

Submitted 23 July, 2020; originally announced July 2020.

Comments: Accepted at ECCV 2020

arXiv:1912.08434 [pdf, other]

Tree pyramidal adaptive importance sampling

Authors: Javier Felip, Nilesh Ahuja, Omesh Tickoo

Abstract: This paper introduces Tree-Pyramidal Adaptive Importance Sampling (TP-AIS), a novel iterated sampling method that outperforms state-of-the-art approaches like deterministic mixture population Monte Carlo (DM-PMC), mixture population Monte Carlo (M-PMC) and layered adaptive importance sampling (LAIS). TP-AIS iteratively builds a proposal distribution parameterized by a tree pyramid, where each tree… ▽ More This paper introduces Tree-Pyramidal Adaptive Importance Sampling (TP-AIS), a novel iterated sampling method that outperforms state-of-the-art approaches like deterministic mixture population Monte Carlo (DM-PMC), mixture population Monte Carlo (M-PMC) and layered adaptive importance sampling (LAIS). TP-AIS iteratively builds a proposal distribution parameterized by a tree pyramid, where each tree leaf spans a subspace that represents its importance density. After each new sample operation, a set of tree leaves are subdivided for improving the approximation of the proposal distribution to the target density. Unlike the rest of the methods in the literature, TP-AIS is parameter free and requires no tuning to achieve its best performance. We evaluate TP-AIS with different complexity randomized target probability density functions (PDF) and also analyze its application to different dimensions. The results are compared to state-of-the-art iterative importance sampling approaches and other baseline MCMC approaches using Normalized Effective Sample Size (N-ESS), Jensen-Shannon Divergence, and time complexity. △ Less

Submitted 23 March, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

Comments: 20 pages + 13 pages of additional result plots and evaluation details

arXiv:1912.01206 [pdf, ps, other]

Deep Probabilistic Models to Detect Data Poisoning Attacks

Authors: Mahesh Subedar, Nilesh Ahuja, Ranganath Krishnan, Ibrahima J. Ndiour, Omesh Tickoo

Abstract: Data poisoning attacks compromise the integrity of machine-learning models by introducing malicious training samples to influence the results during test time. In this work, we investigate backdoor data poisoning attack on deep neural networks (DNNs) by inserting a backdoor pattern in the training images. The resulting attack will misclassify poisoned test samples while maintaining high accuracies… ▽ More Data poisoning attacks compromise the integrity of machine-learning models by introducing malicious training samples to influence the results during test time. In this work, we investigate backdoor data poisoning attack on deep neural networks (DNNs) by inserting a backdoor pattern in the training images. The resulting attack will misclassify poisoned test samples while maintaining high accuracies for the clean test-set. We present two approaches for detection of such poisoned samples by quantifying the uncertainty estimates associated with the trained models. In the first approach, we model the outputs of the various layers (deep features) with parametric probability distributions learnt from the clean held-out dataset. At inference, the likelihoods of deep features w.r.t these distributions are calculated to derive uncertainty estimates. In the second approach, we use Bayesian deep neural networks trained with mean-field variational inference to estimate model uncertainty associated with the predictions. The uncertainty estimates from these methods are used to discriminate clean from the poisoned samples. △ Less

Submitted 3 December, 2019; originally announced December 2019.

Comments: To appear in Bayesian Deep Learning Workshop at NeurIPS 2019

arXiv:1909.11786 [pdf, other]

Probabilistic Modeling of Deep Features for Out-of-Distribution and Adversarial Detection

Authors: Nilesh A. Ahuja, Ibrahima Ndiour, Trushant Kalyanpur, Omesh Tickoo

Abstract: We present a principled approach for detecting out-of-distribution (OOD) and adversarial samples in deep neural networks. Our approach consists in modeling the outputs of the various layers (deep features) with parametric probability distributions once training is completed. At inference, the likelihoods of the deep features w.r.t the previously learnt distributions are calculated and used to deri… ▽ More We present a principled approach for detecting out-of-distribution (OOD) and adversarial samples in deep neural networks. Our approach consists in modeling the outputs of the various layers (deep features) with parametric probability distributions once training is completed. At inference, the likelihoods of the deep features w.r.t the previously learnt distributions are calculated and used to derive uncertainty estimates that can discriminate in-distribution samples from OOD samples. We explore the use of two classes of multivariate distributions for modeling the deep features - Gaussian and Gaussian mixture - and study the trade-off between accuracy and computational complexity. We demonstrate benefits of our approach on image features by detecting OOD images and adversarially-generated images, using popular DNN architectures on MNIST and CIFAR10 datasets. We show that more precise modeling of the feature distributions result in significantly improved detection of OOD and adversarial samples; up to 12 percentage points in AUPR and AUROC metrics. We further show that our approach remains extremely effective when applied to video data and associated spatio-temporal features by detecting adversarial samples on activity classification tasks using UCF101 dataset, and the C3D network. To our knowledge, our methodology is the first one reported for reliably detecting white-box adversarial framing, a state-of-the-art adversarial attack for video classifiers. △ Less

Submitted 25 September, 2019; originally announced September 2019.

arXiv:1905.13307 [pdf, other]

Real-time Approximate Bayesian Computation for Scene Understanding

Authors: Javier Felip, Nilesh Ahuja, David Gómez-Gutiérrez, Omesh Tickoo, Vikash Mansinghka

Abstract: Consider scene understanding problems such as predicting where a person is probably reaching, or inferring the pose of 3D objects from depth images, or inferring the probable street crossings of pedestrians at a busy intersection. This paper shows how to solve these problems using Approximate Bayesian Computation. The underlying generative models are built from realistic simulation software, wrapp… ▽ More Consider scene understanding problems such as predicting where a person is probably reaching, or inferring the pose of 3D objects from depth images, or inferring the probable street crossings of pedestrians at a busy intersection. This paper shows how to solve these problems using Approximate Bayesian Computation. The underlying generative models are built from realistic simulation software, wrapped in a Bayesian error model for the gap between simulation outputs and real data. The simulators are drawn from off-the-shelf computer graphics, video game, and traffic simulation code. The paper introduces two techniques for speeding up inference that can be used separately or in combination. The first is to train neural surrogates of the simulators, using a simple form of domain randomization to make the surrogates more robust to the gap between the simulation and reality. The second is to adaptively discretize the latent variables using a Tree-pyramid approach adapted from computer graphics. This paper also shows performance and accuracy measurements on real-world problems, establishing that it is feasible to solve these problems in real-time. △ Less

Submitted 22 May, 2019; originally announced May 2019.

arXiv:1807.09810 [pdf, other]

Coreset-Based Neural Network Compression

Authors: Abhimanyu Dubey, Moitreya Chatterjee, Narendra Ahuja

Abstract: We propose a novel Convolutional Neural Network (CNN) compression algorithm based on coreset representations of filters. We exploit the redundancies extant in the space of CNN weights and neuronal activations (across samples) in order to obtain compression. Our method requires no retraining, is easy to implement, and obtains state-of-the-art compression performance across a wide variety of CNN arc… ▽ More We propose a novel Convolutional Neural Network (CNN) compression algorithm based on coreset representations of filters. We exploit the redundancies extant in the space of CNN weights and neuronal activations (across samples) in order to obtain compression. Our method requires no retraining, is easy to implement, and obtains state-of-the-art compression performance across a wide variety of CNN architectures. Coupled with quantization and Huffman coding, we create networks that provide AlexNet-like accuracy, with a memory footprint that is $832\times$ smaller than the original AlexNet, while also introducing significant reductions in inference time as well. Additionally these compressed networks when fine-tuned, successfully generalize to other domains as well. △ Less

Submitted 25 July, 2018; originally announced July 2018.

Comments: Camera-Ready version for ECCV 2018

arXiv:1804.00650 [pdf, other]

DeepMVS: Learning Multi-view Stereopsis

Authors: Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, Jia-Bin Huang

Abstract: We present DeepMVS, a deep convolutional neural network (ConvNet) for multi-view stereo reconstruction. Taking an arbitrary number of posed images as input, we first produce a set of plane-sweep volumes and use the proposed DeepMVS network to predict high-quality disparity maps. The key contributions that enable these results are (1) supervised pretraining on a photorealistic synthetic dataset, (2… ▽ More We present DeepMVS, a deep convolutional neural network (ConvNet) for multi-view stereo reconstruction. Taking an arbitrary number of posed images as input, we first produce a set of plane-sweep volumes and use the proposed DeepMVS network to predict high-quality disparity maps. The key contributions that enable these results are (1) supervised pretraining on a photorealistic synthetic dataset, (2) an effective method for aggregating information across a set of unordered images, and (3) integrating multi-layer feature activations from the pre-trained VGG-19 network. We validate the efficacy of DeepMVS using the ETH3D Benchmark. Our results show that DeepMVS compares favorably against state-of-the-art conventional MVS algorithms and other ConvNet based methods, particularly for near-textureless regions and thin structures. △ Less

Submitted 2 April, 2018; originally announced April 2018.

Comments: CVPR 2018. Project page: https://phuang17.github.io/DeepMVS/ Code: https://github.com/phuang17/DeepMVS

arXiv:1710.04200 [pdf, other]

Joint Image Filtering with Deep Convolutional Networks

Authors: Yijun Li, Jia-Bin Huang, Narendra Ahuja, Ming-Hsuan Yang

Abstract: Joint image filters leverage the guidance image as a prior and transfer the structural details from the guidance image to the target image for suppressing noise or enhancing spatial resolution. Existing methods either rely on various explicit filter constructions or hand-designed objective functions, thereby making it difficult to understand, improve, and accelerate these filters in a coherent fra… ▽ More Joint image filters leverage the guidance image as a prior and transfer the structural details from the guidance image to the target image for suppressing noise or enhancing spatial resolution. Existing methods either rely on various explicit filter constructions or hand-designed objective functions, thereby making it difficult to understand, improve, and accelerate these filters in a coherent framework. In this paper, we propose a learning-based approach for constructing joint filters based on Convolutional Neural Networks. In contrast to existing methods that consider only the guidance image, the proposed algorithm can selectively transfer salient structures that are consistent with both guidance and target images. We show that the model trained on a certain type of data, e.g., RGB and depth images, generalizes well to other modalities, e.g., flash/non-Flash and RGB/NIR images. We validate the effectiveness of the proposed joint filter through extensive experimental evaluations with state-of-the-art methods. △ Less

Submitted 2 January, 2019; v1 submitted 11 October, 2017; originally announced October 2017.

Comments: Accepted by TPAMI

arXiv:1710.02139 [pdf, other]

Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Authors: Shun Zhang, Jia-Bin Huang, Jongwoo Lim, Yihong Gong, **jun Wang, Narendra Ahuja, Ming-Hsuan Yang

Abstract: Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations.… ▽ More Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches which are only trained on large-scale face image datasets offline, we use the contextual constraints to generate a large number of training samples for a given video, and further adapt the pre-trained face CNN to specific videos using discovered training samples. Using these training samples, we optimize the embedding space so that the Euclidean distances correspond to a measure of semantic face similarity via minimizing a triplet loss function. With the learned discriminative features, we apply the hierarchical clustering algorithm to link tracklets across multiple shots to generate trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques. △ Less

Submitted 5 October, 2017; originally announced October 2017.

Comments: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking/

arXiv:1710.01992 [pdf, other]

Fast and Accurate Image Super-Resolution with Deep Laplacian Pyramid Networks

Authors: Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, Ming-Hsuan Yang

Abstract: Convolutional neural networks have recently demonstrated high-quality reconstruction for single image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this paper, we propose the deep Laplacian Pyramid Super-Resolution Network for fast and accurate… ▽ More Convolutional neural networks have recently demonstrated high-quality reconstruction for single image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this paper, we propose the deep Laplacian Pyramid Super-Resolution Network for fast and accurate image super-resolution. The proposed network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. In contrast to existing methods that involve the bicubic interpolation for pre-processing (which results in large feature maps), the proposed method directly extracts features from the low-resolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss functions and achieve high-quality image reconstruction. Furthermore, we utilize the recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of run-time and image quality. △ Less

Submitted 9 August, 2018; v1 submitted 4 October, 2017; originally announced October 2017.

Comments: The code and datasets are available at http://vllab.ucmerced.edu/wlai24/LapSRN/

arXiv:1704.03915 [pdf, other]

Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution

Authors: Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, Ming-Hsuan Yang

Abstract: Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution. In this paper, we propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images. At each pyramid level, our model takes coarse-resolution feature maps as input, predicts the high-frequency residuals,… ▽ More Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution. In this paper, we propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images. At each pyramid level, our model takes coarse-resolution feature maps as input, predicts the high-frequency residuals, and uses transposed convolutions for upsampling to the finer level. Our method does not require the bicubic interpolation as the pre-processing step and thus dramatically reduces the computational complexity. We train the proposed LapSRN with deep supervision using a robust Charbonnier loss function and achieve high-quality reconstruction. Furthermore, our network generates multi-scale predictions in one feed-forward pass through the progressive reconstruction, thereby facilitates resource-aware applications. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of speed and accuracy. △ Less

Submitted 9 October, 2017; v1 submitted 12 April, 2017; originally announced April 2017.

Comments: This work is accepted in CVPR 2017. The code and datasets are available on http://vllab.ucmerced.edu/wlai24/LapSRN/

arXiv:1605.06325 [pdf, other]

Superpixel Hierarchy

Authors: Xing Wei, Qingxiong Yang, Yihong Gong, Ming-Hsuan Yang, Narendra Ahuja

Abstract: Superpixel segmentation is becoming ubiquitous in computer vision. In practice, an object can either be represented by a number of segments in finer levels of detail or included in a surrounding region at coarser levels of detail, and thus a superpixel segmentation hierarchy is useful for applications that require different levels of image segmentation detail depending on the particular image obje… ▽ More Superpixel segmentation is becoming ubiquitous in computer vision. In practice, an object can either be represented by a number of segments in finer levels of detail or included in a surrounding region at coarser levels of detail, and thus a superpixel segmentation hierarchy is useful for applications that require different levels of image segmentation detail depending on the particular image objects segmented. Unfortunately, there is no method that can generate all scales of superpixels accurately in real-time. As a result, a simple yet effective algorithm named Super Hierarchy (SH) is proposed in this paper. It is as accurate as the state-of-the-art but 1-2 orders of magnitude faster. The proposed method can be directly integrated with recent efficient edge detectors like the structured forest edges to significantly outperforms the state-of-the-art in terms of segmentation accuracy. Quantitative and qualitative evaluation on a number of computer vision applications was conducted, demonstrating that the proposed method is the top performer. △ Less

Submitted 20 May, 2016; originally announced May 2016.

arXiv:1504.07846 [pdf, other]

Incorporating Road Networks into Territory Design

Authors: Nitin Ahuja, Matthias Bender, Peter Sanders, Christian Schulz, Andreas Wagner

Abstract: Given a set of basic areas, the territory design problem asks to create a predefined number of territories, each containing at least one basic area, such that an objective function is optimized. Desired properties of territories often include a reasonable balance, compact form, contiguity and small average journey times which are usually encoded in the objective function or formulated as constrain… ▽ More Given a set of basic areas, the territory design problem asks to create a predefined number of territories, each containing at least one basic area, such that an objective function is optimized. Desired properties of territories often include a reasonable balance, compact form, contiguity and small average journey times which are usually encoded in the objective function or formulated as constraints. We address the territory design problem by develo** graph theoretic models that also consider the underlying road network. The derived graph models enable us to tackle the territory design problem by modifying graph partitioning algorithms and mixed integer programming formulations so that the objective of the planning problem is taken into account. We test and compare the algorithms on several real world instances. △ Less

Submitted 5 May, 2015; v1 submitted 29 April, 2015; originally announced April 2015.

arXiv:1409.1412 [pdf]

MEEP: Multihop Energy Efficient Protocol For Heterogeneous Wireless Sensor Network

Authors: Surender Kumar, Manish Prateek, N. J. Ahuja, Bharat Bhushan

Abstract: Energy conservation of sensor nodes for increasing the network life is the most crucial design goal while develo** efficient routing protocol for wireless sensor networks. Recent technological advances help in the development of wide variety of sensor nodes. Heterogeneity takes the advantage of different types of sensor nodes and improves the energy efficiency and network life. Generally sensors… ▽ More Energy conservation of sensor nodes for increasing the network life is the most crucial design goal while develo** efficient routing protocol for wireless sensor networks. Recent technological advances help in the development of wide variety of sensor nodes. Heterogeneity takes the advantage of different types of sensor nodes and improves the energy efficiency and network life. Generally sensors are deployed randomly and densely in a sensing region so short distance multihop communication reduces the long distance transmission in the sensor network. In this research paper MEEP (multihop energy efficient protocol for heterogeneous sensor network) is proposed. The proposed protocol combines the idea of clustering and multihop communication. Heterogeneity is created in the network by using some nodes of high energy. Low energy nodes use a residual energy based scheme to become a cluster head. High energy nodes act as the relay nodes for low energy cluster head when they are not performing the duty of a cluster head to save their energy further. Protocol also suggests a sleep state for nodes in the cluster formation process for saving energy and increasing the life of sensor network. Simulation result shows that the proposed scheme is better than other two level heterogeneous sensor network protocol like SEP in energy efficiency and network life. △ Less

Submitted 2 September, 2014; originally announced September 2014.

Comments: 9 Pages, 10 Figures, http://www.ijcst.org/Volume5/Issue3/2014. arXiv admin note: substantial text overlap with arXiv:1408.3110

arXiv:1408.3110 [pdf]

doi 10.5120/15383-4047

MEECDA: Multihop Energy Efficient Clustering and Data Aggregation Protocol for HWSN

Authors: Surender Kumar, Manish Prateek, N. J. Ahuja, Bharat Bhushan

Abstract: Wireless sensor network consists of large number of inexpensive tiny sensors which are connected with low power wireless communications. Most of the routing and data dissemination protocols of WSN assume a homogeneous network architecture, in which all sensors have the same capabilities in terms of battery power, communication, sensing, storage, and processing. However the continued advances in mi… ▽ More Wireless sensor network consists of large number of inexpensive tiny sensors which are connected with low power wireless communications. Most of the routing and data dissemination protocols of WSN assume a homogeneous network architecture, in which all sensors have the same capabilities in terms of battery power, communication, sensing, storage, and processing. However the continued advances in miniaturization of processors and low-power communications have enabled the development of a wide variety of nodes. When more than one type of node is integrated into a WSN, it is called heterogeneous. Multihop short distance communication is an important scheme to reduce the energy consumption in a sensor network because nodes are densely deployed in a WSN. In this paper M-EECDA (Multihop Energy Efficient Clustering & Data Aggregation Protocol for Heterogeneous WSN) is proposed and analyzed. The protocol combines the idea of multihop communications and clustering for achieving the best performance in terms of network life and energy consumption. M-EECDA introduces a sleep state and three tier architecture for some cluster heads to save energy of the network. M-EECDA consists of three types of sensor nodes: normal, advance and super. To become cluster head in a round normal nodes use residual energy based scheme. Advance and super nodes further act as relay node to reduce the transmission load of a normal node cluster head when they are not cluster heads in a round. △ Less

Submitted 13 August, 2014; originally announced August 2014.

Comments: 8 pages, 11 figures. available at http://ijcaonline.org/2014. arXiv admin note: substantial text overlap with arXiv:1408.2914

arXiv:1408.2914 [pdf]

doi 10.5120/15384-4072

DE-LEACH: Distance and Energy Aware LEACH

Authors: Surender Kumar, Manish Prateek, N. J. Ahuja, Bharat Bhushan

Abstract: Wireless sensor network consists of large number of tiny sensor nodes which are usually deployed in a harsh environment. Self configuration and infrastructure less are the two fundamental properties of sensor networks. Sensor nodes are highly energy constrained devices because they are battery operated devices and due to harsh environment deployment it is impossible to change or recharge their bat… ▽ More Wireless sensor network consists of large number of tiny sensor nodes which are usually deployed in a harsh environment. Self configuration and infrastructure less are the two fundamental properties of sensor networks. Sensor nodes are highly energy constrained devices because they are battery operated devices and due to harsh environment deployment it is impossible to change or recharge their battery. Energy conservation and prolonging the network life are two major challenges in a sensor network. Communication consumes the large portion of WSN energy. Several protocols have been proposed to realize power- efficient communication in a wireless sensor network. Cluster based routing protocols are best known for increasing energy efficiency, stability and network lifetime of WSNs. Low Energy Adaptive Clustering Hierarchy (LEACH) is an important protocol in this class. One of the disadvantages of LEACH is that it does not consider the nodes energy and distance for the election of cluster head. This paper proposes a new energy efficient clustering protocol DE-LEACH for homogeneous wireless sensor network which is an extension of LEACH. DE-LEACH elects cluster head on the basis of distance and residual energy of the nodes. Proposed protocol increases the network life, stability and throughput of sensor network and simulations result shows that DE-LEACH is better than LEACH. △ Less

Submitted 13 August, 2014; originally announced August 2014.

Comments: 7 pages, 5 figures. available online at http://ijcaonline.org/2014

arXiv:1109.2389 [pdf, other]

A Probabilistic Framework for Discriminative Dictionary Learning

Authors: Bernard Ghanem, Narendra Ahuja

Abstract: In this paper, we address the problem of discriminative dictionary learning (DDL), where sparse linear representation and classification are combined in a probabilistic framework. As such, a single discriminative dictionary and linear binary classifiers are learned jointly. By encoding sparse representation and discriminative classification models in a MAP setting, we propose a general optimizatio… ▽ More In this paper, we address the problem of discriminative dictionary learning (DDL), where sparse linear representation and classification are combined in a probabilistic framework. As such, a single discriminative dictionary and linear binary classifiers are learned jointly. By encoding sparse representation and discriminative classification models in a MAP setting, we propose a general optimization framework that allows for a data-driven tradeoff between faithful representation and accurate classification. As opposed to previous work, our learning methodology is capable of incorporating a diverse family of classification cost functions (including those used in popular boosting methods), while avoiding the need for involved optimization techniques. We show that DDL can be solved by a sequence of updates that make use of well-known and well-studied sparse coding and dictionary learning algorithms from the literature. To validate our DDL framework, we apply it to digit classification and face recognition and test it on standard benchmarks. △ Less

Submitted 12 September, 2011; originally announced September 2011.

Comments: 10 pages, 4 figures, conference, dictionary learning, sparse coding

arXiv:1109.2388 [pdf, other]

MIS-Boost: Multiple Instance Selection Boosting

Authors: Emre Akbas, Bernard Ghanem, Narendra Ahuja

Abstract: In this paper, we present a new multiple instance learning (MIL) method, called MIS-Boost, which learns discriminative instance prototypes by explicit instance selection in a boosting framework. Unlike previous instance selection based MIL methods, we do not restrict the prototypes to a discrete set of training instances but allow them to take arbitrary values in the instance feature space. We als… ▽ More In this paper, we present a new multiple instance learning (MIL) method, called MIS-Boost, which learns discriminative instance prototypes by explicit instance selection in a boosting framework. Unlike previous instance selection based MIL methods, we do not restrict the prototypes to a discrete set of training instances but allow them to take arbitrary values in the instance feature space. We also do not restrict the total number of prototypes and the number of selected-instances per bag; these quantities are completely data-driven. We show that MIS-Boost outperforms state-of-the-art MIL methods on a number of benchmark datasets. We also apply MIS-Boost to large-scale image classification, where we show that the automatically selected prototypes map to visually meaningful image regions. △ Less

Submitted 12 September, 2011; originally announced September 2011.

arXiv:1102.1292 [pdf, other]

Modeling Dynamic Swarms

Authors: Bernard Ghanem, Narendra Ahuja

Abstract: This paper proposes the problem of modeling video sequences of dynamic swarms (DS). We define DS as a large layout of stochastically repetitive spatial configurations of dynamic objects (swarm elements) whose motions exhibit local spatiotemporal interdependency and stationarity, i.e., the motions are similar in any small spatiotemporal neighborhood. Examples of DS abound in nature, e.g., herds of… ▽ More This paper proposes the problem of modeling video sequences of dynamic swarms (DS). We define DS as a large layout of stochastically repetitive spatial configurations of dynamic objects (swarm elements) whose motions exhibit local spatiotemporal interdependency and stationarity, i.e., the motions are similar in any small spatiotemporal neighborhood. Examples of DS abound in nature, e.g., herds of animals and flocks of birds. To capture the local spatiotemporal properties of the DS, we present a probabilistic model that learns both the spatial layout of swarm elements and their joint dynamics that are modeled as linear transformations. To this end, a spatiotemporal neighborhood is associated with each swarm element, in which local stationarity is enforced both spatially and temporally. We assume that the prior on the swarm dynamics is distributed according to an MRF in both space and time. Embedding this model in a MAP framework, we iterate between learning the spatial layout of the swarm and its dynamics. We learn the swarm transformations using ICM, which iterates between estimating these transformations and updating their distribution in the spatiotemporal neighborhoods. We demonstrate the validity of our method by conducting experiments on real video sequences. Real sequences of birds, geese, robot swarms, and pedestrians evaluate the applicability of our model to real world data. △ Less

Submitted 7 February, 2011; originally announced February 2011.

Comments: 11 pages, 17 figures, conference paper, computer vision

Showing 1–50 of 50 results for author: Ahuja, N