Search | arXiv e-print repository

SlideSLAM: Sparse, Lightweight, Decentralized Metric-Semantic SLAM for Multi-Robot Navigation

Authors: Xu Liu, Jiuzhou Lei, Ankit Prabhu, Yuezhan Tao, Igor Spasojevic, Pratik Chaudhari, Nikolay Atanasov, Vijay Kumar

Abstract: This paper develops a real-time decentralized metric-semantic Simultaneous Localization and Map** (SLAM) approach that leverages a sparse and lightweight object-based representation to enable a heterogeneous robot team to autonomously explore 3D environments featuring indoor, urban, and forested areas without relying on GPS. We use a hierarchical metric-semantic representation of the environment… ▽ More This paper develops a real-time decentralized metric-semantic Simultaneous Localization and Map** (SLAM) approach that leverages a sparse and lightweight object-based representation to enable a heterogeneous robot team to autonomously explore 3D environments featuring indoor, urban, and forested areas without relying on GPS. We use a hierarchical metric-semantic representation of the environment, including high-level sparse semantic maps of object models and low-level voxel maps. We leverage the informativeness and viewpoint invariance of the high-level semantic map to obtain an effective semantics-driven place-recognition algorithm for inter-robot loop closure detection across aerial and ground robots with different sensing modalities. A communication module is designed to track each robot's observations and those of other robots within the communication range. Such observations are then used to construct a merged map. Our framework enables real-time decentralized operations onboard robots, allowing them to opportunistically leverage communication. We integrate and deploy our proposed framework on three types of aerial and ground robots. Extensive experimental results show an average localization error of 0.22 meters in position and -0.16 degrees in orientation, an object map** F1 score of 0.92, and a communication packet size of merely 2-3 megabytes per kilometer trajectory with 1,000 landmarks. The project website can be found at https://xurobotics.github.io/slideslam/. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Preliminary release

arXiv:2404.09447 [pdf, other]

kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Authors: Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr

Abstract: Rapid advancements in continual segmentation have yet to bridge the gap of scaling to large continually expanding vocabularies under compute-constrained scenarios. We discover that traditional continual training leads to catastrophic forgetting under compute constraints, unable to outperform zero-shot segmentation methods. We introduce a novel strategy for semantic and panoptic segmentation with z… ▽ More Rapid advancements in continual segmentation have yet to bridge the gap of scaling to large continually expanding vocabularies under compute-constrained scenarios. We discover that traditional continual training leads to catastrophic forgetting under compute constraints, unable to outperform zero-shot segmentation methods. We introduce a novel strategy for semantic and panoptic segmentation with zero forgetting, capable of adapting to continually growing vocabularies without the need for retraining or large memory costs. Our training-free approach, kNN-CLIP, leverages a database of instance embeddings to enable open-vocabulary segmentation approaches to continually expand their vocabulary on any given domain with a single-pass through data, while only storing embeddings minimizing both compute and memory costs. This method achieves state-of-the-art mIoU performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: 10 pages, 3 figures

arXiv:2404.06405 [pdf, other]

Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry

Authors: Shiven Sinha, Ameya Prabhu, Ponnurangam Kumaraguru, Siddharth Bhat, Matthias Bethge

Abstract: Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 2… ▽ More Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist. △ Less

Submitted 11 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: Work in Progress. Released for wider feedback

arXiv:2404.05764 [pdf]

Study of the effect of Sharpness on Blind Video Quality Assessment

Authors: Anantha Prabhu, David Pratap, Narayana Darapeni, Anwesh P R

Abstract: Introduction: Video Quality Assessment (VQA) is one of the important areas of study in this modern era, where video is a crucial component of communication with applications in every field. Rapid technology developments in mobile technology enabled anyone to create videos resulting in a varied range of video quality scenarios. Objectives: Though VQA was present for some time with the classical met… ▽ More Introduction: Video Quality Assessment (VQA) is one of the important areas of study in this modern era, where video is a crucial component of communication with applications in every field. Rapid technology developments in mobile technology enabled anyone to create videos resulting in a varied range of video quality scenarios. Objectives: Though VQA was present for some time with the classical metrices like SSIM and PSNR, the advent of machine learning has brought in new techniques of VQAs which are built upon Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs). Methods: Over the past years various research studies such as the BVQA which performed video quality assessment of nature-based videos using DNNs exposed the powerful capabilities of machine learning algorithms. BVQA using DNNs explored human visual system effects such as content dependency and time-related factors normally known as temporal effects. Results: This study explores the sharpness effect on models like BVQA. Sharpness is the measure of the clarity and details of the video image. Sharpness typically involves analyzing the edges and contrast of the image to determine the overall level of detail and sharpness. Conclusion: This study uses the existing video quality databases such as CVD2014. A comparative study of the various machine learning parameters such as SRCC and PLCC during the training and testing are presented along with the conclusion. △ Less

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2404.04125 [pdf, other]

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

Authors: Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge

Abstract: Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream conce… ▽ More Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found. △ Less

Submitted 8 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

Comments: Extended version of the short paper accepted at DPFM, ICLR'24

arXiv:2402.19472 [pdf, other]

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

Authors: Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for no… ▽ More Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this challenge, we also introduce an efficient evaluation framework: Sort \& Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples, enabling cost-effective lifelong benchmarking. Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (1000x reduction) on a single A100 GPU, with low approximation error. As such, lifelong benchmarks offer a robust, practical solution to the "benchmark exhaustion" problem. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.14015 [pdf, other]

Corrective Machine Unlearning

Authors: Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal

Abstract: Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and in general, reduced accuracy on certain input do… ▽ More Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and in general, reduced accuracy on certain input domains. Often, all manipulated training samples are not known, and only a small, representative subset of the affected data is flagged. We formalize "Corrective Machine Unlearning" as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of impacted samples. We demonstrate that the problem of corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find most existing unlearning methods, including the gold-standard retraining-from-scratch, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, SSD, achieves limited success in unlearning adverse effects with just a small portion of the manipulated samples, showing the tractability of this setting. We hope our work spurs research towards develo** better methods for corrective unlearning and offers practitioners a new strategy to handle data integrity challenges arising from web-scale training. △ Less

Submitted 21 February, 2024; originally announced February 2024.

Comments: 17 pages, 7 figures

arXiv:2402.08823 [pdf, other]

RanDumb: A Simple Approach that Questions the Efficacy of Continual Representation Learning

Authors: Ameya Prabhu, Shiven Sinha, Ponnurangam Kumaraguru, Philip H. S. Torr, Ozan Sener, Puneet K. Dokania

Abstract: We propose RanDumb to examine the efficacy of continual representation learning. RanDumb embeds raw pixels using a fixed random transform which approximates an RBF-Kernel, initialized before seeing any data, and learns a simple linear classifier on top. We present a surprising and consistent finding: RanDumb significantly outperforms the continually learned representations using deep networks acro… ▽ More We propose RanDumb to examine the efficacy of continual representation learning. RanDumb embeds raw pixels using a fixed random transform which approximates an RBF-Kernel, initialized before seeing any data, and learns a simple linear classifier on top. We present a surprising and consistent finding: RanDumb significantly outperforms the continually learned representations using deep networks across numerous continual learning benchmarks, demonstrating the poor performance of representation learning in these scenarios. RanDumb stores no exemplars and performs a single pass over the data, processing one sample at a time. It complements GDumb, operating in a low-exemplar regime where GDumb has especially poor performance. We reach the same consistent conclusions when RanDumb is extended to scenarios with pretrained models replacing the random transform with pretrained feature extractor. Our investigation is both surprising and alarming as it questions our understanding of how to effectively design and train models that require efficient continual representation learning, and necessitates a principled reinvestigation of the widely explored problem formulation itself. Our code is available at https://github.com/drimpossible/RanDumb. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Comments: Tech Report

arXiv:2311.11293 [pdf, other]

From Categories to Classifier: Name-Only Continual Learning by Exploring the Web

Authors: Ameya Prabhu, Hasan Abed Al Kader Hammoud, Ser-Nam Lim, Bernard Ghanem, Philip H. S. Torr, Adel Bibi

Abstract: Continual Learning (CL) often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. In this scenario, learners adapt to new category shifts using only category names without the luxury of annot… ▽ More Continual Learning (CL) often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. In this scenario, learners adapt to new category shifts using only category names without the luxury of annotated training data. Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification. We investigate the reliability of our web data and find them comparable, and in some cases superior, to manually annotated datasets. Additionally, we show that by harnessing the web, we can create support sets that surpass state-of-the-art name-only classification that create support sets using generative models or image retrieval from LAION-5B, achieving up to 25% boost in accuracy. When applied across varied continual learning contexts, our method consistently exhibits a small performance gap in comparison to models trained on manually annotated datasets. We present EvoTrends, a class-incremental dataset made from the web to capture real-world trends, created in just minutes. Overall, this paper underscores the potential of using uncurated webly-supervised data to mitigate the challenges associated with manual data labeling in continual learning. △ Less

Submitted 19 November, 2023; originally announced November 2023.

arXiv:2310.02162 [pdf, other]

TreeScope: An Agricultural Robotics Dataset for LiDAR-Based Map** of Trees in Forests and Orchards

Authors: Derek Cheng, Fernando Cladera Ojeda, Ankit Prabhu, Xu Liu, Alan Zhu, Patrick Corey Green, Reza Ehsani, Pratik Chaudhari, Vijay Kumar

Abstract: Data collection for forestry, timber, and agriculture currently relies on manual techniques which are labor-intensive and time-consuming. We seek to demonstrate that robotics offers improvements over these techniques and accelerate agricultural research, beginning with semantic segmentation and diameter estimation of trees in forests and orchards. We present TreeScope v1.0, the first robotics data… ▽ More Data collection for forestry, timber, and agriculture currently relies on manual techniques which are labor-intensive and time-consuming. We seek to demonstrate that robotics offers improvements over these techniques and accelerate agricultural research, beginning with semantic segmentation and diameter estimation of trees in forests and orchards. We present TreeScope v1.0, the first robotics dataset for precision agriculture and forestry addressing the counting and map** of trees in forestry and orchards. TreeScope provides LiDAR data from agricultural environments collected with robotics platforms, such as UAV and mobile robot platforms carried by vehicles and human operators. In the first release of this dataset, we provide ground-truth data with over 1,800 manually annotated semantic labels for tree stems and field-measured tree diameters. We share benchmark scripts for these tasks that researchers may use to evaluate the accuracy of their algorithms. Finally, we run our open-source diameter estimation and off-the-shelf semantic segmentation algorithms and share our baseline results. △ Less

Submitted 3 October, 2023; originally announced October 2023.

Comments: Submitted to 2024 IEEE International Conference on Robotics and Automation (ICRA 2024) for review

arXiv:2309.04502 [pdf, other]

On the Efficacy of Multi-scale Data Samplers for Vision Applications

Authors: Elvis Nunez, Thomas Merth, Anish Prabhu, Mehrdad Farajtabar, Mohammad Rastegari, Sachin Mehta, Maxwell Horton

Abstract: Multi-scale resolution training has seen an increased adoption across multiple vision tasks, including classification and detection. Training with smaller resolutions enables faster training at the expense of a drop in accuracy. Conversely, training with larger resolutions has been shown to improve performance, but memory constraints often make this infeasible. In this paper, we empirically study… ▽ More Multi-scale resolution training has seen an increased adoption across multiple vision tasks, including classification and detection. Training with smaller resolutions enables faster training at the expense of a drop in accuracy. Conversely, training with larger resolutions has been shown to improve performance, but memory constraints often make this infeasible. In this paper, we empirically study the properties of multi-scale training procedures. We focus on variable batch size multi-scale data samplers that randomly sample an input resolution at each training iteration and dynamically adjust their batch size according to the resolution. Such samplers have been shown to improve model accuracy beyond standard training with a fixed batch size and resolution, though it is not clear why this is the case. We explore the properties of these data samplers by performing extensive experiments on ResNet-101 and validate our conclusions across multiple architectures, tasks, and datasets. We show that multi-scale samplers behave as implicit data regularizers and accelerate training speed. Compared to models trained with single-scale samplers, we show that models trained with multi-scale samplers retain or improve accuracy, while being better-calibrated and more robust to scaling and data distribution shifts. We additionally extend a multi-scale variable batch sampler with a simple curriculum that progressively grows resolutions throughout training, allowing for a compute reduction of more than 30%. We show that the benefits of multi-scale training extend to detection and instance segmentation tasks, where we observe a 37% reduction in training FLOPs along with a 3-4% mAP increase on MS-COCO using a Mask R-CNN model. △ Less

Submitted 8 September, 2023; originally announced September 2023.

arXiv:2308.06658 [pdf, other]

Robust Localization of Aerial Vehicles via Active Control of Identical Ground Vehicles

Authors: Igor Spasojevic, Xu Liu, Ankit Prabhu, Alejandro Ribeiro, George J. Pappas, Vijay Kumar

Abstract: This paper addresses the problem of active collaborative localization in heterogeneous robot teams with unknown data association. It involves positioning a small number of identical unmanned ground vehicles (UGVs) at desired positions so that an unmanned aerial vehicle (UAV) can, through unlabelled measurements of UGVs, uniquely determine its global pose. We model the problem as a sequential two p… ▽ More This paper addresses the problem of active collaborative localization in heterogeneous robot teams with unknown data association. It involves positioning a small number of identical unmanned ground vehicles (UGVs) at desired positions so that an unmanned aerial vehicle (UAV) can, through unlabelled measurements of UGVs, uniquely determine its global pose. We model the problem as a sequential two player game, in which the first player positions the UGVs and the second identifies the two distinct hypothetical poses of the UAV at which the sets of measurements to the UGVs differ by as little as possible. We solve the underlying problem from the vantage point of the first player for a subclass of measurement models using a mixture of local optimization and exhaustive search procedures. Real-world experiments with a team of UAV and UGVs show that our method can achieve centimeter-level global localization accuracy. We also show that our method consistently outperforms random positioning of UGVs by a large margin, with as much as a 90% reduction in position and angular estimation error. Our method can tolerate a significant amount of random as well as non-stochastic measurement noise. This indicates its potential for reliable state estimation on board size, weight, and power (SWaP) constrained UAVs. This work enables robust localization in perceptually-challenged GPS-denied environments, thus paving the road for large-scale multi-robot navigation and map**. △ Less

Submitted 12 August, 2023; originally announced August 2023.

Comments: To appear in IROS 2023

arXiv:2306.09479 [pdf, other]

Inverse Scaling: When Bigger Isn't Better

Authors: Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zheng** Zhou, Najoung Kim , et al. (2 additional authors not shown)

Abstract: Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling… ▽ More Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models. △ Less

Submitted 12 May, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: Published in TMLR (2023), 39 pages

Journal ref: Transactions on Machine Learning Research (TMLR), 10/2023, https://openreview.net/forum?id=DwgRm72GQF

arXiv:2305.09275 [pdf, other]

Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right?

Authors: Hasan Abed Al Kader Hammoud, Ameya Prabhu, Ser-Nam Lim, Philip H. S. Torr, Adel Bibi, Bernard Ghanem

Abstract: We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy, which measures the accuracy of the model on the immediate next few samples. However, we show that this metric is unreliable, as even vacuous blind classifiers, which do not use input images for prediction, can achieve unrealistically high online accuracy by e… ▽ More We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy, which measures the accuracy of the model on the immediate next few samples. However, we show that this metric is unreliable, as even vacuous blind classifiers, which do not use input images for prediction, can achieve unrealistically high online accuracy by exploiting spurious label correlations in the data stream. Our study reveals that existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information, suggesting that they unintentionally learn spurious label correlations. To address this issue, we propose a novel metric for measuring adaptation based on the accuracy on the near-future samples, where spurious correlations are removed. We benchmark existing OCL approaches using our proposed metric on large-scale datasets under various computational budgets and find that better generalization can be achieved by retaining and reusing past seen information. We believe that our proposed metric can aid in the development of truly adaptive OCL methods. We provide code to reproduce our results at https://github.com/drimpossible/EvalOCL. △ Less

Submitted 16 May, 2023; originally announced May 2023.

arXiv:2305.09253 [pdf, other]

Online Continual Learning Without the Storage Constraint

Authors: Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, Ozan Sener

Abstract: Traditional online continual learning (OCL) research has primarily focused on mitigating catastrophic forgetting with fixed and limited storage allocation throughout an agent's lifetime. However, a broad range of real-world applications are primarily constrained by computational costs rather than storage limitations. In this paper, we target such applications, investigating the online continual le… ▽ More Traditional online continual learning (OCL) research has primarily focused on mitigating catastrophic forgetting with fixed and limited storage allocation throughout an agent's lifetime. However, a broad range of real-world applications are primarily constrained by computational costs rather than storage limitations. In this paper, we target such applications, investigating the online continual learning problem under relaxed storage constraints and limited computational budgets. We contribute a simple algorithm, which updates a kNN classifier continually along with a fixed, pretrained feature extractor. We selected this algorithm due to its exceptional suitability for online continual learning. It can adapt to rapidly changing streams, has zero stability gap, operates within tiny computational budgets, has low storage requirements by only storing features, and has a consistency property: It never forgets previously seen data. These attributes yield significant improvements, allowing our proposed algorithm to outperform existing methods by over 20% in accuracy on two large-scale OCL datasets: Continual LOCalization (CLOC) with 39M images and 712 classes and Continual Google Landmarks V2 (CGLM) with 580K images and 10,788 classes, even when existing methods retain all previously seen images. Furthermore, we achieve this superior performance with considerably reduced computational and storage expenses. We provide code to reproduce our results at github.com/drimpossible/ACM. △ Less

Submitted 2 November, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: Tech Report [Additional Experiments and Improved ACM]

arXiv:2303.11165 [pdf, other]

Computationally Budgeted Continual Learning: What Does Matter?

Authors: Ameya Prabhu, Hasan Abed Al Kader Hammoud, Puneet Dokania, Philip H. S. Torr, Ser-Nam Lim, Bernard Ghanem, Adel Bibi

Abstract: Continual Learning (CL) aims to sequentially train models on streams of incoming data that vary in distribution by preserving previous knowledge while adapting to new data. Current CL literature focuses on restricted access to previously seen data, while imposing no constraints on the computational budget for training. This is unreasonable for applications in-the-wild, where systems are primarily… ▽ More Continual Learning (CL) aims to sequentially train models on streams of incoming data that vary in distribution by preserving previous knowledge while adapting to new data. Current CL literature focuses on restricted access to previously seen data, while imposing no constraints on the computational budget for training. This is unreasonable for applications in-the-wild, where systems are primarily constrained by computational and time budgets, not storage. We revisit this problem with a large-scale benchmark and analyze the performance of traditional CL approaches in a compute-constrained setting, where effective memory samples used in training can be implicitly restricted as a consequence of limited computation. We conduct experiments evaluating various CL sampling strategies, distillation losses, and partial fine-tuning on two large-scale datasets, namely ImageNet2K and Continual Google Landmarks V2 in data incremental, class incremental, and time incremental settings. Through extensive experiments amounting to a total of over 1500 GPU-hours, we find that, under compute-constrained setting, traditional CL approaches, with no exception, fail to outperform a simple minimal baseline that samples uniformly from memory. Our conclusions are consistent in a different number of stream time steps, e.g., 20 to 200, and under several computational budgets. This suggests that most existing CL methods are particularly too computationally expensive for realistic budgeted deployment. Code for this project is available at: https://github.com/drimpossible/BudgetCL. △ Less

Submitted 14 July, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: CVPR 2023

arXiv:2302.01047 [pdf, other]

Real-Time Evaluation in Online Continual Learning: A New Hope

Authors: Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip H. S. Torr, Bernard Ghanem

Abstract: Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions.… ▽ More Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods. △ Less

Submitted 24 March, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: Accepted at CVPR'23 as Highlight (Top 2.5%)

arXiv:2211.16882 [pdf, other]

MVRackLay: Monocular Multi-View Layout Estimation for Warehouse Racks and Shelves

Authors: Pranjali Pathre, Anurag Sahu, Ashwin Rao, Avinash Prabhu, Meher Shashwat Nigam, Tanvi Karandikar, Harit Pandya, K. Madhava Krishna

Abstract: In this paper, we propose and showcase, for the first time, monocular multi-view layout estimation for warehouse racks and shelves. Unlike typical layout estimation methods, MVRackLay estimates multi-layered layouts, wherein each layer corresponds to the layout of a shelf within a rack. Given a sequence of images of a warehouse scene, a dual-headed Convolutional-LSTM architecture outputs segmented… ▽ More In this paper, we propose and showcase, for the first time, monocular multi-view layout estimation for warehouse racks and shelves. Unlike typical layout estimation methods, MVRackLay estimates multi-layered layouts, wherein each layer corresponds to the layout of a shelf within a rack. Given a sequence of images of a warehouse scene, a dual-headed Convolutional-LSTM architecture outputs segmented racks, the front and the top view layout of each shelf within a rack. With minimal effort, such an output is transformed into a 3D rendering of all racks, shelves and objects on the shelves, giving an accurate 3D depiction of the entire warehouse scene in terms of racks, shelves and the number of objects on each shelf. MVRackLay generalizes to a diverse set of warehouse scenes with varying number of objects on each shelf, number of shelves and in the presence of other such racks in the background. Further, MVRackLay shows superior performance vis-a-vis its single view counterpart, RackLay, in layout accuracy, quantized in terms of the mean IoU and mAP metrics. We also showcase a multi-view stitching of the 3D layouts resulting in a representation of the warehouse scene with respect to a global reference frame akin to a rendering of the scene from a SLAM pipeline. To the best of our knowledge, this is the first such work to portray a 3D rendering of a warehouse scene in terms of its semantic components - Racks, Shelves and Objects - all from a single monocular camera. △ Less

Submitted 30 November, 2022; originally announced November 2022.

Journal ref: IEEE International Conference on Robotics and Biomimetics (ROBIO) 2022

arXiv:2211.02946 [pdf, other]

HREyes: Design, Development, and Evaluation of a Novel Method for AUVs to Communicate Information and Gaze Direction

Authors: Michael Fulton, Aditya Prabhu, Junaed Sattar

Abstract: We present the design, development, and evaluation of HREyes: biomimetic communication devices which use light to communicate information and, for the first time, gaze direction from AUVs to humans. First, we introduce two types of information displays using the HREye devices: active lucemes and ocular lucemes. Active lucemes communicate information explicitly through animations, while ocular luce… ▽ More We present the design, development, and evaluation of HREyes: biomimetic communication devices which use light to communicate information and, for the first time, gaze direction from AUVs to humans. First, we introduce two types of information displays using the HREye devices: active lucemes and ocular lucemes. Active lucemes communicate information explicitly through animations, while ocular lucemes communicate gaze direction implicitly by mimicking human eyes. We present a human study in which our system is compared to the use of an embedded digital display that explicitly communicates information to a diver by displaying text. Our results demonstrate accurate recognition of active lucemes for trained interactants, limited intuitive understanding of these lucemes for untrained interactants, and relatively accurate perception of gaze direction for all interactants. The results on active luceme recognition demonstrate more accurate recognition than previous light-based communication systems for AUVs (albeit with different phrase sets). Additionally, the ocular lucemes we introduce in this work represent the first method for communicating gaze direction from an AUV, a critical aspect of nonverbal communication used in collaborative work. With readily available hardware as well as open-source and easily re-configurable programming, HREyes can be easily integrated into any AUV with the physical space for the devices and used to communicate effectively with divers in any underwater environment with appropriate visibility. △ Less

Submitted 5 November, 2022; originally announced November 2022.

Comments: Under submission at ICRA23

arXiv:2209.08465 [pdf, other]

doi 10.1109/ICRA48891.2023.10161564

Active Metric-Semantic Map** by Multiple Aerial Robots

Authors: Xu Liu, Ankit Prabhu, Fernando Cladera, Ian D. Miller, Lifeng Zhou, Camillo J. Taylor, Vijay Kumar

Abstract: Traditional approaches for active map** focus on building geometric maps. For most real-world applications, however, actionable information is related to semantically meaningful objects in the environment. We propose an approach to the active metric-semantic map** problem that enables multiple heterogeneous robots to collaboratively build a map of the environment. The robots actively explore t… ▽ More Traditional approaches for active map** focus on building geometric maps. For most real-world applications, however, actionable information is related to semantically meaningful objects in the environment. We propose an approach to the active metric-semantic map** problem that enables multiple heterogeneous robots to collaboratively build a map of the environment. The robots actively explore to minimize the uncertainties in both semantic (object classification) and geometric (object modeling) information. We represent the environment using informative but sparse object models, each consisting of a basic shape and a semantic class label, and characterize uncertainties empirically using a large amount of real-world data. Given a prior map, we use this model to select actions for each robot to minimize uncertainties. The performance of our algorithm is demonstrated through multi-robot experiments in diverse real-world environments. The proposed framework is applicable to a wide range of real-world problems, such as precision agriculture, infrastructure inspection, and asset map** in factories. A demo video can be found at https://youtu.be/S86SgXi54oU. △ Less

Submitted 13 August, 2023; v1 submitted 17 September, 2022; originally announced September 2022.

Comments: ICRA 2023 (2023 International Conference on Robotics and Automation)

Journal ref: ICRA 2023 (2023 International Conference on Robotics and Automation)

arXiv:2207.10237 [pdf, other]

SPIN: An Empirical Evaluation on Sharing Parameters of Isotropic Networks

Authors: Chien-Yu Lin, Anish Prabhu, Thomas Merth, Sachin Mehta, Anurag Ranjan, Maxwell Horton, Mohammad Rastegari

Abstract: Recent isotropic networks, such as ConvMixer and vision transformers, have found significant success across visual recognition tasks, matching or outperforming non-isotropic convolutional neural networks (CNNs). Isotropic architectures are particularly well-suited to cross-layer weight sharing, an effective neural network compression technique. In this paper, we perform an empirical evaluation on… ▽ More Recent isotropic networks, such as ConvMixer and vision transformers, have found significant success across visual recognition tasks, matching or outperforming non-isotropic convolutional neural networks (CNNs). Isotropic architectures are particularly well-suited to cross-layer weight sharing, an effective neural network compression technique. In this paper, we perform an empirical evaluation on methods for sharing parameters in isotropic networks (SPIN). We present a framework to formalize major weight sharing design decisions and perform a comprehensive empirical evaluation of this design space. Guided by our experimental results, we propose a weight sharing strategy to generate a family of models with better overall efficiency, in terms of FLOPs and parameters versus accuracy, compared to traditional scaling methods alone, for example compressing ConvMixer by 1.9x while improving accuracy on ImageNet. Finally, we perform a qualitative study to further understand the behavior of weight sharing in isotropic architectures. The code is available at https://github.com/apple/ml-spin. △ Less

Submitted 20 July, 2022; originally announced July 2022.

Comments: Accepted at ECCV 2022

arXiv:2202.04433 [pdf, other]

Co-WIN: Really Winning? Analysing Inequity in India's Vaccination Response

Authors: Tanvi Karandikar, Avinash Prabhu, Mehul Mathur, Megha Arora, Hemank Lamba, Ponnurangam Kumaraguru

Abstract: The COVID-19 pandemic has so far accounted for reported 5.5M deaths worldwide, with 8.7% of these coming from India. The pandemic exacerbated the weakness of the Indian healthcare system. As of January 20, 2022, India is the second worst affected country with 38.2M reported cases and 487K deaths. According to epidemiologists, vaccines are an essential tool to prevent the spread of the pandemic. In… ▽ More The COVID-19 pandemic has so far accounted for reported 5.5M deaths worldwide, with 8.7% of these coming from India. The pandemic exacerbated the weakness of the Indian healthcare system. As of January 20, 2022, India is the second worst affected country with 38.2M reported cases and 487K deaths. According to epidemiologists, vaccines are an essential tool to prevent the spread of the pandemic. India's vaccination drive began on January 16, 2021 with governmental policies being introduced to prioritize different populations of the society. Through the course of the vaccination drive, multiple new policies were also introduced to ensure that vaccines are readily available and vaccination coverage is increased. However, at the same time, some of the government policies introduced led to unintended inequities in the populations being targeted. In this report, we enumerate and analyze the inequities that existed in India's vaccination policy drive, and also compute the effect of the new policies that were introduced. We analyze these potential inequities not only qualitatively but also quantitatively by leveraging the data that was made available through the government portals. Specifically, (a) we discover inequities that might exist in the policies, (b) we quantify the effect of new policies introduced to increase vaccination coverage, and (c) we also point the data discrepancies that exist across different data sources. △ Less

Submitted 5 June, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

arXiv:2201.06640 [pdf, other]

Towards Adversarial Evaluations for Inexact Machine Unlearning

Authors: Shashwat Goel, Ameya Prabhu, Amartya Sanyal, Ser-Nam Lim, Philip Torr, Ponnurangam Kumaraguru

Abstract: Machine Learning models face increased concerns regarding the storage of personal user data and adverse impacts of corrupted data like backdoors or systematic bias. Machine Unlearning can address these by allowing post-hoc deletion of affected training data from a learned model. Achieving this task exactly is computationally expensive; consequently, recent works have proposed inexact unlearning al… ▽ More Machine Learning models face increased concerns regarding the storage of personal user data and adverse impacts of corrupted data like backdoors or systematic bias. Machine Unlearning can address these by allowing post-hoc deletion of affected training data from a learned model. Achieving this task exactly is computationally expensive; consequently, recent works have proposed inexact unlearning algorithms to solve this approximately as well as evaluation methods to test the effectiveness of these algorithms. In this work, we first outline some necessary criteria for evaluation methods and show no existing evaluation satisfies them all. Then, we design a stronger black-box evaluation method called the Interclass Confusion (IC) test which adversarially manipulates data during training to detect the insufficiency of unlearning procedures. We also propose two analytically motivated baseline methods~(EU-k and CF-k) which outperform several popular inexact unlearning methods. Overall, we demonstrate how adversarial evaluation strategies can help in analyzing various unlearning phenomena which can guide the development of stronger unlearning algorithms. △ Less

Submitted 22 February, 2023; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: Tech Report

arXiv:2112.00448 [pdf, other]

On-Device Spatial Attention based Sequence Learning Approach for Scene Text Script Identification

Authors: Rutika Moharir, Arun D Prabhu, Sukumar Moharana, Gopi Ramena, Rachit S Munjal

Abstract: Automatic identification of script is an essential component of a multilingual OCR engine. In this paper, we present an efficient, lightweight, real-time and on-device spatial attention based CNN-LSTM network for scene text script identification, feasible for deployment on resource constrained mobile devices. Our network consists of a CNN, equipped with a spatial attention module which helps reduc… ▽ More Automatic identification of script is an essential component of a multilingual OCR engine. In this paper, we present an efficient, lightweight, real-time and on-device spatial attention based CNN-LSTM network for scene text script identification, feasible for deployment on resource constrained mobile devices. Our network consists of a CNN, equipped with a spatial attention module which helps reduce the spatial distortions present in natural images. This allows the feature extractor to generate rich image representations while ignoring the deformities and thereby, enhancing the performance of this fine grained classification task. The network also employs residue convolutional blocks to build a deep network to focus on the discriminative features of a script. The CNN learns the text feature representation by identifying each character as belonging to a particular script and the long term spatial dependencies within the text are captured using the sequence learning capabilities of the LSTM layers. Combining the spatial attention mechanism with the residue convolutional blocks, we are able to enhance the performance of the baseline CNN to build an end-to-end trainable network for script identification. The experimental results on several standard benchmarks demonstrate the effectiveness of our method. The network achieves competitive accuracy with state-of-the-art methods and is superior in terms of network size, with a total of just 1.1 million parameters and inference time of 2.7 milliseconds. △ Less

Submitted 1 December, 2021; originally announced December 2021.

Comments: Accepted for publication in CVIP 2021

arXiv:2111.12395 [pdf, other]

I'll be back: Examining Restored Accounts On Twitter

Authors: Arnav Kapoor, Rishi Raj Jain, Avinash Prabhu, Tanvi Karandikar, Ponnurangam Kumaraguru

Abstract: Online social networks like Twitter actively monitor their platform to identify accounts that go against their rules. Twitter enforces account level moderation, i.e. suspension of a Twitter account in severe cases of platform abuse. A point of note is that these suspensions are sometimes temporary and even incorrect. Twitter provides a redressal mechanism to 'restore' suspended accounts. We refer… ▽ More Online social networks like Twitter actively monitor their platform to identify accounts that go against their rules. Twitter enforces account level moderation, i.e. suspension of a Twitter account in severe cases of platform abuse. A point of note is that these suspensions are sometimes temporary and even incorrect. Twitter provides a redressal mechanism to 'restore' suspended accounts. We refer to all suspended accounts who later have their suspension reversed as 'restored accounts'. In this paper, we release the firstever dataset and methodology 1 to identify restored accounts. We inspect account properties and tweets of these restored accounts to get key insights into the effects of suspension.We build a prediction model to classify an account into normal, suspended or restored. We use SHAP values to interpret this model and identify important features. SHAP (SHapley Additive exPlanations) is a method to explain individual predictions. We show that profile features like date of account creation and the ratio of retweets to total tweets are more important than content-based features like sentiment scores and Ekman emotion scores when it comes to classification of an account as normal, suspended or restored. We investigate restored accounts further in the pre-suspension and post-restoration phases. We see that the number of tweets per account drop by 53.95% in the post-restoration phase, signifying less 'spammy' behaviour after reversal of suspension. However, there was no substantial difference in the content of the tweets posted in the pre-suspension and post-restoration phases. △ Less

Submitted 24 November, 2021; originally announced November 2021.

arXiv:2110.15923 [pdf, other]

Efficient Representation of Interaction Patterns with Hyperbolic Hierarchical Clustering for Classification of Users on Twitter

Authors: Tanvi Karandikar, Avinash Prabhu, Avinash Tulasi, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: Social media platforms play an important role in democratic processes. During the 2019 General Elections of India, political parties and politicians widely used Twitter to share their ideals, advocate their agenda and gain popularity. Twitter served as a ground for journalists, politicians and voters to interact. The organic nature of these interactions can be upended by malicious accounts on Twit… ▽ More Social media platforms play an important role in democratic processes. During the 2019 General Elections of India, political parties and politicians widely used Twitter to share their ideals, advocate their agenda and gain popularity. Twitter served as a ground for journalists, politicians and voters to interact. The organic nature of these interactions can be upended by malicious accounts on Twitter, which end up being suspended or deleted from the platform. Such accounts aim to modify the reach of content by inorganically interacting with particular handles. These interactions are a threat to the integrity of the platform, as such activity has the potential to affect entire results of democratic processes. In this work, we design a feature extraction framework which compactly captures potentially insidious interaction patterns. Our proposed features are designed to bring out communities amongst the users that work to boost the content of particular accounts. We use Hyperbolic Hierarchical Clustering (HypHC) which represents the features in the hyperbolic manifold to further separate such communities. HypHC gives the added benefit of representing these features in a lower dimensional space -- thus serving as a dimensionality reduction technique. We use these features to distinguish between different classes of users that emerged in the aftermath of the 2019 General Elections of India. Amongst the users active on Twitter during the elections, 2.8% of the users participating were suspended and 1% of the users were deleted from the platform. We demonstrate the effectiveness of our proposed features in differentiating between regular users (users who were neither suspended nor deleted), suspended users and deleted users. By leveraging HypHC in our pipeline, we obtain F1 scores of upto 93%. △ Less

Submitted 1 November, 2021; v1 submitted 29 October, 2021; originally announced October 2021.

arXiv:2110.14197 [pdf, other]

Encoder-Decoder Networks for Analyzing Thermal and Power Delivery Networks

Authors: Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu, Nikhil Patil, Palkesh Jain, Sachin S. Sapatnekar

Abstract: Power delivery network (PDN) analysis and thermal analysis are computationally expensive tasks that are essential for successful IC design. Algorithmically, both these analyses have similar computational structure and complexity as they involve the solution to a partial differential equation of the same form. This paper converts these analyses into image-to-image and sequence-to-sequence translati… ▽ More Power delivery network (PDN) analysis and thermal analysis are computationally expensive tasks that are essential for successful IC design. Algorithmically, both these analyses have similar computational structure and complexity as they involve the solution to a partial differential equation of the same form. This paper converts these analyses into image-to-image and sequence-to-sequence translation tasks, which allows leveraging a class of machine learning models with an encoder-decoder-based generative (EDGe) architecture to address the time-intensive nature of these tasks. For PDN analysis, we propose two networks: (i) IREDGe: a full-chip static and dynamic IR drop predictor and (ii) EMEDGe: electromigration (EM) hotspot classifier based on input power, power grid distribution, and power pad distribution patterns. For thermal analysis, we propose ThermEDGe, a full-chip static and dynamic temperature estimator based on input power distribution patterns for thermal analysis. These networks are transferable across designs synthesized within the same technology and packing solution. The networks predict on-chip IR drop, EM hotspot locations, and temperature in milliseconds with negligibly small errors against commercial tools requiring several hours. △ Less

Submitted 27 October, 2021; originally announced October 2021.

Comments: 26 pages, 17 figures, Submitted to TODAES for review. arXiv admin note: text overlap with arXiv:2009.09009

arXiv:2110.04252 [pdf, other]

LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time

Authors: Elvis Nunez, Maxwell Horton, Anish Prabhu, Anurag Ranjan, Ali Farhadi, Mohammad Rastegari

Abstract: When deploying deep learning models to a device, it is traditionally assumed that available computational resources (compute, memory, and power) remain static. However, real-world computing systems do not always provide stable resource guarantees. Computational resources need to be conserved when load from other processes is high or battery power is low. Inspired by recent works on neural network… ▽ More When deploying deep learning models to a device, it is traditionally assumed that available computational resources (compute, memory, and power) remain static. However, real-world computing systems do not always provide stable resource guarantees. Computational resources need to be conserved when load from other processes is high or battery power is low. Inspired by recent works on neural network subspaces, we propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models that range from highly efficient to highly accurate. Our models require no retraining, thus our subspace of models can be deployed entirely on-device to allow adaptive network compression at inference time. We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity. We achieve accuracies on-par with standard models when testing our uncompressed models, and maintain high accuracy for sparsity rates above 90% when testing our compressed models. We also demonstrate that our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks. △ Less

Submitted 8 October, 2021; originally announced October 2021.

arXiv:2110.03860 [pdf, other]

Token Pooling in Vision Transformers

Authors: Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, Oncel Tuzel

Abstract: Despite the recent success in many applications, the high computational requirements of vision transformers limit their use in resource-constrained settings. While many existing methods improve the quadratic complexity of attention, in most vision transformers, self-attention is not the major computation bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers. To impr… ▽ More Despite the recent success in many applications, the high computational requirements of vision transformers limit their use in resource-constrained settings. While many existing methods improve the quadratic complexity of attention, in most vision transformers, self-attention is not the major computation bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers. To improve the computational complexity of all layers, we propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations. We show that, under mild assumptions, softmax-attention acts as a high-dimensional low-pass (smoothing) filter. Thus, its output contains redundancy that can be pruned to achieve a better trade-off between the computational cost and accuracy. Our new technique accurately approximates a set of tokens by minimizing the reconstruction error caused by downsampling. We solve this optimization problem via cost-efficient clustering. We rigorously analyze and compare to prior downsampling methods. Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over the state-of-the-art downsampling. Token Pooling is a simple and effective operator that can benefit many architectures. Applied to DeiT, it achieves the same ImageNet top-1 accuracy using 42% fewer computations. △ Less

Submitted 11 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023

arXiv:2105.07795 [pdf, other]

doi 10.1109/IJCNN52387.2021.9534319

STRIDE : Scene Text Recognition In-Device

Authors: Rachit S Munjal, Arun D Prabhu, Nikhil Arora, Sukumar Moharana, Gopi Ramena

Abstract: Optical Character Recognition (OCR) systems have been widely used in various applications for extracting semantic information from images. To give the user more control over their privacy, an on-device solution is needed. The current state-of-the-art models are too heavy and complex to be deployed on-device. We develop an efficient lightweight scene text recognition (STR) system, which has only 0.… ▽ More Optical Character Recognition (OCR) systems have been widely used in various applications for extracting semantic information from images. To give the user more control over their privacy, an on-device solution is needed. The current state-of-the-art models are too heavy and complex to be deployed on-device. We develop an efficient lightweight scene text recognition (STR) system, which has only 0.88M parameters and performs real-time text recognition. Attention modules tend to boost the accuracy of STR networks but are generally slow and not optimized for device inference. So, we propose the use of convolution attention modules to the text recognition networks, which aims to provide channel and spatial attention information to the LSTM module by adding very minimal computational cost. It boosts our word accuracy on ICDAR 13 dataset by almost 2\%. We also introduce a novel orientation classifier module, to support the simultaneous recognition of both horizontal and vertical text. The proposed model surpasses on-device metrics of inference time and memory footprint and achieves comparable accuracy when compared to the leading commercial and other open-source OCR engines. We deploy the system on-device with an inference speed of 2.44 ms per word on the Exynos 990 chipset device and achieve an accuracy of 88.4\% on ICDAR-13 dataset. △ Less

Submitted 17 May, 2021; originally announced May 2021.

Comments: accepted in IJCNN 2021

arXiv:2104.01261 [pdf, other]

Small World Student Network at the University of Texas at Dallas in Times of Social Distancing

Authors: Kailash Subramanian, Joshua M. Williams, Daniel C. DeAnda, Aditya A. Agrawal, Andrei Racila, Aditi R. Prabhu, Lawrence Redlinger, Christopher Wendt, Ravi Prakash

Abstract: To limit the spread of the novel coronavirus on college campuses, a common strategy for the Fall 2020 and Spring 2021 terms has been to offer instruction weighted toward hybrid or fully online modalities. Colleges are now considering whether and how to expand hybrid or fully in-person instruction for future terms, and learn lessons from this experience for future use. Our paper uses Fall 2019 enro… ▽ More To limit the spread of the novel coronavirus on college campuses, a common strategy for the Fall 2020 and Spring 2021 terms has been to offer instruction weighted toward hybrid or fully online modalities. Colleges are now considering whether and how to expand hybrid or fully in-person instruction for future terms, and learn lessons from this experience for future use. Our paper uses Fall 2019 enrollment data for a medium-sized public American university to analyze whether some student grou**s by class standing or course level are more susceptible to the spread of infectious disease through academic enrollment networks. Replicating Weeden and Cornwell [8], we find that enrollment networks at the institution are "small worlds" characterized by high clustering, short average path lengths, and multiple independent connections. Connectivity decreases as class standing (graduate vs. undergraduate; senior vs. freshman) and course level increase; as students move from generalized to specialized course loads, networks cluster by major. Holding other factors constant, policies focusing on in-person instruction for lower division students conflict with the greater risk of infectious spread through a lower division network in the absence of additional steps to minimize academic connectivity. There are academic and financial incentives for emphasizing the freshman experience, including concerns about student attrition from the first to second academic year and recou** costs of infrastructure investments in dormitories. Possible solutions could include (i) restricting face-to-face or hybrid instruction to courses in students' academic majors, which would disrupt larger networks into smaller ones and thus restrict the spread of infection across majors, and (ii) take a "scalpel" approach to instruction modes by moving online courses most likely to facilitate epidemic spread. △ Less

Submitted 2 April, 2021; originally announced April 2021.

arXiv:2104.00795 [pdf, other]

No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks

Authors: Shyamgopal Karthik, Ameya Prabhu, Puneet K. Dokania, Vineet Gandhi

Abstract: There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we… ▽ More There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reason for the reduction in average mistake-severity can be attributed to the increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. To this end, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-$k$ predictions across datasets, with very little loss in accuracy. CRM, because of its simplicity, can be used with any off-the-shelf trained model that provides reliable likelihood estimates. △ Less

Submitted 1 April, 2021; originally announced April 2021.

arXiv:2103.09174 [pdf, other]

doi 10.1145/3490035.3490263

Monocular Multi-Layer Layout Estimation for Warehouse Racks

Authors: Meher Shashwat Nigam, Avinash Prabhu, Anurag Sahu, Puru Gupta, Tanvi Karandikar, N. Sai Shankar, Ravi Kiran Sarvadevabhatla, K. Madhava Krishna

Abstract: Given a monocular colour image of a warehouse rack, we aim to predict the bird's-eye view layout for each shelf in the rack, which we term as multi-layer layout prediction. To this end, we present RackLay, a deep neural network for real-time shelf layout estimation from a single image. Unlike previous layout estimation methods, which provide a single layout for the dominant ground plane alone, Rac… ▽ More Given a monocular colour image of a warehouse rack, we aim to predict the bird's-eye view layout for each shelf in the rack, which we term as multi-layer layout prediction. To this end, we present RackLay, a deep neural network for real-time shelf layout estimation from a single image. Unlike previous layout estimation methods, which provide a single layout for the dominant ground plane alone, RackLay estimates the top-view and front-view layout for each shelf in the considered rack populated with objects. RackLay's architecture and its variants are versatile and estimate accurate layouts for diverse scenes characterized by varying number of visible shelves in an image, large range in shelf occupancy factor and varied background clutter. Given the extreme paucity of datasets in this space and the difficulty involved in acquiring real data from warehouses, we additionally release a flexible synthetic dataset generation pipeline WareSynth which allows users to control the generation process and tailor the dataset according to contingent application. The ablations across architectural variants and comparison with strong prior baselines vindicate the efficacy of RackLay as an apt architecture for the novel problem of multi-layered layout estimation. We also show that fusing the top-view and front-view enables 3D reasoning applications such as metric free space estimation for the considered rack. △ Less

Submitted 28 October, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

Comments: Visit our project repository at https://github.com/Avinash2468/RackLay

arXiv:2101.06914 [pdf, other]

Capitol (Pat)riots: A comparative study of Twitter and Parler

Authors: Hitkul, Avinash Prabhu, Dipanwita Guhathakurta, Jivitesh jain, Mallika Subramanian, Manvith Reddy, Shradha Sehgal, Tanvi Karandikar, Amogh Gulati, Udit Arora, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Abstract: On 6 January 2021, a mob of right-wing conservatives stormed the USA Capitol Hill interrupting the session of congress certifying 2020 Presidential election results. Immediately after the start of the event, posts related to the riots started to trend on social media. A social media platform which stood out was a free speech endorsing social media platform Parler; it is being claimed as the platfo… ▽ More On 6 January 2021, a mob of right-wing conservatives stormed the USA Capitol Hill interrupting the session of congress certifying 2020 Presidential election results. Immediately after the start of the event, posts related to the riots started to trend on social media. A social media platform which stood out was a free speech endorsing social media platform Parler; it is being claimed as the platform on which the riots were planned and talked about. Our report presents a contrast between the trending content on Parler and Twitter around the time of riots. We collected data from both platforms based on the trending hashtags and draw comparisons based on what are the topics being talked about, who are the people active on the platforms and how organic is the content generated on the two platforms. While the content trending on Twitter had strong resentments towards the event and called for action against rioters and inciters, Parler content had a strong conservative narrative echoing the ideas of voter fraud similar to the attacking mob. We also find a disproportionately high manipulation of traffic on Parler when compared to Twitter. △ Less

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2012.13427 [pdf]

Reproducible Workflow

Authors: Anirudh Prabhu, Peter Fox

Abstract: Reproducibility has been consistently identified as an important component of scientific research. Although there is a general consensus on the importance of reproducibility along with the other commonly used 'R' terminology (i.e., Replicability, Repeatability etc.), there is some disagreement on the usage of these terms, including conflicting definitions used by different parts of the research co… ▽ More Reproducibility has been consistently identified as an important component of scientific research. Although there is a general consensus on the importance of reproducibility along with the other commonly used 'R' terminology (i.e., Replicability, Repeatability etc.), there is some disagreement on the usage of these terms, including conflicting definitions used by different parts of the research community. In this encyclopedia article, we explore the different definitions used in scientific literature (specifically pertaining to computational research), whether there is a need for a single standardized definition and provide an alternative based on non-functional requirements. We also describe the role of reproducibility (and other R's) in scientific workflows. △ Less

Submitted 24 December, 2020; originally announced December 2020.

Comments: 7 pages, no figures. Submitted as an entry to the "Encyclopedia of Mathematical Geosciences."

arXiv:2012.02990 [pdf, other]

doi 10.1109/ICSC50631.2021.00030

Codeswitched Sentence Creation using Dependency Parsing

Authors: Dhruval Jain, Arun D Prabhu, Shubham Vatsal, Gopi Ramena, Naresh Purre

Abstract: Codeswitching has become one of the most common occurrences across multilingual speakers of the world, especially in countries like India which encompasses around 23 official languages with the number of bilingual speakers being around 300 million. The scarcity of Codeswitched data becomes a bottleneck in the exploration of this domain with respect to various Natural Language Processing (NLP) task… ▽ More Codeswitching has become one of the most common occurrences across multilingual speakers of the world, especially in countries like India which encompasses around 23 official languages with the number of bilingual speakers being around 300 million. The scarcity of Codeswitched data becomes a bottleneck in the exploration of this domain with respect to various Natural Language Processing (NLP) tasks. We thus present a novel algorithm which harnesses the syntactic structure of English grammar to develop grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data. Apart from maintaining the grammatical sanity to a great extent, our methodology also guarantees abundant generation of data from a minuscule snapshot of given data. We use multiple datasets to showcase the capabilities of our algorithm while at the same time we assess the quality of generated Codeswitched data using some qualitative metrics along with providing baseline results for couple of NLP tasks. △ Less

Submitted 5 December, 2020; originally announced December 2020.

arXiv:2012.02819 [pdf, other]

doi 10.1109/ICSC50631.2021.00033

On-Device Sentence Similarity for SMS Dataset

Authors: Arun D Prabhu, Nikhil Arora, Shubham Vatsal, Gopi Ramena, Sukumar Moharana, Naresh Purre

Abstract: Determining the sentence similarity between Short Message Service (SMS) texts/sentences plays a significant role in mobile device industry. Gauging the similarity between SMS data is thus necessary for various applications like enhanced searching and navigation, clubbing together SMS of similar type when given a custom label or tag is provided by user irrespective of their sender etc. The problem… ▽ More Determining the sentence similarity between Short Message Service (SMS) texts/sentences plays a significant role in mobile device industry. Gauging the similarity between SMS data is thus necessary for various applications like enhanced searching and navigation, clubbing together SMS of similar type when given a custom label or tag is provided by user irrespective of their sender etc. The problem faced with SMS data is its incomplete structure and grammatical inconsistencies. In this paper, we propose a unique pipeline for evaluating the text similarity between SMS texts. We use Part of Speech (POS) model for keyword extraction by taking advantage of the partial structure embedded in SMS texts and similarity comparisons are carried out using statistical methods. The proposed pipeline deals with major semantic variations across SMS data as well as makes it effective for its application on-device (mobile phone). To showcase the capabilities of our work, our pipeline has been designed with an inclination towards one of the possible applications of SMS text similarity discussed in one of the following sections but nonetheless guarantees scalability for other applications as well. △ Less

Submitted 4 December, 2020; originally announced December 2020.

arXiv:2011.10251 [pdf, other]

doi 10.1109/ICPR48806.2021.9412222

On-Device Text Image Super Resolution

Authors: Dhruval Jain, Arun D Prabhu, Gopi Ramena, Manoj Goyal, Debi Prasanna Mohanty, Sukumar Moharana, Naresh Purre

Abstract: Recent research on super-resolution (SR) has witnessed major developments with the advancements of deep convolutional neural networks. There is a need for information extraction from scenic text images or even document images on device, most of which are low-resolution (LR) images. Therefore, SR becomes an essential pre-processing step as Bicubic Upsampling, which is conventionally present in smar… ▽ More Recent research on super-resolution (SR) has witnessed major developments with the advancements of deep convolutional neural networks. There is a need for information extraction from scenic text images or even document images on device, most of which are low-resolution (LR) images. Therefore, SR becomes an essential pre-processing step as Bicubic Upsampling, which is conventionally present in smartphones, performs poorly on LR images. To give the user more control over his privacy, and to reduce the carbon footprint by reducing the overhead of cloud computing and hours of GPU usage, executing SR models on the edge is a necessity in the recent times. There are various challenges in running and optimizing a model on resource-constrained platforms like smartphones. In this paper, we present a novel deep neural network that reconstructs sharper character edges and thus boosts OCR confidence. The proposed architecture not only achieves significant improvement in PSNR over bicubic upsampling on various benchmark datasets but also runs with an average inference time of 11.7 ms per image. We have outperformed state-of-the-art on the Text330 dataset. We also achieve an OCR accuracy of 75.89% on the ICDAR 2015 TextSR dataset, where ground truth has an accuracy of 78.10%. △ Less

Submitted 20 November, 2020; originally announced November 2020.

Comments: Accepted to the International Conference on Pattern Recognition(ICPR), 2020

arXiv:2009.09009 [pdf, other]

Thermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks

Authors: Vidya A. Chhabria, Vipul Ahuja, Ashwath Prabhu, Nikhil Patil, Palkesh Jain, Sachin S. Sapatnekar

Abstract: Computationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two… ▽ More Computationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two networks: (i) ThermEDGe: a static and dynamic full-chip temperature estimator and (ii) IREDGe: a full-chip static IR drop predictor based on input power, power grid distribution, and power pad distribution patterns. The models are design-independent and must be trained just once for a particular technology and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict the on-chip temperature and IR drop contours in milliseconds (in contrast with commercial tools that require several hours or more) and provide an average error of 0.6% and 0.008% respectively. △ Less

Submitted 18 September, 2020; originally announced September 2020.

Comments: Accepted in ASP-DAC 2021 conference

arXiv:2006.02609 [pdf, other]

Simple Unsupervised Multi-Object Tracking

Authors: Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi

Abstract: Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for develo** better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestep** the labeling costs entirely, required for training. Given unlabeled videos, our proposed method (SimpleReID) first gene… ▽ More Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for develo** better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestep** the labeling costs entirely, required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using crossentropy loss. We demonstrate that SimpleReID performs substantially better than simpler alternatives, and we recover the full performance of its supervised counterpart consistently across diverse tracking frameworks. The observations are unusual because unsupervised ReID is not expected to excel in crowded scenarios with occlusions, and drastic viewpoint changes. By incorporating our unsupervised SimpleReID with CenterTrack trained on augmented still images, we establish a new state-of-the-art performance on popular datasets like MOT16/17 without using tracking supervision, beating current best (CenterTrack) by 0.2-0.3 MOTA and 4.4-4.8 IDF1 scores. We further provide evidence for limited scope for improvement in IDF1 scores beyond our unsupervised ReID in the studied settings. Our investigation suggests reconsideration towards more sophisticated, supervised, end-to-end trackers by showing promise in simpler unsupervised alternatives. △ Less

Submitted 3 June, 2020; originally announced June 2020.

arXiv:1911.11433 [pdf, other]

"You might also like this model": Data Driven Approach for Recommending Deep Learning Models for Unknown Image Datasets

Authors: Ameya Prabhu, Riddhiman Dasgupta, Anush Sankaran, Srikanth Tamilselvam, Senthil Mani

Abstract: For an unknown (new) classification dataset, choosing an appropriate deep learning architecture is often a recursive, time-taking, and laborious process. In this research, we propose a novel technique to recommend a suitable architecture from a repository of known models. Further, we predict the performance accuracy of the recommended architecture on the given unknown dataset, without the need for… ▽ More For an unknown (new) classification dataset, choosing an appropriate deep learning architecture is often a recursive, time-taking, and laborious process. In this research, we propose a novel technique to recommend a suitable architecture from a repository of known models. Further, we predict the performance accuracy of the recommended architecture on the given unknown dataset, without the need for training the model. We propose a model encoder approach to learn a fixed length representation of deep learning architectures along with its hyperparameters, in an unsupervised fashion. We manually curate a repository of image datasets with corresponding known deep learning models and show that the predicted accuracy is a good estimator of the actual accuracy. We discuss the implications of the proposed approach for three benchmark images datasets and also the challenges in using the approach for text modality. To further increase the reproducibility of the proposed approach, the entire implementation is made publicly available along with the trained models. △ Less

Submitted 20 May, 2020; v1 submitted 26 November, 2019; originally announced November 2019.

Comments: NeurIPS 2019, New in ML Group

arXiv:1909.09389 [pdf, other]

Sampling Bias in Deep Active Classification: An Empirical Study

Authors: Ameya Prabhu, Charles Dognin, Maneesh Singh

Abstract: The exploding cost and time needed for data labeling and model training are bottlenecks for training DNN models on large datasets. Identifying smaller representative data samples with strategies like active learning can help mitigate such bottlenecks. Previous works on active learning in NLP identify the problem of sampling bias in the samples acquired by uncertainty-based querying and develop cos… ▽ More The exploding cost and time needed for data labeling and model training are bottlenecks for training DNN models on large datasets. Identifying smaller representative data samples with strategies like active learning can help mitigate such bottlenecks. Previous works on active learning in NLP identify the problem of sampling bias in the samples acquired by uncertainty-based querying and develop costly approaches to address it. Using a large empirical study, we demonstrate that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices (query size and strategies) unlike that suggested by traditional literature. We also show that FTZ based query strategy produces sample sets similar to those from more sophisticated approaches (e.g ensemble networks). Finally, we show the effectiveness of the selected samples by creating tiny high-quality datasets, and utilizing them for fast and cheap training of large models. Based on the above, we propose a simple baseline for deep active text classification that outperforms the state-of-the-art. We expect the presented work to be useful and informative for dataset compression and for problems involving active, semi-supervised or online learning scenarios. Code and models are available at: https://github.com/drimpossible/Sampling-Bias-Active-Learning △ Less

Submitted 20 September, 2019; originally announced September 2019.

Comments: Accepted at EMNLP 2019

arXiv:1906.02256 [pdf, other]

Butterfly Transform: An Efficient FFT Based Neural Architecture Design

Authors: Keivan Alizadeh Vahid, Anish Prabhu, Ali Farhadi, Mohammad Rastegari

Abstract: In this paper, we show that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs. Pointwise convolutions, which we refer to as channel fusions, are the main computational bottleneck in the state-of-the-art efficient CNNs (e.g. MobileNets ). We introduce a set of criteria for chann… ▽ More In this paper, we show that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs. Pointwise convolutions, which we refer to as channel fusions, are the main computational bottleneck in the state-of-the-art efficient CNNs (e.g. MobileNets ). We introduce a set of criteria for channel fusion and prove that BFT yields an asymptotically optimal FLOP count with respect to these criteria. By replacing pointwise convolutions with BFT, we reduce the computational complexity of these layers from O(n^2) to O(n\log n) with respect to the number of channels. Our experimental evaluations show that our method results in significant accuracy gains across a wide range of network architectures, especially at low FLOP ranges. For example, BFT results in up to a 6.75% absolute Top-1 improvement for MobileNetV1, 4.4 \% for ShuffleNet V2 and 5.4% for MobileNetV3 on ImageNet under a similar number of FLOPS. Notably, ShuffleNet-V2+BFT outperforms state-of-the-art architecture search methods MNasNet, FBNet and MobilenetV3 in the low FLOP regime. △ Less

Submitted 16 April, 2020; v1 submitted 5 June, 2019; originally announced June 2019.

arXiv:1809.05839 [pdf, other]

A Generic Multi-modal Dynamic Gesture Recognition System using Machine Learning

Authors: Gautham Krishna G, Karthik Subramanian Nathan, Yogesh Kumar B, Ankith A Prabhu, Ajay Kannan, Vineeth Vijayaraghavan

Abstract: Human computer interaction facilitates intelligent communication between humans and computers, in which gesture recognition plays a prominent role. This paper proposes a machine learning system to identify dynamic gestures using tri-axial acceleration data acquired from two public datasets. These datasets, uWave and Sony, were acquired using accelerometers embedded in Wii remotes and smartwatches,… ▽ More Human computer interaction facilitates intelligent communication between humans and computers, in which gesture recognition plays a prominent role. This paper proposes a machine learning system to identify dynamic gestures using tri-axial acceleration data acquired from two public datasets. These datasets, uWave and Sony, were acquired using accelerometers embedded in Wii remotes and smartwatches, respectively. A dynamic gesture signed by the user is characterized by a generic set of features extracted across time and frequency domains. The system was analyzed from an end-user perspective and was modelled to operate in three modes. The modes of operation determine the subsets of data to be used for training and testing the system. From an initial set of seven classifiers, three were chosen to evaluate each dataset across all modes rendering the system towards mode-neutrality and dataset-independence. The proposed system is able to classify gestures performed at varying speeds with minimum preprocessing, making it computationally efficient. Moreover, this system was found to run on a low-cost embedded platform - Raspberry Pi Zero (USD 5), making it economically viable. △ Less

Submitted 16 September, 2018; originally announced September 2018.

Comments: Accepted at IEEE Future of Information and Communications Conference (FICC 2018)

arXiv:1804.03867 [pdf, other]

Hybrid Binary Networks: Optimizing for Accuracy, Efficiency and Memory

Authors: Ameya Prabhu, Vishal Batchu, Rohit Gajawada, Sri Aurobindo Munagala, Anoop Namboodiri

Abstract: Binarization is an extreme network compression approach that provides large computational speedups along with energy and memory savings, albeit at significant accuracy costs. We investigate the question of where to binarize inputs at layer-level granularity and show that selectively binarizing the inputs to specific layers in the network could lead to significant improvements in accuracy while pre… ▽ More Binarization is an extreme network compression approach that provides large computational speedups along with energy and memory savings, albeit at significant accuracy costs. We investigate the question of where to binarize inputs at layer-level granularity and show that selectively binarizing the inputs to specific layers in the network could lead to significant improvements in accuracy while preserving most of the advantages of binarization. We analyze the binarization tradeoff using a metric that jointly models the input binarization-error and computational cost and introduce an efficient algorithm to select layers whose inputs are to be binarized. Practical guidelines based on insights obtained from applying the algorithm to a variety of models are discussed. Experiments on Imagenet dataset using AlexNet and ResNet-18 models show 3-4% improvements in accuracy over fully binarized networks with minimal impact on compression and computational speed. The improvements are even more substantial on sketch datasets like TU-Berlin, where we match state-of-the-art accuracy as well, getting over 8% increase in accuracies. We further show that our approach can be applied in tandem with other forms of compression that deal with individual layers or overall model compression (e.g., SqueezeNets). Unlike previous quantization approaches, we are able to binarize the weights in the last layers of a network, which often have a large number of parameters, resulting in significant improvement in accuracy over fully binarized models. △ Less

Submitted 11 April, 2018; originally announced April 2018.

Comments: Accepted in WACV'18 (Oral)

arXiv:1804.02941 [pdf, other]

Distribution-Aware Binarization of Neural Networks for Sketch Recognition

Authors: Ameya Prabhu, Vishal Batchu, Sri Aurobindo Munagala, Rohit Gajawada, Anoop Namboodiri

Abstract: Deep neural networks are highly effective at a range of computational tasks. However, they tend to be computationally expensive, especially in vision-related problems, and also have large memory requirements. One of the most effective methods to achieve significant improvements in computational/spatial efficiency is to binarize the weights and activations in a network. However, naive binarization… ▽ More Deep neural networks are highly effective at a range of computational tasks. However, they tend to be computationally expensive, especially in vision-related problems, and also have large memory requirements. One of the most effective methods to achieve significant improvements in computational/spatial efficiency is to binarize the weights and activations in a network. However, naive binarization results in accuracy drops when applied to networks for most tasks. In this work, we present a highly generalized, distribution-aware approach to binarizing deep networks that allows us to retain the advantages of a binarized network, while reducing accuracy drops. We also develop efficient implementations for our proposed approach across different architectures. We present a theoretical analysis of the technique to show the effective representational power of the resulting layers, and explore the forms of data they model best. Experiments on popular datasets show that our technique offers better accuracies than naive binarization, while retaining the same benefits that binarization provides - with respect to run-time compression, reduction of computational costs, and power consumption. △ Less

Submitted 9 April, 2018; originally announced April 2018.

Comments: Accepted at WACV '18 (Oral)

arXiv:1711.08757 [pdf, other]

Deep Expander Networks: Efficient Deep Networks from Graph Theory

Authors: Ameya Prabhu, Girish Varma, Anoop Namboodiri

Abstract: Efficient CNN designs like ResNets and DenseNet were proposed to improve accuracy vs efficiency trade-offs. They essentially increased the connectivity, allowing efficient information flow across layers. Inspired by these techniques, we propose to model connections between filters of a CNN using graphs which are simultaneously sparse and well connected. Sparsity results in efficiency while well co… ▽ More Efficient CNN designs like ResNets and DenseNet were proposed to improve accuracy vs efficiency trade-offs. They essentially increased the connectivity, allowing efficient information flow across layers. Inspired by these techniques, we propose to model connections between filters of a CNN using graphs which are simultaneously sparse and well connected. Sparsity results in efficiency while well connectedness can preserve the expressive power of the CNNs. We use a well-studied class of graphs from theoretical computer science that satisfies these properties known as Expander graphs. Expander graphs are used to model connections between filters in CNNs to design networks called X-Nets. We present two guarantees on the connectivity of X-Nets: Each node influences every node in a layer in logarithmic steps, and the number of paths between two sets of nodes is proportional to the product of their sizes. We also propose efficient training and inference algorithms, making it possible to train deeper and wider X-Nets effectively. Expander based models give a 4% improvement in accuracy on MobileNet over grouped convolutions, a popular technique, which has the same sparsity but worse connectivity. X-Nets give better performance trade-offs than the original ResNet and DenseNet-BC architectures. We achieve model sizes comparable to state-of-the-art pruning techniques using our simple architecture design, without any pruning. We hope that this work motivates other approaches to utilize results from graph theory to develop efficient network architectures. △ Less

Submitted 26 July, 2018; v1 submitted 23 November, 2017; originally announced November 2017.

Comments: ECCV'18

arXiv:1611.00472 [pdf, other]

Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text

Authors: Ameya Prabhu, Aditya Joshi, Manish Shrivastava, Vasudeva Varma

Abstract: Sentiment analysis (SA) using code-mixed data from social media has several applications in opinion mining ranging from customer satisfaction to social campaign analysis in multilingual societies. Advances in this area are impeded by the lack of a suitable annotated dataset. We introduce a Hindi-English (Hi-En) code-mixed dataset for sentiment analysis and perform empirical analysis comparing the… ▽ More Sentiment analysis (SA) using code-mixed data from social media has several applications in opinion mining ranging from customer satisfaction to social campaign analysis in multilingual societies. Advances in this area are impeded by the lack of a suitable annotated dataset. We introduce a Hindi-English (Hi-En) code-mixed dataset for sentiment analysis and perform empirical analysis comparing the suitability and performance of various state-of-the-art SA methods in social media. In this paper, we introduce learning sub-word level representations in LSTM (Subword-LSTM) architecture instead of character-level or word-level representations. This linguistic prior in our architecture enables us to learn the information about sentiment value of important morphemes. This also seems to work well in highly noisy text containing misspellings as shown in our experiments which is demonstrated in morpheme-level feature maps learned by our model. Also, we hypothesize that encoding this linguistic prior in the Subword-LSTM architecture leads to the superior performance. Our system attains accuracy 4-5% greater than traditional approaches on our dataset, and also outperforms the available system for sentiment analysis in Hi-En code-mixed text by 18%. △ Less

Submitted 2 November, 2016; originally announced November 2016.

Comments: Accepted paper at COLING 2016

arXiv:1610.09756 [pdf, other]

Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity

Authors: Vinayak Athavale, Shreenivas Bharadwaj, Monik Pamecha, Ameya Prabhu, Manish Shrivastava

Abstract: In this paper we describe an end to end Neural Model for Named Entity Recognition NER) which is based on Bi-Directional RNN-LSTM. Almost all NER systems for Hindi use Language Specific features and handcrafted rules with gazetteers. Our model is language independent and uses no domain specific features or any handcrafted rules. Our models rely on semantic information in the form of word vectors wh… ▽ More In this paper we describe an end to end Neural Model for Named Entity Recognition NER) which is based on Bi-Directional RNN-LSTM. Almost all NER systems for Hindi use Language Specific features and handcrafted rules with gazetteers. Our model is language independent and uses no domain specific features or any handcrafted rules. Our models rely on semantic information in the form of word vectors which are learnt by an unsupervised learning algorithm on an unannotated corpus. Our model attained state of the art performance in both English and Hindi without the use of any morphological analysis or without using gazetteers of any sort. △ Less

Submitted 16 November, 2016; v1 submitted 30 October, 2016; originally announced October 2016.

Comments: 7 pages

Report number: https://aclweb.org/anthology/W/W16/W16-6320.pdf

Showing 1–49 of 49 results for author: Prabhu, A