Search | arXiv e-print repository

LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning

Authors: Shixiong Qi, K. K. Ramakrishnan, Myungjin Lee

Abstract: Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual heavy-weight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism while minimizing the aggregation time and resource consumption. Our experimental results show that LIFL achieves significant improvement in resource efficiency and aggregation speed for supporting FL at scale, compared to existing serverful and serverless FL systems.

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2401.00057

Generalization properties of contrastive world models

Authors: Kandan Ramakrishnan, R. James Cotton, Xaq Pitkow, Andreas S. Tolias

Abstract: Recent work on object-centric world models aims to factorize representations in terms of objects in a completely unsupervised or self-supervised manner. Such world models are hypothesized to be a key component to address the generalization problem. While self-supervision has shown improved performance, OOD generalization has not been systematically and explicitly tested. In this paper, we conduct an extensive study on the generalization properties of contrastive world models. We systematically test the model under a number of different OOD generalization scenarios such as extrapolation to new object attributes, introducing new conjunctions or new attributes. Our experiments show that the contrastive world model fails to generalize under the different OOD tests and the drop in performance depends on the extent to which the samples are OOD. When visualizing the transition updates and convolutional feature maps, we observe that any changes in object attributes (such as previously unseen colors, shapes, or conjunctions of color and shape) break down the factorization of object representations. Overall, our work highlights the importance of object-centric representations for generalization and shows that current models are limited in their capacity to learn such representations required for human-level generalization.

Submitted 29 December, 2023; originally announced January 2024.

Comments: Accepted at the NeurIPS 2023 Workshop: Self-Supervised Learning - Theory and Practice

arXiv:2311.18259

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/

Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

arXiv:2307.08763

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman

Abstract: Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.

Submitted 29 October, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: NeurIPS 2023

arXiv:2307.06385

Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization

Authors: Kalyan Ramakrishnan

Abstract: Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying audio-visual events, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than at the video level and re-train the model with these labels. That is, we determine the subset of labels for each slice of frames in a training video by (i) replacing the frames outside the slice with those from a second video having no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels as desired. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and improves performance on a related weakly-supervised task as well.

Submitted 19 July, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

arXiv:2306.15850

SpotEM: Efficient Video Search for Episodic Memory

Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

Abstract: The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% - 25% of the clip features, we preserve 84% - 97% of the original EM model's accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem

Submitted 27 June, 2023; originally announced June 2023.

Comments: Published in ICML 2023

arXiv:2306.09324

Single-Stage Visual Query Localization in Egocentric Videos

Authors: Hanwen Jiang, Santhosh Kumar Ramakrishnan, Kristen Grauman

Abstract: Visual Query Localization on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital to build episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is independently trained and the complexity of the pipeline results in slow inference speeds. We propose VQLoC, a novel single-stage VQL framework that is end-to-end trainable. Our key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single shot manner. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame and frame-to-frame correspondences between nearby video frames. Our experiments demonstrate that our approach outperforms prior VQL methods by 20% accuracy while obtaining a 10x improvement in inference speed. VQLoC is also the top entry on the Ego4D VQ2D challenge leaderboard. Project page: https://hwjiang1510.github.io/VQLoC/

Submitted 15 June, 2023; originally announced June 2023.

Comments: Winner of Ego4D VQ2D challenge 2023

arXiv:2304.13541

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

Authors: Aditya Dhakal, Sameer G. Kulkarni, K. K. Ramakrishnan

Abstract: Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Although spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, there remain a number of challenges. Finding the GPU percentage for right-sizing the GPU for each DNN through profiling, determining an optimal batching of requests to balance throughput improvement while meeting application-specific deadlines and service level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs are still significant challenges. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework. We demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel. D-STACK gets higher than 90 percent throughput and GPU utilization compared to the ideal scheduler. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.

Submitted 31 March, 2023; originally announced April 2023.

arXiv:2303.04404

MiddleNet: A Unified, High-Performance NFV and Middlebox Framework with eBPF and DPDK

Authors: Shixiong Qi, Ziteng Zeng, Leslie Monis, K. K. Ramakrishnan

Abstract: Traditional network resident functions (e.g., firewalls, network address translation) and middleboxes (caches, load balancers) have moved from purpose-built appliances to software-based components. However, L2/L3 network functions (NFs) are being implemented on Network Function Virtualization (NFV) platforms that extensively exploit kernel-bypass technology. They often use DPDK for zero-copy delivery and high performance. On the other hand, L4/L7 middleboxes, which have a greater emphasis on functionality, take advantage of a full-fledged kernel-based system. L2/L3 NFs and L4/L7 middleboxes continue to be handled by distinct platforms on different nodes. This paper proposes MiddleNet that develops a unified network resident function framework that supports L2/L3 NFs and L4/L7 middleboxes. MiddleNet supports function chains that are essential in both NFV and middlebox environments. MiddleNet uses the Data Plane Development Kit (DPDK) library for zero-copy packet delivery without interrupt-based processing, to enable the "bump-in-the-wire" L2/L3 processing performance required of NFV. To support L4/L7 middlebox functionality, MiddleNet utilizes a consolidated, kernel-based protocol stack for processing, avoiding a dedicated protocol stack for each function. MiddleNet fully exploits the event-driven capabilities of the extended Berkeley Packet Filter (eBPF) and seamlessly integrates it with shared memory for high-performance communication in L4/L7 middlebox function chains. The overheads for MiddleNet in L4/L7 are strictly load-proportional, without needing the dedicated CPU cores of DPDK-based approaches. MiddleNet supports flow-dependent packet processing by leveraging Single Root I/O Virtualization (SR-IOV) to dynamically select the packet processing needed (Layers 2 - 7). Our experimental results show that MiddleNet achieves high performance in such a unified environment.

Submitted 30 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2301.07799

doi 10.1016/j.neunet.2023.01.007

A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

Authors: Megan M. Baker, Alexander New, Mario Aguilar-Simon, Ziad Al-Halah, Sébastien M. R. Arnold, Ese Ben-Iwhiwhu, Andrew P. Brna, Ethan Brooks, Ryan C. Brown, Zachary Daniels, Anurag Daram, Fabien Delattre, Ryan Dellana, Eric Eaton, Haotian Fu, Kristen Grauman, Jesse Hostetler, Shariq Iqbal, Cassandra Kent, Nicholas Ketz, Soheil Kolouri, George Konidaris, Dhireesha Kudithipudi, Erik Learned-Miller, Seungwon Lee, et al. (22 additional authors not shown)

Abstract: Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.

Submitted 18 January, 2023; originally announced January 2023.

Comments: To appear in Neural Networks

arXiv:2301.00746

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

Abstract: Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories. Code and models: http://vision.cs.utexas.edu/projects/naq

Submitted 25 March, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: 13 pages, 7 figures, appearing in CVPR 2023

arXiv:2210.05633

Habitat-Matterport 3D Semantics Dataset

Authors: Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, Alexander William Clegg, Devendra Singh Chaplot

Abstract: We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior datasets. A key difference setting apart HM3DSEM from other datasets is the use of texture information to annotate pixel-accurate object boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object Goal Navigation task using different methods. Policies trained using HM3DSEM outperform those trained on prior datasets. Introduction of HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation from 400 submissions in 2021 to 1022 submissions in 2022.

Submitted 12 October, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: 15 Pages, 11 Figures, 6 Tables

arXiv:2209.10001

Building Flexible, Low-Cost Wireless Access Networks With Magma

Authors: Shaddi Hasan, Amar Padmanabhan, Bruce Davie, Jennifer Rexford, Ulas Kozat, Hunter Gatewood, Shruti Sanadhya, Nick Yurchenko, Tariq Al-Khasib, Oriol Batalla, Marie Bremner, Andrei Lee, Evgeniy Makeev, Scott Moeller, Alex Rodriguez, Pravin Shelar, Karthik Subraveti, Sudarshan Kandi, Alejandro Xoconostle, Praveen Kumar Ramakrishnan, Xiaochen Tian, Anoop Tomar

Abstract: Billions of people remain without Internet access due to availability or affordability of service. In this paper, we present Magma, an open and flexible system for building low-cost wireless access networks. Magma aims to connect users where operator economics are difficult due to issues such as low population density or income levels, while preserving features expected in cellular networks such as authentication and billing policies. To achieve this, and in contrast to traditional cellular networks, Magma adopts an approach that extensively leverages Internet design patterns, terminating access network-specific protocols at the edge and abstracting the access network from the core architecture. This decision allows Magma to refactor the wireless core using SDN (software-defined networking) principles and leverage other techniques from modern distributed systems. In doing so, Magma lowers cost and operational complexity for network operators while achieving resilience, scalability, and rich policy support.

Submitted 20 September, 2022; originally announced September 2022.

Comments: 15 pages, 10 figures, to be published in the 20th USENIX Symposium on Networked Systems Design and Implementation (2023), source code available at https://github.com/magma/magma

arXiv:2207.11365

EgoEnv: Human-centric environment representations from egocentric video

Authors: Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

Abstract: First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge. Project page: https://vision.cs.utexas.edu/projects/ego-env/

Submitted 9 November, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

Comments: Published in NeurIPS 2023 (Oral)

arXiv:2206.15383

doi 10.1007/s41683-023-00115-1

Integrated Photonic Platforms for Quantum Technology: A Review

Authors: Rohit K Ramakrishnan, Aravinth Balaji Ravichandran, Arpita Mishra, Archana Kaushalram, Gopalkrishna Hegde, Srinivas Talabattula, Peter P Rohde

Abstract: Quantum information processing has conceptually changed the way we process and transmit information. Quantum physics, which explains the strange behaviour of matter at the microscopic dimensions, has matured into a quantum technology that can harness this strange behaviour for technological applications with far-reaching consequences, which uses quantum bits (qubits) for information processing. Experiments suggest that photons are the most successful candidates for realising qubits, which indicates that integrated photonic platforms will play a crucial role in realising quantum technology. This paper surveys the various photonic platforms based on different materials for quantum information processing. The future of this technology depends on the successful materials that can be used to universally realise quantum devices, similar to silicon, which shaped the industry towards the end of the last century. Though a prediction is implausible at this point, we provide an overview of the current status of research on the platforms based on various materials.

Submitted 30 June, 2022; originally announced June 2022.

Comments: 48 pages, 3 figures

arXiv:2206.15376

doi 10.1007/s41745-022-00336-7

The Quantum Internet: A Hardware Review

Authors: Rohit K. Ramakrishnan, Aravinth Balaji Ravichandran, Ishwar Kaushik, Gopalkrishna Hegde, Srinivas Talabattula, Peter P. Rohde

Abstract: In the century following its discovery, applications for quantum physics are opening a new world of technological possibilities. With the current decade witnessing quantum supremacy, quantum technologies are already starting to change the ways information is generated, transmitted, stored and processed. The next major milestone in quantum technology is already rapidly emerging -- the quantum internet. Since light is the most logical candidate for quantum communication, quantum photonics is a critical enabling technology. This paper reviews the hardware aspects of the quantum internet, mainly from a photonics perspective. Though a plethora of quantum technologies and devices have emerged in recent years, we are more focused on devices or components that may enable the quantum internet. Our approach is primarily qualitative, providing a broad overview of the necessary technologies for a large-scale quantum internet.

Submitted 1 June, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

Comments: 38 pages, 1 table

arXiv:2205.14836 [pdf, other]

doi 10.1021/acs.jctc.3c00114

Chemical bonding in large systems using projected population analysis from real-space density functional theory calculations

Authors: Kartick Ramakrishnan, Sai Krishna Kishore Nori, Seung-Cheol Lee, Gour P Das, Satadeep Bhattacharjee, Phani Motamarri

Abstract: We present an efficient and scalable computational approach for conducting projected population analysis from real-space finite-element (FE) based Kohn-Sham density functional theory calculations (DFT-FE). This work provides an important direction towards extracting chemical bonding information from large-scale DFT calculations on materials systems involving thousands of atoms while accommodating periodic, semi-periodic or fully non-periodic boundary conditions. Towards this, we derive the relevant mathematical expressions and develop efficient numerical implementation procedures that are scalable on multi-node CPU architectures to compute the projected overlap and Hamilton populations. The population analysis is accomplished by projecting either the self-consistently converged FE discretized Kohn-Sham orbitals, or the FE discretized Hamiltonian onto a subspace spanned by a localized atom-centred basis set. The proposed methods are implemented in a unified framework within DFT-FE code where the ground-state DFT calculations and the population analysis are performed on the same FE grid. We further benchmark the accuracy and performance of this approach on representative material systems involving periodic and non-periodic DFT calculations with LOBSTER, a widely used projected population analysis code. Finally, we discuss a case study demonstrating the advantages of our scalable approach to extract the quantitative chemical bonding information of hydrogen chemisorbed in large silicon nanoparticles alloyed with carbon, a candidate material for hydrogen storage.

Submitted 23 June, 2023; v1 submitted 29 May, 2022; originally announced May 2022.

Comments: 24 Figures, 6 Tables, 57 pages with references and supplementary information

arXiv:2204.03729 [pdf]

Martensitic transformation in V_3Si single crystal: ^51V NMR evidence for coexistence of cubic and tetragonal phases

Authors: A. A. Gapud, S. K. Ramakrishnan, E. L. Green, A. P. Reyes

Abstract: The Martensitic transformation (MT) in A15 binary-alloy superconductor V_3Si, though studied extensively, has not yet been conclusively linked with a transition to superconductivity. Previous NMR studies have mainly been on powder samples and with little emphasis on temperature dependence during the transformation. Here we study a high-quality single crystal, where quadrupolar splitting of NMR spectra for ^51V allowed us to distinguish between spectra from transverse chains of V as a function of temperature. Our data revealed that (1) the MT is not abrupt, but rather there is a microscopic coexistence of pre-transformed cubic phase and transformed tetragonal phase over a few K below and above Tm, while (2) no pre-transformed phase can be found at Tc, and (3) the Martensitic lengthening of one axis occurs predominantly in a plane perpendicular to the crystal growth axis, as twinned domains.

Submitted 7 June, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Revised manuscript submitted 3 June 2022 to Physica C

arXiv:2203.05492 [pdf, other]

An Empirical Study of Low Precision Quantization for TinyML

Authors: Shaojie Zhuo, Hongyu Chen, Ramchalam Kinattinkara Ramakrishnan, Tommy Chen, Chen Feng, Yicheng Lin, Parker Zhang, Liang Shen

Abstract: Tiny machine learning (tinyML) has emerged during the past few years aiming to deploy machine learning models to embedded AI processors with highly constrained memory and computation capacity. Low precision quantization is an important model compression technique that can greatly reduce both memory consumption and computation cost of model inference. In this study, we focus on post-training quantization (PTQ) algorithms that quantize a model to low-bit (less than 8-bit) precision with only a small set of calibration data and benchmark them on different tinyML use cases. To achieve a fair comparison, we build a simulated quantization framework to investigate recent PTQ algorithms. Furthermore, we break down those algorithms into essential components and re-assemble a generic PTQ pipeline. With an ablation study on different alternatives of components in the pipeline, we reveal key design choices when performing low precision quantization. We hope this work could provide useful data points and shed light on the future research of low precision quantization.

Submitted 10 March, 2022; originally announced March 2022.

Comments: tinyML Research Symposium 2022

arXiv:2202.02440 [pdf, other]

Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

Authors: Ziad Al-Halah, Santhosh K. Ramakrishnan, Kristen Grauman

Abstract: In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.

Submitted 28 April, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: CVPR 2022. Project page: https://vision.cs.utexas.edu/projects/zsel/

arXiv:2201.10029 [pdf, other]

PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

Authors: Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman

Abstract: State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'. Our key insight is that 'where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state-of-the-art for ObjectGoal navigation while incurring up to 1,600x less computational cost for training. Code and pre-trained models are available: https://vision.cs.utexas.edu/projects/poni/

Submitted 17 June, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: 8 pages + supplementary. Accepted in CVPR 2022

arXiv:2110.07058 [pdf, other]

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

arXiv:2109.08238 [pdf, other]

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Authors: Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, Dhruv Batra

Abstract: We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 - 3.7x larger than other building-scale datasets such as MP3D and Gibson. When compared to existing photorealistic 3D datasets such as Replica, MP3D, Gibson, and ScanNet, images rendered from HM3D have 20 - 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 - 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impact the performance of embodied AI agents trained using it. In fact, we find that HM3D is 'Pareto optimal' in the following sense -- agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on the Gibson-test dataset, suggesting that it might be time to retire that episode dataset.

Submitted 16 September, 2021; originally announced September 2021.

Comments: 21 pages, 14 figures

arXiv:2106.03601 [pdf, other]

doi 10.18293/SEKE2021-129

Analyzing Open-Source Serverless Platforms: Characteristics and Performance

Authors: Junfeng Li, Sameer G. Kulkarni, K. K. Ramakrishnan, Dan Li

Abstract: Serverless computing is increasingly popular because of its lower cost and easier deployment. Several cloud service providers (CSPs) offer serverless computing on their public clouds, but this may bring the risk of vendor lock-in. To avoid this limitation, many open-source serverless platforms have emerged to allow developers to freely deploy and manage functions on self-hosted clouds. However, building effective functions requires much expertise and thorough comprehension of platform frameworks and features that affect performance. It is a challenge for a service developer to differentiate and select the appropriate serverless platform for different demands and scenarios. Thus, we elaborate the frameworks and event processing models of four popular open-source serverless platforms and identify their salient idiosyncrasies. We analyze the root causes of performance differences between different service exporting and auto-scaling modes on those platforms. Further, we provide several insights for future work, such as auto-scaling and metric collection.

Submitted 4 June, 2021; originally announced June 2021.

arXiv:2102.06185 [pdf, other]

Zeoco: An insight into daily carbon footprint consumption

Authors: Karthik Ramakrishnan, Gokul P, Preet Batavia, Shreesh Tripathi

Abstract: Climate change, which is now considered one of the biggest threats to humanity, is also the reason behind various other environmental concerns. Continued negligence might lead us to an irreparably damaged environment. After the partial failure of the Paris Agreement, it is quite evident that we as individuals need to come together to bring about a change on a large scale to have a significant impact. This paper discusses our approach towards obtaining a realistic measure of the carbon footprint index being consumed by a user through day-to-day activities performed via a smartphone app and offering incentives in weekly and monthly leaderboard rankings along with a reward system. The app helps ease decision making on tasks like travel, shopping, and electricity consumption, and offers a different and rather numerical perspective on daily choices.

Submitted 11 February, 2021; originally announced February 2021.

Comments: 4 Pages, 2 Figures(Flowcharts)

ACM Class: D.2.4

arXiv:2102.02337 [pdf, other]

Environment Predictive Coding for Embodied Agents

Authors: Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

Abstract: We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out portions of an agent's trajectory and predict them from the unmasked portions, conditioned on the agent's camera poses. By learning such representations on a collection of videos, we demonstrate successful transfer to multiple downstream navigation-oriented tasks. Our experiments on the photorealistic 3D environments of Gibson and Matterport3D show that our method outperforms the state-of-the-art on challenging tasks with only a limited budget of experience.

Submitted 3 February, 2021; originally announced February 2021.

Comments: 9 pages, 6 figures, appendix

arXiv:2011.10608 [pdf, other]

Large Scale Neural Architecture Search with Polyharmonic Splines

Authors: Ulrich Finkler, Michele Merler, Rameswar Panda, Mayoore S. Jaiswal, Hui Wu, Kandan Ramakrishnan, Chun-Fu Chen, Minsik Cho, David Kung, Rogerio Feris, Bishwaranjan Bhattacharjee

Abstract: Neural Architecture Search (NAS) is a powerful tool to automatically design deep neural networks for many tasks, including image classification. Due to the significant computational burden of the search phase, most NAS methods have focused so far on small, balanced datasets. All attempts at conducting NAS at large scale have employed small proxy sets, and then transferred the learned architectures to larger datasets by replicating or stacking the searched cells. We propose a NAS method based on polyharmonic splines that can perform search directly on large scale, imbalanced target datasets. We demonstrate the effectiveness of our method on the ImageNet22K benchmark [16], which contains 14 million images distributed in a highly imbalanced manner over 21,841 categories. By exploring the search space of the ResNet [23] and Big-Little Net ResNext [11] architectures directly on ImageNet22K, our polyharmonic splines NAS method designed a model which achieved a top-1 accuracy of 40.03% on ImageNet22K, an absolute improvement of 3.13% over the state of the art with similar global batch size [15].

Submitted 20 November, 2020; originally announced November 2020.

arXiv:2010.11757 [pdf, ps, other]

Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

Authors: Chun-Fu Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John Cohn, Aude Oliva, Quanfu Fan

Abstract: In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNN) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out in-depth comparative analysis to better understand the differences between these approaches and the progress made by them. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for fair comparison. We then conduct an effort towards a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap is made in efficiency for action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our codes are available at https://github.com/IBM/action-recognition-pytorch.

Submitted 29 March, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: CVPR 2021 camera-ready version. Codes and models are available on https://github.com/IBM/action-recognition-pytorch

arXiv:2008.13453 [pdf, other]

doi 10.1109/TNET.2021.3132279

CoShare: An Efficient Approach for Redundancy Allocation in NFV

Authors: Yordanos Tibebu Woldeyohannes, Besmir Tola, Yuming Jiang, K. K. Ramakrishnan

Abstract: An appealing feature of Network Function Virtualization (NFV) is that in an NFV-based network, a network function (NF) instance may be placed at any node. On the one hand this offers great flexibility in allocation of redundant instances, but on the other hand it makes the allocation a unique and difficult challenge. One particular concern is that there is inherent correlation among nodes due to the structure of the network, thus requiring special care in this allocation. To this end, we propose a novel approach called CoShare. Firstly, its design takes into consideration the effect of network structural dependency, which might result in the unavailability of nodes of a network after failure of a node. Secondly, to efficiently make use of resources, CoShare proposes the idea of shared reservation, where multiple flows may be allowed to share the same reserved backup capacity at an NF instance. Furthermore, CoShare factors in the heterogeneity in nodes, NF instances and availability requirements of flows in the design. The results from a number of experiments conducted using realistic network topologies show that the integration of structural dependency allows meeting availability requirements for more flows compared to a baseline approach. Specifically, CoShare is able to meet diverse availability requirements in a resource-efficient manner, requiring up to 85% less resource overbuild in some studied cases than the baseline approach, which uses the idea of dedicated reservation commonly adopted for redundancy allocation in NFV.

Submitted 22 November, 2021; v1 submitted 31 August, 2020; originally announced August 2020.

Journal ref: IEEE/ACM Transactions on Networking, early access 2021

arXiv:2008.09622 [pdf, other]

Learning to Set Waypoints for Audio-Visual Navigation

Authors: Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman

Abstract: In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation. Project: http://vision.cs.utexas.edu/projects/audio_visual_waypoints.

Submitted 11 February, 2021; v1 submitted 21 August, 2020; originally announced August 2020.

Comments: Accepted to ICLR 2021

arXiv:2008.09285 [pdf, other]

Occupancy Anticipation for Efficient Exploration and Navigation

Authors: Santhosh K. Ramakrishnan, Ziad Al-Halah, Kristen Grauman

Abstract: State-of-the-art navigation methods leverage a spatial memory to generalize to new environments, but their occupancy maps are limited to capturing the geometric structures directly observed by the agent. We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions. In doing so, the agent builds its spatial awareness more rapidly, which facilitates efficient exploration and navigation in 3D environments. By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment, with performance significantly better than strong baselines. Furthermore, when deployed for the sequential decision-making tasks of exploration and navigation, our model outperforms state-of-the-art methods on the Gibson and Matterport3D datasets. Our approach is the winning entry in the 2020 Habitat PointNav Challenge. Project page: http://vision.cs.utexas.edu/projects/occupancy_anticipation/

Submitted 25 August, 2020; v1 submitted 20 August, 2020; originally announced August 2020.

Comments: Accepted in ECCV 2020. 19 pages, 6 figures, appendix at end

arXiv:2008.03602 [pdf, other]

Spatial Sharing of GPU for Autotuning DNN models

Authors: Aditya Dhakal, Junguk Cho, Sameer G. Kulkarni, K. K. Ramakrishnan, Puneet Sharma

Abstract: GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of a GPU enables multiplexing several DNNs on the GPU and can improve GPU utilization, thus improving throughput and lowering latency. DNN models given just the right amount of GPU resources can still provide inference latency as low as when all of the GPU is dedicated to their inference task. An approach to improve DNN inference is tuning of the DNN model. Autotuning frameworks find the optimal low-level implementation for a certain target device based on the trained machine learning model, thus reducing the DNN's inference latency and increasing inference throughput. We observe an interdependency between the tuned model and its inference latency. A DNN model tuned with specific GPU resources provides the best inference latency when inferred with close to the same amount of GPU resources. In contrast, a model tuned with the maximum amount of the GPU's resources has poorer inference latency once the GPU resources are limited for inference. On the other hand, a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the causes that impact the tuning of a model at different amounts of GPU resources. We present many techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of the GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by up to 75 percent and increase throughput by a factor of 5.

Submitted 8 August, 2020; originally announced August 2020.

arXiv:2006.13314 [pdf, other]

NASTransfer: Analyzing Architecture Transferability in Large Scale Neural Architecture Search

Authors: Rameswar Panda, Michele Merler, Mayoore Jaiswal, Hui Wu, Kandan Ramakrishnan, Ulrich Finkler, Chun-Fu Chen, Minsik Cho, David Kung, Rogerio Feris, Bishwaranjan Bhattacharjee

Abstract: Neural Architecture Search (NAS) is an open and challenging problem in machine learning. While NAS offers great promise, the prohibitive computational demand of most of the existing NAS methods makes it difficult to directly search the architectures on large-scale tasks. The typical way of conducting large-scale NAS is to search for an architectural building block on a small dataset (either using a proxy set from the large dataset or a completely different small-scale dataset) and then transfer the block to a larger dataset. Despite a number of recent results that show the promise of transfer from proxy datasets, a comprehensive evaluation of different NAS methods studying the impact of different source datasets has not yet been addressed. In this work, we propose to analyze the architecture transferability of different NAS methods by performing a series of experiments on large-scale benchmarks such as ImageNet1K and ImageNet22K. We find that: (i) The size and domain of the proxy set do not seem to influence architecture performance on the target dataset. On average, the transfer performance of architectures searched using completely different small datasets (e.g., CIFAR10) is similar to that of architectures searched directly on proxy target datasets. However, the design of proxy sets has considerable impact on the rankings of different NAS methods. (ii) While different NAS methods show similar performance on a source dataset (e.g., CIFAR10), they differ significantly in transfer performance to a large dataset (e.g., ImageNet1K). (iii) Even on large datasets, the random sampling baseline is very competitive, but the choice of the appropriate combination of proxy set and search strategy can provide significant improvement over it. We believe that our extensive empirical analysis will prove useful for future design of NAS algorithms.

Submitted 11 February, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

Comments: 19 pages, 19 Figures, 6 Tables

MSC Class: 68T05 ACM Class: I.2.6; I.4

arXiv:2004.08320 [pdf]

A Computational Model of Levodopa-Induced Toxicity in Substantia Nigra Pars Compacta in Parkinson's Disease

Authors: Vignayanandam R. Muddapu, Karthik Vijayakumar, Keerthiga Ramakrishnan, V Srinivasa Chakravarthy

Abstract: Parkinson's disease (PD) is caused by the progressive loss of dopaminergic cells in substantia nigra pars compacta (SNc). The root cause of this cell loss in PD is still not decisively elucidated. A recent line of thinking traces the cause of PD neurodegeneration to metabolic deficiency. Due to exceptionally high energy demand, SNc neurons exhibit a higher basal metabolic rate and higher oxygen consumption rate, which results in oxidative stress. Recently, we have suggested that the excitotoxic loss of SNc cells might be due to energy deficiency occurring at different levels of neural hierarchy. Levodopa (LDOPA), a precursor of dopamine, which is used as a symptom-relieving treatment for PD, leads to outcomes that are both positive and negative. Several researchers suggested that LDOPA might be harmful to SNc cells due to oxidative stress. The role of LDOPA in the course of PD pathogenesis is still debatable. We hypothesize that energy deficiency can lead to LDOPA-induced toxicity (LIT) in two ways: by promoting dopamine-induced oxidative stress and by exacerbating excitotoxicity in SNc. We present a multiscale computational model of the SNc-striatum system, which will help us in understanding the mechanism behind the neurodegeneration postulated above and provide insights for developing disease-modifying therapeutics. It was observed that SNc terminals are more vulnerable to energy deficiency than SNc somas. During LDOPA therapy, it was observed that higher LDOPA dosage results in increased loss of somas and terminals in SNc. It was also observed that co-administration of LDOPA and glutathione (antioxidant) evades LDOPA-induced toxicity in SNc neurons. We show that our proposed model was able to capture LDOPA-induced toxicity in SNc, caused by energy deficiency.

Submitted 1 April, 2020; originally announced April 2020.

arXiv:2001.02192 [pdf, other]

An Exploration of Embodied Visual Exploration

Authors: Santhosh K. Ramakrishnan, Dinesh Jayaraman, Kristen Grauman

Abstract: Embodied computer vision considers perception for robots in novel, unstructured environments. Of particular importance is the embodied visual exploration problem: how might a robot equipped with a camera scope out a new environment? Despite the progress thus far, many basic questions pertinent to this problem remain unanswered: (i) What does it mean for an agent to explore its environment well? (ii) Which methods work well, and under which assumptions and environmental settings? (iii) Where do current approaches fall short, and where might future work seek to improve? Seeking answers to these questions, we first present a taxonomy for existing visual exploration algorithms and create a standard framework for benchmarking them. We then perform a thorough empirical study of the four state-of-the-art paradigms using the proposed framework with two photorealistic simulated 3D environments, a state-of-the-art exploration architecture, and diverse evaluation metrics. Our experimental results offer insights and suggest new performance metrics and baselines for future work in visual exploration. Code, models and data are publicly available: https://github.com/facebookresearch/exploring_exploration

Submitted 20 August, 2020; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: 30 main + 21 appendix pages, 23 figures

arXiv:1911.07449 [pdf, other]

doi 10.1145/3366623.3368139

Understanding Open Source Serverless Platforms: Design Considerations and Performance

Authors: Junfeng Li, Sameer G. Kulkarni, K. K. Ramakrishnan, Dan Li

Abstract: Serverless computing is increasingly popular because of the promise of lower cost and the convenience it provides to users who do not need to focus on server management. This has resulted in the availability of a number of proprietary and open-source serverless solutions. We seek to understand how the performance of serverless computing depends on a number of design issues using several popular open-source serverless platforms. We identify the idiosyncrasies affecting performance (throughput and latency) for different open-source serverless platforms. Further, we observe that just having either resource-based (CPU and memory) or workload-based (request per second (RPS) or concurrent requests) auto-scaling is inadequate to address the needs of the serverless platforms.

Submitted 12 December, 2019; v1 submitted 18 November, 2019; originally announced November 2019.

Journal ref: Proceedings of the 5th International Workshop on Serverless Computing, Pages 37-42, 2019

arXiv:1911.00232 [pdf, other]

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

Authors: Mathew Monfort, Bowen Pan, Kandan Ramakrishnan, Alex Andonian, Barry A McNamara, Alex Lascelles, Quanfu Fan, Dan Gutfreund, Rogerio Feris, Aude Oliva

Abstract: Videos capture events that typically contain multiple sequential, and simultaneous, actions even in the span of only a few seconds. However, most large-scale datasets built to train models for action recognition in video only provide a single label per video. Consequently, models can be incorrectly penalized for classifying actions that exist in the videos but are not explicitly labeled and do not learn the full spectrum of information present in each video in training. Towards this goal, we present the Multi-Moments in Time dataset (M-MiT) which includes over two million action labels for over one million three second videos. This multi-label dataset introduces novel challenges on how to train and analyze models for multi-action detection. Here, we present baseline results for multi-action recognition using loss functions adapted for long tail multi-label learning, provide improved methods for visualizing and interpreting models trained for multi-label action detection and show the strength of transferring models trained on M-MiT to smaller datasets.

Submitted 27 September, 2021; v1 submitted 1 November, 2019; originally announced November 2019.

arXiv:1909.04567 [pdf, other]

Differentiable Mask for Pruning Convolutional and Recurrent Networks

Authors: Ramchalam Kinattinkara Ramakrishnan, Eyyüb Sari, Vahid Partovi Nia

Abstract: Pruning is one of the most effective model reduction techniques. Deep networks require massive computation and such models need to be compressed to bring them to edge devices. Most existing pruning techniques are focused on vision-based models like convolutional networks, while text-based models are still evolving. The emergence of multi-modal multi-task learning calls for a general method that works on vision and text architectures simultaneously. We introduce a differentiable mask that induces sparsity at various granularities to fill this gap. We apply our method successfully to prune weights, filters, and subnetworks of a convolutional architecture, as well as nodes of a recurrent network.

Submitted 29 April, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

arXiv:1906.11407 [pdf, other]

doi 10.1126/scirobotics.aaw6326

Emergence of Exploratory Look-Around Behaviors through Active Observation Completion

Authors: Santhosh K. Ramakrishnan, Dinesh Jayaraman, Kristen Grauman

Abstract: Standard computer vision systems assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is a major challenge in itself. We address the problem of learning to look around: how can an agent learn to acquire informative visual observations? We propose a reinforcement learning solution, where the agent is rewarded for reducing its uncertainty about the unobserved portions of its environment. Specifically, the agent is trained to select a short sequence of glimpses after which it must infer the appearance of its full environment. To address the challenge of sparse rewards, we further introduce sidekick policy learning, which exploits the asymmetry in observability between training and test time. The proposed methods learn observation policies that not only perform the completion task for which they are trained, but also generalize to exhibit useful "look-around" behavior for a range of active perception tasks.

Submitted 26 June, 2019; originally announced June 2019.

Comments: Main paper 7 figures, supplementary 6 figures. Published in Science Robotics 2019

arXiv:1905.05675 [pdf, other]

The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artificial Intelligence

Authors: Radoslaw Martin Cichy, Gemma Roig, Alex Andonian, Kshitij Dwivedi, Benjamin Lahner, Alex Lascelles, Yalda Mohsenzadeh, Kandan Ramakrishnan, Aude Oliva

Abstract: In the last decade, artificial intelligence (AI) models inspired by the brain have made unprecedented progress in performing real-world perceptual tasks like object classification and speech recognition. Recently, researchers of natural intelligence have begun using those AI models to explore how the brain performs such tasks. These developments suggest that future progress will benefit from increased interaction between disciplines. Here we introduce the Algonauts Project as a structured and quantitative communication channel for interdisciplinary interaction between natural and artificial intelligence researchers. The project's core is an open challenge with a quantitative benchmark whose goal is to account for brain data through computational models. This project has the potential to provide better models of natural intelligence and to gather findings that advance AI. The 2019 Algonauts Project focuses on benchmarking computational models predicting human brain activity when people look at pictures of objects. The 2019 edition of the Algonauts Project is available online: http://algonauts.csail.mit.edu/.

Submitted 14 May, 2019; originally announced May 2019.

Comments: 4 pages, 2 figures

arXiv:1904.00775 [pdf, other]

Deep Demosaicing for Edge Implementation

Authors: Ramchalam Kinattinkara Ramakrishnan, Shangling Jui, Vahid Partovi Nia

Abstract: Most digital cameras use sensors coated with a Color Filter Array (CFA) to capture channel components at every pixel location, resulting in a mosaic image that does not contain pixel values in all channels. Current research on reconstructing these missing channels, also known as demosaicing, introduces many artifacts, such as zipper effect and false color. Many deep learning demosaicing techniques outperform other classical techniques in reducing the impact of artifacts. However, most of these models tend to be over-parametrized. Consequently, edge implementation of the state-of-the-art deep learning-based demosaicing algorithms on low-end edge devices is a major challenge. We provide an exhaustive search of deep neural network architectures and obtain a pareto front of Color Peak Signal to Noise Ratio (CPSNR) as the performance criterion versus the number of parameters as the model complexity that beats the state-of-the-art. Architectures on the pareto front can then be used to choose the best architecture for a variety of resource constraints. Simple architecture search methods such as exhaustive search and grid search require some conditions of the loss function to converge to the optimum. We clarify these conditions in a brief theoretical study.

Submitted 23 May, 2019; v1 submitted 26 March, 2019; originally announced April 2019.

Comments: Accepted in the 16th International Conference of Image Analysis and Recognition (ICIAR 2019)

arXiv:1807.11010 [pdf, other]

Sidekick Policy Learning for Active Visual Exploration

Authors: Santhosh K. Ramakrishnan, Kristen Grauman

Abstract: We consider an active visual exploration scenario, where an agent must intelligently select its camera motions to efficiently reconstruct the full environment from only a limited set of narrow field-of-view glimpses. While the agent has full observability of the environment during training, it has only partial observability once deployed, being constrained by what portions it has seen and what camera motions are permissible. We introduce sidekick policy learning to capitalize on this imbalance of observability. The main idea is a preparatory learning phase that attempts simplified versions of the eventual exploration task, then guides the agent via reward shaping or initial policy supervision. To support interpretation of the resulting policies, we also develop a novel policy visualization technique. Results on active visual exploration tasks with 360 scenes and 3D objects show that sidekicks consistently improve performance and convergence rates over existing methods. Code, data and demos are available.

Submitted 29 July, 2018; originally announced July 2018.

Comments: 26 pages, 13 figures, to appear in ECCV 2018

arXiv:1801.03150 [pdf, other]

Moments in Time Dataset: one million videos for event understanding

Authors: Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, Aude Oliva

Abstract: We present the Moments in Time Dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds. Modeling the spatial-audio-temporal dynamics even for actions occurring in 3 second videos poses many challenges: meaningful events do not include only people, but also objects, animals, and natural phenomena; visual and auditory events can be symmetrical in time ("opening" is "closing" in reverse), and either transient or sustained. We describe the annotation process of our dataset (each video is tagged with one action or activity label among 339 different classes), analyze its scale and diversity in comparison to other large-scale video datasets for action recognition, and report results of several baseline models addressing separately, and jointly, three modalities: spatial, temporal and auditory. The Moments in Time dataset, designed to have a large coverage and diversity of events in both visual and auditory modalities, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.

Submitted 16 February, 2019; v1 submitted 9 January, 2018; originally announced January 2018.

arXiv:1711.09648 [pdf, ps, other]

Transfer Learning in CNNs Using Filter-Trees

Authors: Suresh Kirthi Kumaraswamy, PS Sastry, KR Ramakrishnan

Abstract: Convolutional Neural Networks (CNNs) are very effective for many pattern recognition tasks. However, training deep CNNs needs extensive computation and large training data. In this paper we propose Bank of Filter-Trees (BFT) as a transfer learning mechanism for improving efficiency of learning CNNs. A filter-tree corresponding to a filter in the k^{th} convolutional layer of a CNN is a subnetwork consisting of the filter along with all its connections to filters in all preceding layers. An ensemble of such filter-trees created from the k^{th} layers of many CNNs learnt on different but related tasks forms the BFT. To learn a new CNN, we sample from the BFT to select a set of filter-trees. This fixes the target net up to the k^{th} layer and only the remaining network would be learnt using training data of the new task. Through simulations we demonstrate the effectiveness of this idea of BFT. This method constitutes a novel transfer learning technique where transfer is at a subnetwork level; transfer can be effected from multiple source networks; and, with no finetuning of the transferred weights, the performance achieved is on par with networks that are trained from scratch.

Submitted 27 November, 2017; originally announced November 2017.

Comments: 8 pages, 3 figures

arXiv:1706.02331 [pdf, other]

CoMaL Tracking: Tracking Points at the Object Boundaries

Authors: Santhosh K. Ramakrishnan, Swarna Kamlam Ravindran, Anurag Mittal

Abstract: Traditional point tracking algorithms such as the KLT use local 2D information aggregation for feature detection and tracking, due to which their performance degrades at the object boundaries that separate multiple objects. Recently, CoMaL Features have been proposed that handle such a case. However, they proposed a simple tracking framework where the points are re-detected in each frame and matched. This is inefficient and may also lose many points that are not re-detected in the next frame. We propose a novel tracking algorithm to accurately and efficiently track CoMaL points. For this, the level line segment associated with the CoMaL points is matched to MSER segments in the next frame using shape-based matching and the matches are further filtered using texture-based matching. Experiments show improvements over a simple re-detect-and-match framework as well as KLT in terms of speed/accuracy on different real-world applications, especially at the object boundaries.

Submitted 7 June, 2017; originally announced June 2017.

Comments: 10 pages, 10 figures, to appear in 1st Joint BMTT-PETS Workshop on Tracking and Surveillance, CVPR 2017

arXiv:1706.01757 [pdf, other]

doi 10.1016/j.cortex.2017.09.019

Visual pathways from the perspective of cost functions and multi-task deep neural networks

Authors: H. Steven Scholte, Max M. Losch, Kandan Ramakrishnan, Edward H. F. de Haan, Sander M. Bohte

Abstract: Vision research has been shaped by the seminal insight that we can understand the higher-tier visual cortex from the perspective of multiple functional pathways with different goals. In this paper, we try to give a computational account of the functional organization of this system by reasoning from the perspective of multi-task deep neural networks. Machine learning has shown that tasks become easier to solve when they are decomposed into subtasks with their own cost function. We hypothesize that the visual system optimizes multiple cost functions of unrelated tasks and this causes the emergence of a ventral pathway dedicated to vision for perception, and a dorsal pathway dedicated to vision for action. To evaluate the functional organization in multi-task deep neural networks, we propose a method that measures the contribution of a unit towards each task, applying it to two networks that have been trained on either two related or two unrelated tasks, using an identical stimulus set. Results show that the network trained on the unrelated tasks shows a decreasing degree of feature representation sharing towards higher-tier layers while the network trained on related tasks uniformly shows high degree of sharing. We conjecture that the method we propose can be used to analyze the anatomical and functional organization of the visual system and beyond. We predict that the degree to which tasks are related is a good descriptor of the degree to which they share downstream cortical-units.

Submitted 16 September, 2017; v1 submitted 6 June, 2017; originally announced June 2017.

Comments: 16 pages, 5 figures

arXiv:1704.02516 [pdf, other]

An Empirical Evaluation of Visual Question Answering for Novel Objects

Authors: Santhosh K. Ramakrishnan, Ambar Pal, Gaurav Sharma, Anurag Mittal

Abstract: We study the problem of answering questions about images in the harder setting, where the test questions and corresponding images contain novel objects, which were not queried about in the training data. Such a setting is inevitable in the real world: owing to the heavy-tailed distribution of the visual categories, there would be some objects which would not be annotated in the train set. We show that the performance of two popular existing methods drops significantly (up to 28%) when evaluated on novel objects compared to known objects. We propose methods which use large existing external corpora of (i) unlabeled text, i.e. books, and (ii) images tagged with classes, to achieve novel object based visual question answering. We do systematic empirical studies, for both an oracle case where the novel objects are known textually, as well as a fully automatic case without any explicit knowledge of the novel objects, but with the minimal assumption that the novel objects are semantically related to the existing objects in training. The proposed methods for novel object based visual question answering are modular and can potentially be used with many visual question answering architectures. We show consistent improvements with the two popular architectures and give qualitative analysis of the cases where the model does well and of those where it fails to bring improvements.

Submitted 8 April, 2017; originally announced April 2017.

Comments: 11 pages, 4 figures, accepted in CVPR 2017 (poster)

arXiv:1606.02599 [pdf, other]

SDNFV: Flexible and Dynamic Software Defined Control of an Application- and Flow-Aware Data Plane

Authors: Wei Zhang, Guyue Liu, Timothy Wood, K. K. Ramakrishnan, Jinho Hwang

Abstract: Software Defined Networking (SDN) promises greater flexibility for directing packet flows, and Network Function Virtualization promises to enable dynamic management of software-based network functions. However, the current divide between an intelligent control plane and an overly simple, stateless data plane results in the inability to exploit the flexibility of a software based network. In this paper we propose SDNFV, a framework that expands the capabilities of network processing-and-forwarding elements to flexibly manage packet flows, while retaining both a high performance data plane and an easily managed control plane. SDNFV proposes a hierarchical control framework where decisions are made across the SDN controller, a host-level manager, and individual VMs to best exploit state available at each level. This increases the network's flexibility compared to existing SDNs where controllers often make decisions solely based on the first packet header of a flow. SDNFV intelligently places network services across hosts and connects them in sequential and parallel chains, giving both the SDN controller and individual network functions the ability to enhance and update flow rules to adapt to changing conditions. Our prototype demonstrates how to efficiently and flexibly reroute flows based on data plane state such as packet payloads and traffic characteristics.

Submitted 8 June, 2016; originally announced June 2016.

arXiv:1510.08530 [pdf, other]

SAID: A Control Protocol for Scalable and Adaptive Information Dissemination in ICN

Authors: Jiachen Chen, Mayutan Arumaithurai, Xiaoming Fu, K. K. Ramakrishnan

Abstract: Information dissemination applications (video, news, social media, etc.) with a large number of receivers need to be efficient but also have limited loss tolerance. The new Information-Centric Networks (ICN) paradigm offers an alternative approach for reliably delivering data by naming content and exploiting data available at any intermediate point (e.g., caches). However, receivers are often heterogeneous, with widely varying receive rates. When using existing ICN congestion control mechanisms with in-sequence delivery, a particularly thorny problem of receivers going out-of-sync results in inefficiency and unfairness with heterogeneous receivers. We argue that separating reliability from congestion control leads to more scalable, efficient and fair data dissemination, and propose SAID, a Control Protocol for Scalable and Adaptive Information Dissemination in ICN. To maximize the amount of data transmitted at the first attempt, receivers request any next packet (ANP) of a flow instead of the next-in-sequence packet, independent of the provider's transmit rate. This allows providers to transmit at an application-efficient rate, without being limited by the slower receivers. SAID ensures reliable delivery to all receivers eventually, by cooperative repair, while preserving privacy without unduly trusting other receivers.

Submitted 28 October, 2015; originally announced October 2015.

Comments: 12 pages

arXiv:1509.08439 [pdf, other]

Hyper-Fisher Vectors for Action Recognition

Authors: Sanath Narayan, Kalpathi R. Ramakrishnan

Abstract: In this paper, a novel encoding scheme combining Fisher vector and bag-of-words encodings has been proposed for recognizing action in videos. The proposed Hyper-Fisher vector encoding is a sum of local Fisher vectors which are computed based on the traditional Bag-of-Words (BoW) encoding. Thus, the proposed encoding is a simple yet effective representation compared to the traditional Fisher Vector encoding. By extensive evaluation on challenging action recognition datasets, viz., Youtube, Olympic Sports, UCF50 and HMDB51, we show that the proposed Hyper-Fisher Vector encoding improves the recognition performance by around 2-3% compared to the improved Fisher Vector encoding. We also perform experiments to show that the performance of the Hyper-Fisher Vector is robust to the dictionary size of the BoW encoding.

Submitted 28 September, 2015; originally announced September 2015.

Showing 1–50 of 59 results for author: Ramakrishnan, K