Search | arXiv e-print repository

The 3D-PC: a benchmark for visual perspective taking in humans and machines

Authors: Drew Linsley, Peisen Zhou, Alekh Karkada Ashok, Akash Nagaraj, Gaurav Gaonkar, Francis E Lewis, Zygmunt Pizlo, Thomas Serre

Abstract: Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on… ▽ More Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets. We investigated if this emergent ability for 3D analysis in DNNs is sufficient for VPT with the 3D perception challenge (3D-PC): a novel benchmark for 3D perception in humans and DNNs. The 3D-PC is comprised of three 3D-analysis tasks posed within natural scene images: 1. a simple test of object depth order, 2. a basic VPT task (VPT-basic), and 3. another version of VPT (VPT-Strategy) designed to limit the effectiveness of "shortcut" visual strategies. We tested human participants (N=33) and linearly probed or text-prompted over 300 DNNs on the challenge and found that nearly all of the DNNs approached or exceeded human accuracy in analyzing object depth order. Surprisingly, DNN accuracy on this task correlated with their object recognition performance. In contrast, there was an extraordinary gap between DNNs and humans on VPT-basic. Humans were nearly perfect, whereas most DNNs were near chance. Fine-tuning DNNs on VPT-basic brought them close to human performance, but they, unlike humans, dropped back to chance when tested on VPT-perturb. Our challenge demonstrates that the training routines and architectures of today's DNNs are well-suited for learning basic 3D properties of scenes and objects but are ill-suited for reasoning about these properties like humans do. We release our 3D-PC datasets and code to help bridge this gap in 3D perception between humans and machines. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2306.11327 [pdf, other]

eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

Authors: Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

Abstract: We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion… ▽ More We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict the prosody representations using the contextual information available in text. We compare eCat to CopyCat2, a model capable of both fine-grained prosody transfer (FPT) and multi-speaker TTS. We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted to be published in the Proceedings of InterSpeech 2023

arXiv:2305.16872 [pdf]

The Economics of Augmented and Virtual Reality

Authors: Joshua Gans, Abhishek Nagaraj

Abstract: This paper explores the economics of Augmented Reality (AR) and Virtual Reality (VR) technologies within decision-making contexts. Two metrics are proposed: Context Entropy, the informational complexity of an environment, and Context Immersivity, the value from full immersion. The analysis suggests that AR technologies assist in understanding complex contexts, while VR technologies provide access… ▽ More This paper explores the economics of Augmented Reality (AR) and Virtual Reality (VR) technologies within decision-making contexts. Two metrics are proposed: Context Entropy, the informational complexity of an environment, and Context Immersivity, the value from full immersion. The analysis suggests that AR technologies assist in understanding complex contexts, while VR technologies provide access to distant, risky, or expensive environments. The paper provides a framework for assessing the value of AR and VR applications in various business sectors by evaluating the pre-existing context entropy and context immersivity. The goal is to identify areas where immersive technologies can significantly impact and distinguish those that may be overhyped. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: 13 pages, 1 table

arXiv:2301.11722 [pdf, other]

Diffusion Models as Artists: Are we Closing the Gap between Humans and Machines?

Authors: Victor Boutin, Thomas Fel, Lakshya Singhal, Rishav Mukherji, Akash Nagaraj, Julien Colin, Thomas Serre

Abstract: An important milestone for AI is the development of algorithms that can produce drawings that are indistinguishable from those of humans. Here, we adapt the 'diversity vs. recognizability' scoring framework from Boutin et al, 2022 and find that one-shot diffusion models have indeed started to close the gap between humans and machines. However, using a finer-grained measure of the originality of in… ▽ More An important milestone for AI is the development of algorithms that can produce drawings that are indistinguishable from those of humans. Here, we adapt the 'diversity vs. recognizability' scoring framework from Boutin et al, 2022 and find that one-shot diffusion models have indeed started to close the gap between humans and machines. However, using a finer-grained measure of the originality of individual samples, we show that strengthening the guidance of diffusion models helps improve the humanness of their drawings, but they still fall short of approximating the originality and recognizability of human drawings. Comparing human category diagnostic features, collected through an online psychophysics experiment, against those derived from diffusion models reveals that humans rely on fewer and more localized features. Overall, our study suggests that diffusion models have significantly helped improve the quality of machine-generated drawings; however, a gap between humans and machines remains -- in part explainable by discrepancies in visual strategies. △ Less

Submitted 31 May, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

arXiv:2210.09053 [pdf, other]

doi 10.1007/978-981-33-4543-0_8

Cross-domain Variational Capsules for Information Extraction

Authors: Akash Nagaraj, Akhil K, Akshay Venkatesh, Srikanth HR

Abstract: In this paper, we present a characteristic extraction algorithm and the Multi-domain Image Characteristics Dataset of characteristic-tagged images to simulate the way a human brain classifies cross-domain information and generates insight. The intent was to identify prominent characteristics in data and use this identification mechanism to auto-generate insight from data in other unseen domains. A… ▽ More In this paper, we present a characteristic extraction algorithm and the Multi-domain Image Characteristics Dataset of characteristic-tagged images to simulate the way a human brain classifies cross-domain information and generates insight. The intent was to identify prominent characteristics in data and use this identification mechanism to auto-generate insight from data in other unseen domains. An information extraction algorithm is proposed which is a combination of Variational Autoencoders (VAEs) and Capsule Networks. Capsule Networks are used to decompose images into their individual features and VAEs are used to explore variations on these decomposed features. Thus, making the model robust in recognizing characteristics from variations of the data. A noteworthy point is that the algorithm uses efficient hierarchical decoding of data which helps in richer output interpretation. Noticing a dearth in the number of datasets that contain visible characteristics in images belonging to various domains, the Multi-domain Image Characteristics Dataset was created and made publicly available. It consists of thousands of images across three domains. This dataset was created with the intent of introducing a new benchmark for fine-grained characteristic recognition tasks in the future. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: This paper was originally written in 2020

Journal ref: In Innovations in Computer Science and Engineering, pp. 63-72. Springer, Singapore, 2021

arXiv:2210.09052 [pdf, other]

Digital Image Forensics using Deep Learning

Authors: Akash Nagaraj, Mukund Sood, Vivek Kapoor, Yash Mathur, Bishesh Sinha

Abstract: During the investigation of criminal activity when evidence is available, the issue at hand is determining the credibility of the video and ascertaining that the video is real. Today, one way to authenticate the footage is to identify the camera that was used to capture the image or video in question. While a very common way to do this is by using image meta-data, this data can easily be falsified… ▽ More During the investigation of criminal activity when evidence is available, the issue at hand is determining the credibility of the video and ascertaining that the video is real. Today, one way to authenticate the footage is to identify the camera that was used to capture the image or video in question. While a very common way to do this is by using image meta-data, this data can easily be falsified by changing the video content or even splicing together content from two different cameras. Given the multitude of solutions proposed to this problem, it is yet to be sufficiently solved. The aim of our project is to build an algorithm that identifies which camera was used to capture an image using traces of information left intrinsically in the image, using filters, followed by a deep neural network on these filters. Solving this problem would have a big impact on the verification of evidence used in criminal and civil trials and even news reporting. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: This paper was written in 2018 as a part of our submission to the 2018 IEEE Signal Processing Cup: Forensic Camera Model Identification Challenge

arXiv:2210.09004 [pdf, other]

doi 10.1109/ICALT.2018.00122

Real-Time Automated Answer Scoring

Authors: Akash Nagaraj, Mukund Sood, Gowri Srinivasa

Abstract: In recent years, the role of big data analytics has exponentially grown and is now slowly making its way into the education industry. Several attempts are being made in this sphere in order to improve the quality of education being provided to students and while many collaborations have been carried out before, automated scoring of answers has been explored to a rather limited extent. One of the b… ▽ More In recent years, the role of big data analytics has exponentially grown and is now slowly making its way into the education industry. Several attempts are being made in this sphere in order to improve the quality of education being provided to students and while many collaborations have been carried out before, automated scoring of answers has been explored to a rather limited extent. One of the biggest hurdles to choosing constructed-response assessments over multiple-choice assessments is the effort and large cost that comes with their evaluation and this is precisely the issue that this project aims to solve. The aim is to accept raw-input from the student in the form of their answer, preprocess the answer, and automatically score the answer. In addition, we have made this a real-time system that captures "snapshots" of the writer's progress with respect to the answer, allowing us to unearth trends with respect to the way a student thinks, and how the student has arrived at their final answer. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: This paper was originally written in mid 2018

Journal ref: In 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT), pp.231-232. IEEE, 2018

arXiv:2210.07465 [pdf, other]

Learning Algorithms in Static Analysis of Web Applications

Authors: Akash Nagaraj, Bishesh Sinha, Mukund Sood, Yash Mathur, Sanchika Gupta, Dinkar Sitaram

Abstract: Web applications are distributed applications, they are programs that run on more than one computer and communicate through a network or server. This very distributed nature of web applications, combined with the scale and sheer complexity of modern software systems complicate manual security auditing, while also creating a huge attack surface of potential hackers. These factors are making automat… ▽ More Web applications are distributed applications, they are programs that run on more than one computer and communicate through a network or server. This very distributed nature of web applications, combined with the scale and sheer complexity of modern software systems complicate manual security auditing, while also creating a huge attack surface of potential hackers. These factors are making automated analysis a necessity. Static Application Security Testing (SAST) is a method devised to automatically analyze application source code of large code bases without compiling it, and design conditions that are indicative of security vulnerabilities. However, the problem lies in the fact that the most widely used Static Application Security Testing Tools often yield unreliable results, owing to the false positive classification of vulnerabilities grossly outnumbering the classification of true positive vulnerabilities. This is one of the biggest hindrances to the proliferation of SAST testing, which leaves the user to review hundreds, if not thousands, of potential warnings, and classify them as either actionable or spurious. We try to minimize the problem of false positives by introducing a technique to filter the output of SAST tools. The aim of the project is to apply learning algorithms to the output by analyzing the true and false positives classified by OWASP Benchmark, and eliminate, or reduce the number of false positives presented to the user of the SAST Tool. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: This paper was originally written in 2019

arXiv:2210.07400 [pdf, other]

Real-time Action Recognition for Fine-Grained Actions and The Hand Wash Dataset

Authors: Akash Nagaraj, Mukund Sood, Chetna Sureka, Gowri Srinivasa

Abstract: In this paper we present a three-stream algorithm for real-time action recognition and a new dataset of handwash videos, with the intent of aligning action recognition with real-world constraints to yield effective conclusions. A three-stream fusion algorithm is proposed, which runs both accurately and efficiently, in real-time even on low-powered systems such as a Raspberry Pi. The cornerstone of… ▽ More In this paper we present a three-stream algorithm for real-time action recognition and a new dataset of handwash videos, with the intent of aligning action recognition with real-world constraints to yield effective conclusions. A three-stream fusion algorithm is proposed, which runs both accurately and efficiently, in real-time even on low-powered systems such as a Raspberry Pi. The cornerstone of the proposed algorithm is the incorporation of both spatial and temporal information, as well as the information of the objects in a video while using an efficient architecture, and Optical Flow computation to achieve commendable results in real-time. The results achieved by this algorithm are benchmarked on the UCF-101 as well as the HMDB-51 datasets, achieving an accuracy of 92.7% and 64.9% respectively. An important point to note is that the algorithm is novel in the aspect that it is also able to learn the intricate differences between extremely similar actions, which would be difficult even for the human eye. Additionally, noticing a dearth in the number of datasets for the recognition of very similar or fine-grained actions, this paper also introduces a new dataset that is made publicly available, the Hand Wash Dataset with the intent of introducing a new benchmark for fine-grained action recognition tasks in the future. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: This paper was originally written in early 2020

arXiv:2210.07397 [pdf, other]

A Concise Introduction to Reinforcement Learning in Robotics

Authors: Akash Nagaraj, Mukund Sood, Bhagya M Patil

Abstract: One of the biggest hurdles robotics faces is the facet of sophisticated and hard-to-engineer behaviors. Reinforcement learning offers a set of tools, and a framework to address this problem. In parallel, the misgivings of robotics offer a solid testing ground and evaluation metric for advancements in reinforcement learning. The two disciplines go hand-in-hand, much like the fields of Mathematics a… ▽ More One of the biggest hurdles robotics faces is the facet of sophisticated and hard-to-engineer behaviors. Reinforcement learning offers a set of tools, and a framework to address this problem. In parallel, the misgivings of robotics offer a solid testing ground and evaluation metric for advancements in reinforcement learning. The two disciplines go hand-in-hand, much like the fields of Mathematics and Physics. By means of this survey paper, we aim to invigorate links between the research communities of the two disciplines by focusing on the work done in reinforcement learning for locomotive and control aspects of robotics. Additionally, we aim to highlight not only the notable successes but also the key challenges of the application of Reinforcement Learning in Robotics. This paper aims to serve as a reference guide for researchers in reinforcement learning applied to the field of robotics. The literature survey is at a fairly introductory level, aimed at aspiring researchers. Appropriately, we have covered the most essential concepts required for research in the field of reinforcement learning, with robotics in mind. Through a thorough analysis of this problem, we are able to manifest how reinforcement learning could be applied profitably, and also focus on open-ended questions, as well as the potential for future research. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: This paper was originally written in 2019

Journal ref: Proceedings of International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications 2021

arXiv:2204.05872 [pdf, other]

Robust Quantification of Gender Disparity in Pre-Modern English Literature using Natural Language Processing

Authors: Akarsh Nagaraj, Mayank Kejriwal

Abstract: Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the Natural Language Processing (NLP) literature have been proposed for measuring such disparity using relatively extensive datasets and empirically rigorous methodologies. In this paper, we contribute to this line of research by… ▽ More Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the Natural Language Processing (NLP) literature have been proposed for measuring such disparity using relatively extensive datasets and empirically rigorous methodologies. In this paper, we contribute to this line of research by studying gender disparity, at scale, in copyright-expired literary texts published in the pre-modern period (defined in this work as the period ranging from the mid-nineteenth through the mid-twentieth century). One of the challenges in using such tools is to ensure quality control, and by extension, trustworthy statistical analysis. Another challenge is in using materials and methods that are publicly available and have been established for some time, both to ensure that they can be used and vetted in the future, and also, to add confidence to the methodology itself. We present our solution to addressing these challenges, and using multiple measures, demonstrate the significant discrepancy between the prevalence of female characters and male characters in pre-modern literature. The evidence suggests that the discrepancy declines when the author is female. The discrepancy seems to be relatively stable as we plot data over the decades in this century-long period. Finally, we aim to carefully describe both the limitations and ethical caveats associated with this study, and others like it. △ Less

Submitted 12 April, 2022; originally announced April 2022.

Showing 1–11 of 11 results for author: Nagaraj, A