Search | arXiv e-print repository

Compositional Learning of Visually-Grounded Concepts Using Reinforcement

Authors: Zijun Lin, Haidi Azaman, M Ganesh Kumar, Cheston Tan

Abstract: Children can rapidly generalize compositionally-constructed rules to unseen test sets. On the other hand, deep reinforcement learning (RL) agents need to be trained over millions of episodes, and their ability to generalize to unseen combinations remains unclear. Hence, we investigate the compositional abilities of RL agents, using the task of navigating to specified color-shape targets in synthet… ▽ More Children can rapidly generalize compositionally-constructed rules to unseen test sets. On the other hand, deep reinforcement learning (RL) agents need to be trained over millions of episodes, and their ability to generalize to unseen combinations remains unclear. Hence, we investigate the compositional abilities of RL agents, using the task of navigating to specified color-shape targets in synthetic 3D environments. First, we show that when RL agents are naively trained to navigate to target color-shape combinations, they implicitly learn to decompose the combinations, allowing them to (re-)compose these and succeed at held-out test combinations ("compositional learning"). Second, when agents are pretrained to learn invariant shape and color concepts ("concept learning"), the number of episodes subsequently needed for compositional learning decreased by 20 times. Furthermore, only agents trained on both concept and compositional learning could solve a more complex, out-of-distribution environment in zero-shot fashion. Finally, we verified that only text encoders pretrained on image-text datasets (e.g. CLIP) reduced the number of training episodes needed for our agents to demonstrate compositional learning, and also generalized to 5 unseen colors in zero-shot fashion. Overall, our results are the first to demonstrate that RL agents can be trained to implicitly learn concepts and compositionality, to solve more complex environments in zero-shot fashion. △ Less

Submitted 3 May, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

arXiv:2309.03483 [pdf, other]

DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners

Authors: Clarence Lee, M Ganesh Kumar, Cheston Tan

Abstract: State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between all objects versus only certain objects of interest. In natural language, in order to specify a particular object or set of objects of interest, humans use determiners such as "my", "either" and "those". Determiners, as an important word class, are a type of schema in natu… ▽ More State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between all objects versus only certain objects of interest. In natural language, in order to specify a particular object or set of objects of interest, humans use determiners such as "my", "either" and "those". Determiners, as an important word class, are a type of schema in natural language about the reference or quantity of the noun. Existing grounded referencing datasets place much less emphasis on determiners, compared to other word classes such as nouns, verbs and adjectives. This makes it difficult to develop models that understand the full variety and complexity of object referencing. Thus, we have developed and released the DetermiNet dataset , which comprises 250,000 synthetically generated images and captions based on 25 determiners. The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner. We find that current state-of-the-art visual grounding models do not perform well on the dataset, highlighting the limitations of existing models on reference and quantification tasks. △ Less

Submitted 7 September, 2023; originally announced September 2023.

Comments: 10 pages, 6 figures

arXiv:2106.13541 [pdf]

doi 10.1093/cercor/bhab456

A nonlinear hidden layer enables actor-critic agents to learn multiple paired association navigation

Authors: M Ganesh Kumar, Cheston Tan, Camilo Libedinsky, Shih-Cheng Yen, Andrew Yong-Yi Tan

Abstract: Navigation to multiple cued reward locations has been increasingly used to study rodent learning. Though deep reinforcement learning agents have been shown to be able to learn the task, they are not biologically plausible. Biologically plausible classic actor-critic agents have been shown to learn to navigate to single reward locations, but which biologically plausible agents are able to learn mul… ▽ More Navigation to multiple cued reward locations has been increasingly used to study rodent learning. Though deep reinforcement learning agents have been shown to be able to learn the task, they are not biologically plausible. Biologically plausible classic actor-critic agents have been shown to learn to navigate to single reward locations, but which biologically plausible agents are able to learn multiple cue-reward location tasks has remained unclear. In this computational study, we show versions of classic agents that learn to navigate to a single reward location, and adapt to reward location displacement, but are not able to learn multiple paired association navigation. The limitation is overcome by an agent in which place cell and cue information are first processed by a feedforward nonlinear hidden layer with synapses to the actor and critic subject to temporal difference error-modulated plasticity. Faster learning is obtained when the feedforward layer is replaced by a recurrent reservoir network. △ Less

Submitted 15 July, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: 31 pages, 8 figures. Acknowledgements revised

Journal ref: Cerebral Cortex, 2022;, bhab456

arXiv:2106.03580 [pdf]

One-shot learning of paired association navigation with biologically plausible schemas

Authors: M Ganesh Kumar, Cheston Tan, Camilo Libedinsky, Shih-Cheng Yen, Andrew Yong-Yi Tan

Abstract: Schemas are knowledge structures that can enable rapid learning. Rodent one-shot learning in a multiple paired association navigation task has been postulated to be schema-dependent. But how schemas, conceptualized at Marr's computational level, correspond with neural implementations remains poorly understood, and a biologically plausible computational model of the rodent learning has not been dem… ▽ More Schemas are knowledge structures that can enable rapid learning. Rodent one-shot learning in a multiple paired association navigation task has been postulated to be schema-dependent. But how schemas, conceptualized at Marr's computational level, correspond with neural implementations remains poorly understood, and a biologically plausible computational model of the rodent learning has not been demonstrated. Here, we compose such an agent from schemas with biologically plausible neural implementations. The agent contains an associative memory that can form one-shot associations between sensory cues and goal coordinates, implemented with a feedforward layer or a reservoir of recurrently connected neurons whose plastic output weights are governed by a novel 4-factor reward-modulated Exploratory Hebbian (EH) rule. Adding an actor-critic allows the agent to succeed even if an obstacle prevents direct heading. With the addition of working memory, the rodent behavior is replicated. Temporal-difference learning of a working memory gating mechanism enables one-shot learning despite distractors. △ Less

Submitted 27 August, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: Minor revisions from version 2 preprint

arXiv:2106.01400 [pdf, other]

Dual Script E2E framework for Multilingual and Code-Switching ASR

Authors: Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema Murthy

Abstract: India is home to multiple languages, and training automatic speech recognition (ASR) systems for languages is challenging. Over time, each language has adopted words from other languages, such as English, leading to code-mixing. Most Indian languages also have their own unique scripts, which poses a major limitation in training multilingual and code-switching ASR systems. Inspired by results in… ▽ More India is home to multiple languages, and training automatic speech recognition (ASR) systems for languages is challenging. Over time, each language has adopted words from other languages, such as English, leading to code-mixing. Most Indian languages also have their own unique scripts, which poses a major limitation in training multilingual and code-switching ASR systems. Inspired by results in text-to-speech synthesis, in this work, we use an in-house rule-based phoneme-level common label set (CLS) representation to train multilingual and code-switching ASR for Indian languages. We propose two end-to-end (E2E) ASR systems. In the first system, the E2E model is trained on the CLS representation, and we use a novel data-driven back-end to recover the native language script. In the second system, we propose a modification to the E2E model, wherein the CLS representation and the native language characters are used simultaneously for training. We show our results on the multilingual and code-switching tasks of the Indic ASR Challenge 2021. Our best results achieve 6% and 5% improvement (approx) in word error rate over the baseline system for the multilingual and code-switching tasks, respectively, on the challenge development data. △ Less

Submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted for publication at Interspeech 2021

arXiv:2007.13517 [pdf, other]

doi 10.1109/TIFS.2021.3067998

Evidence of Task-Independent Person-Specific Signatures in EEG using Subspace Techniques

Authors: Mari Ganesh Kumar, Shrikanth Narayanan, Mriganka Sur, Hema A Murthy

Abstract: Electroencephalography (EEG) signals are promising as alternatives to other biometrics owing to their protection against spoofing. Previous studies have focused on capturing individual variability by analyzing task/condition-specific EEG. This work attempts to model biometric signatures independent of task/condition by normalizing the associated variance. Toward this goal, the paper extends ideas… ▽ More Electroencephalography (EEG) signals are promising as alternatives to other biometrics owing to their protection against spoofing. Previous studies have focused on capturing individual variability by analyzing task/condition-specific EEG. This work attempts to model biometric signatures independent of task/condition by normalizing the associated variance. Toward this goal, the paper extends ideas from subspace-based text-independent speaker recognition and proposes novel modifications for modeling multi-channel EEG data. The proposed techniques assume that biometric information is present in the entire EEG signal and accumulate statistics across time in a high dimensional space. These high dimensional statistics are then projected to a lower dimensional space where the biometric information is preserved. The lower dimensional embeddings obtained using the proposed approach are shown to be task-independent. The best subspace system identifies individuals with accuracies of 86.4% and 35.9% on datasets with 30 and 920 subjects, respectively, using just nine EEG channels. The paper also provides insights into the subspace model's scalability to unseen tasks and individuals during training and the number of channels needed for subspace modeling. △ Less

Submitted 25 March, 2021; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: ©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE Transactions on Information Forensics and Security, 2021

arXiv:1904.07453 [pdf, other]

doi 10.1109/ASRU46091.2019.9003824

Spoof detection using time-delay shallow neural network and feature switching

Authors: Mari Ganesh Kumar, Suvidha Rupesh Kumar, Saranya M, B. Bharathi, Hema A. Murthy

Abstract: Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspired by the state-of-the-art \emph{x}-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for s… ▽ More Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspired by the state-of-the-art \emph{x}-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection for both logical and physical access. The novelty of the proposed TD-SNN system vis-a-vis conventional DNN systems is that it can handle variable length utterances during testing. Performance of the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is analyzed on the ASV-spoof-2019 dataset. The performance of the systems is measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, GMM surpasses TD-SNN systems for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on evaluation data with a relative improvement of 48.03\% and 49.47\% for both logical and physical access, respectively. △ Less

Submitted 23 January, 2020; v1 submitted 16 April, 2019; originally announced April 2019.

Journal ref: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1011--1017

Showing 1–7 of 7 results for author: Kumar, M G