Search | arXiv e-print repository

LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks

Authors: Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, Anil Murthy

Abstract: There is considerable confusion about the role of Large Language Models (LLMs) in planning and reasoning tasks. On one side are over-optimistic claims that LLMs can indeed do these tasks with just the right prompting or self-verification strategies. On the other side are perhaps over-pessimistic claims that all that LLMs are good for in planning/reasoning tasks are as mere translators of the probl… ▽ More There is considerable confusion about the role of Large Language Models (LLMs) in planning and reasoning tasks. On one side are over-optimistic claims that LLMs can indeed do these tasks with just the right prompting or self-verification strategies. On the other side are perhaps over-pessimistic claims that all that LLMs are good for in planning/reasoning tasks are as mere translators of the problem specification from one syntactic format to another, and ship the problem off to external symbolic solvers. In this position paper, we take the view that both these extremes are misguided. We argue that auto-regressive LLMs cannot, by themselves, do planning or self-verification (which is after all a form of reasoning), and shed some light on the reasons for misunderstandings in the literature. We will also argue that LLMs should be viewed as universal approximate knowledge sources that have much more meaningful roles to play in planning/reasoning tasks beyond simple front-end/back-end format translators. We present a vision of {\bf LLM-Modulo Frameworks} that combine the strengths of LLMs with external model-based verifiers in a tighter bi-directional interaction regime. We will show how the models driving the external verifiers themselves can be acquired with the help of LLMs. We will also argue that rather than simply pipelining LLMs and symbolic components, this LLM-Modulo Framework provides a better neuro-symbolic approach that offers tighter integration between LLMs and symbolic components, and allows extending the scope of model-based planning/reasoning regimes towards more flexible knowledge, problem and preference specifications. △ Less

Submitted 11 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Journal ref: Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

arXiv:2312.14292 [pdf, other]

Benchmarking Multi-Agent Preference-based Reinforcement Learning for Human-AI Teaming

Authors: Siddhant Bhambri, Mudit Verma, Anil Murthy, Subbarao Kambhampati

Abstract: Preference-based Reinforcement Learning (PbRL) is an active area of research, and has made significant strides in single-agent actor and in observer human-in-the-loop scenarios. However, its application within the co-operative multi-agent RL frameworks, where humans actively participate and express preferences for agent behavior, remains largely uncharted. We consider a two-agent (Human-AI) cooper… ▽ More Preference-based Reinforcement Learning (PbRL) is an active area of research, and has made significant strides in single-agent actor and in observer human-in-the-loop scenarios. However, its application within the co-operative multi-agent RL frameworks, where humans actively participate and express preferences for agent behavior, remains largely uncharted. We consider a two-agent (Human-AI) cooperative setup where both the agents are rewarded according to human's reward function for the team. However, the agent does not have access to it, and instead, utilizes preference-based queries to elicit its objectives and human's preferences for the robot in the human-robot team. We introduce the notion of Human-Flexibility, i.e. whether the human partner is amenable to multiple team strategies, with a special case being Specified Orchestration where the human has a single team policy in mind (most constrained case). We propose a suite of domains to study PbRL for Human-AI cooperative setup which explicitly require forced cooperation. Adapting state-of-the-art single-agent PbRL algorithms to our two-agent setting, we conduct a comprehensive benchmarking study across our domain suite. Our findings highlight the challenges associated with high degree of Human-Flexibility and the limited access to the human's envisioned policy in PbRL for Human-AI cooperation. Notably, we observe that PbRL algorithms exhibit effective performance exclusively in the case of Specified Orchestration which can be seen as an upper bound PbRL performance for future research. △ Less

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2303.07130 [pdf, other]

Enhancing COVID-19 Severity Analysis through Ensemble Methods

Authors: Anand Thyagachandran, Hema A Murthy

Abstract: Computed Tomography (CT) scans provide a detailed image of the lungs, allowing clinicians to observe the extent of damage caused by COVID-19. The CT severity score (CTSS) based scoring method is used to identify the extent of lung involvement observed on a CT scan. This paper presents a domain knowledge-based pipeline for extracting regions of infection in COVID-19 patients using a combination of… ▽ More Computed Tomography (CT) scans provide a detailed image of the lungs, allowing clinicians to observe the extent of damage caused by COVID-19. The CT severity score (CTSS) based scoring method is used to identify the extent of lung involvement observed on a CT scan. This paper presents a domain knowledge-based pipeline for extracting regions of infection in COVID-19 patients using a combination of image-processing algorithms and a pre-trained UNET model. The severity of the infection is then classified into different categories using an ensemble of three machine-learning models: Extreme Gradient Boosting, Extremely Randomized Trees, and Support Vector Machine. The proposed system was evaluated on a validation dataset in the AI-Enabled Medical Image Analysis Workshop and COVID-19 Diagnosis Competition (AI-MIA-COV19D) and achieved a macro F1 score of 64%. These results demonstrate the potential of combining domain knowledge with machine learning techniques for accurate COVID-19 diagnosis using CT scans. The implementation of the proposed system for severity analysis is available at \textit{https://github.com/aanandt/Enhancing-COVID-19-Severity-Analysis-through-Ensemble-Methods.git } △ Less

Submitted 17 March, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

arXiv:2302.06227 [pdf, other]

Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages

Authors: Sudhanshu Srivastava, Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

Abstract: Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis while being computationally less intense. HTS performs well even in low-resource scenarios. The primary drawback is that the voice quality is poor compared to that of E2E systems. A hybrid approach combining HMM-based feature generation and neural-network-based HiFi-GAN… ▽ More Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis while being computationally less intense. HTS performs well even in low-resource scenarios. The primary drawback is that the voice quality is poor compared to that of E2E systems. A hybrid approach combining HMM-based feature generation and neural-network-based HiFi-GAN vocoder to improve HTS synthesis quality is proposed. HTS is trained on high-resolution mel-spectrograms instead of conventional mel generalized coefficients (MGC), and the output mel-spectrogram corresponding to the input text is used in a HiFi-GAN vocoder trained on Indic languages, to produce naturalness that is equivalent to that of E2E systems, as evidenced from the DMOS and PC tests. △ Less

Submitted 13 February, 2023; originally announced February 2023.

Comments: 5 pages, 5 figures

arXiv:2211.08790 [pdf, other]

Structural Segmentation and Labeling of Tabla Solo Performances

Authors: Gowriprasad R, R Aravind, Hema A Murthy

Abstract: Tabla is a North Indian percussion instrument used as an accompaniment and an exclusive instrument for solo performances. Tabla solo is intricate and elaborate, exhibiting rhythmic evolution through a sequence of homogeneous sections marked by shared rhythmic characteristics. Each section has a specific structure and name associated with it. Tabla learning and performance in the Indian subcontinen… ▽ More Tabla is a North Indian percussion instrument used as an accompaniment and an exclusive instrument for solo performances. Tabla solo is intricate and elaborate, exhibiting rhythmic evolution through a sequence of homogeneous sections marked by shared rhythmic characteristics. Each section has a specific structure and name associated with it. Tabla learning and performance in the Indian subcontinent is based on stylistic schools called gharana-s. Several compositions by various composers from different gharana-s are played in each section. This paper addresses the task of segmenting the tabla solo concert into musically meaningful sections. We then assign suitable section labels and recognize gharana-s from the sections. We present a diverse collection of over 38 hours of solo tabla recordings for the task. We motivate the problem and present different challenges and facets of the tasks. Inspired by the distinct musical properties of tabla solo, we compute several rhythmic and timbral features for the segmentation task. This work explores the approach of automatically locating the significant changes in the rhythmic structure by analyzing local self-similarity in an unsupervised manner. We also explore supervised random forest and a convolutional neural network trained on hand-crafted features. Both supervised and unsupervised approaches are also tested on a set of held-out recordings. Segmentation of an audio piece into its structural components and labeling is crucial to many music information retrieval applications like repetitive structure finding, audio summarization, and fast music navigation. This work helps us obtain a comprehensive musical description of the tabla solo concert. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: 35 pages, 11 figures

arXiv:2211.01603 [pdf, other]

Using Signal Processing in Tandem With Adapted Mixture Models for Classifying Genomic Signals

Authors: Saish Jaiswal, Shreya Nema, Hema A Murthy, Manikandan Narayanan

Abstract: Genomic signal processing has been used successfully in bioinformatics to analyze biomolecular sequences and gain varied insights into DNA structure, gene organization, protein binding, sequence evolution, etc. But challenges remain in finding the appropriate spectral representation of a biomolecular sequence, especially when multiple variable-length sequences need to be handled consistently. In t… ▽ More Genomic signal processing has been used successfully in bioinformatics to analyze biomolecular sequences and gain varied insights into DNA structure, gene organization, protein binding, sequence evolution, etc. But challenges remain in finding the appropriate spectral representation of a biomolecular sequence, especially when multiple variable-length sequences need to be handled consistently. In this study, we address this challenge in the context of the well-studied problem of classifying genomic sequences into different taxonomic units (strain, phyla, order, etc.). We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence and subsequently the taxonomic classification accuracies. The sequences are first transformed into spectra, and projected to a subspace, where sequences belonging to different taxons are better distinguishable. Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy. △ Less

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2210.17153 [pdf, other]

The Importance of Accurate Alignments in End-to-End Speech Synthesis

Authors: Anusha Prakash, Hema A Murthy

Abstract: Unit selection synthesis systems required accurate segmentation and labeling of the speech signal owing to the concatenative nature. Hidden Markov model-based speech synthesis accommodates some transcription errors, but it was later shown that accurate transcriptions yield highly intelligible speech with smaller amounts of training data. With the arrival of end-to-end (E2E) systems, it was observe… ▽ More Unit selection synthesis systems required accurate segmentation and labeling of the speech signal owing to the concatenative nature. Hidden Markov model-based speech synthesis accommodates some transcription errors, but it was later shown that accurate transcriptions yield highly intelligible speech with smaller amounts of training data. With the arrival of end-to-end (E2E) systems, it was observed that very good quality speech could be synthesised with large amounts of data. As end-to-end synthesis progressed from Tacotron to FastSpeech2, it has become imminent that features that represent prosody are important for good-quality synthesis. In particular, durations of the sub-word units are important. Variants of FastSpeech use a teacher model or forced alignments to obtain good-quality synthesis. In this paper, we focus on duration prediction, using signal processing cues in tandem with forced alignment to produce accurate phone durations during training. The current work aims to highlight the importance of accurate alignments for good-quality synthesis. An attempt is made to train the E2E systems with accurately labeled data, and compare the same with approximately labeled data. △ Less

Submitted 31 October, 2022; originally announced October 2022.

Comments: Version 1 uploaded

arXiv:2108.02517 [pdf, other]

Multi-task Federated Edge Learning (MtFEEL) in Wireless Networks

Authors: Sawan Singh Mahara, Shruti M., B. N. Bharath, Akash Murthy

Abstract: Federated Learning (FL) has evolved as a promising technique to handle distributed machine learning across edge devices. A single neural network (NN) that optimises a global objective is generally learned in most work in FL, which could be suboptimal for edge devices. Although works finding a NN personalised for edge device specific tasks exist, they lack generalisation and/or convergence guarante… ▽ More Federated Learning (FL) has evolved as a promising technique to handle distributed machine learning across edge devices. A single neural network (NN) that optimises a global objective is generally learned in most work in FL, which could be suboptimal for edge devices. Although works finding a NN personalised for edge device specific tasks exist, they lack generalisation and/or convergence guarantees. In this paper, a novel communication efficient FL algorithm for personalised learning in a wireless setting with guarantees is presented. The algorithm relies on finding a ``better`` empirical estimate of losses at each device, using a weighted average of the losses across different devices. It is devised from a Probably Approximately Correct (PAC) bound on the true loss in terms of the proposed empirical loss and is bounded by (i) the Rademacher complexity, (ii) the discrepancy, (iii) and a penalty term. Using a signed gradient feedback to find a personalised NN at each device, it is also proven to converge in a Rayleigh flat fading (in the uplink) channel, at a rate of the order max{1/SNR,1/sqrt(T)} Experimental results show that the proposed algorithm outperforms locally trained devices as well as the conventionally used FedAvg and FedSGD algorithms under practical SNR regimes. △ Less

Submitted 9 March, 2022; v1 submitted 5 August, 2021; originally announced August 2021.

arXiv:2103.03215 [pdf, other]

Front-end Diarization for Percussion Separation in Taniavartanam of Carnatic Music Concerts

Authors: Nauman Dawalatabad, Jilt Sebastian, Jom Kuriakose, C. Chandra Sekhar, Shrikanth Narayanan, Hema A. Murthy

Abstract: Instrument separation in an ensemble is a challenging task. In this work, we address the problem of separating the percussive voices in the taniavartanam segments of Carnatic music. In taniavartanam, a number of percussive instruments play together or in tandem. Separation of instruments in regions where only one percussion is present leads to interference and artifacts at the output, as source se… ▽ More Instrument separation in an ensemble is a challenging task. In this work, we address the problem of separating the percussive voices in the taniavartanam segments of Carnatic music. In taniavartanam, a number of percussive instruments play together or in tandem. Separation of instruments in regions where only one percussion is present leads to interference and artifacts at the output, as source separation algorithms assume the presence of multiple percussive voices throughout the audio segment. We prevent this by first subjecting the taniavartanam to diarization. This process results in homogeneous clusters consisting of segments of either a single voice or multiple voices. A cluster of segments with multiple voices is identified using the Gaussian mixture model (GMM), which is then subjected to source separation. A deep recurrent neural network (DRNN) based approach is used to separate the multiple instrument segments. The effectiveness of the proposed system is evaluated on a standard Carnatic music dataset. The proposed approach provides close-to-oracle performance for non-overlap** segments and a significant improvement over traditional separation schemes. △ Less

Submitted 4 March, 2021; originally announced March 2021.

arXiv:2011.07279 [pdf, other]

Towards Zero-Shot Learning with Fewer Seen Class Examples

Authors: Vinay Kumar Verma, Ashish Mishra, Anubha Pandey, Hema A. Murthy, Piyush Rai

Abstract: We present a meta-learning based generative model for zero-shot learning (ZSL) towards a challenging setting when the number of training examples from each \emph{seen} class is very few. This setup contrasts with the conventional ZSL approaches, where training typically assumes the availability of a sufficiently large number of training examples from each of the seen classes. The proposed approach… ▽ More We present a meta-learning based generative model for zero-shot learning (ZSL) towards a challenging setting when the number of training examples from each \emph{seen} class is very few. This setup contrasts with the conventional ZSL approaches, where training typically assumes the availability of a sufficiently large number of training examples from each of the seen classes. The proposed approach leverages meta-learning to train a deep generative model that integrates variational autoencoder and generative adversarial networks. We propose a novel task distribution where meta-train and meta-validation classes are disjoint to simulate the ZSL behaviour in training. Once trained, the model can generate synthetic examples from seen and unseen classes. Synthesize samples can then be used to train the ZSL framework in a supervised manner. The meta-learner enables our model to generates high-fidelity samples using only a small number of training examples from seen classes. We conduct extensive experiments and ablation studies on four benchmark datasets of ZSL and observe that the proposed model outperforms state-of-the-art approaches by a significant margin when the number of examples per seen class is very small. △ Less

Submitted 14 November, 2020; originally announced November 2020.

Comments: Accepted in WACV 2021

arXiv:2011.02195 [pdf, other]

Correlation based Multi-phasal models for improved imagined speech EEG recognition

Authors: Rini A Sharon, Hema A Murthy

Abstract: Translation of imagined speech electroencephalogram(EEG) into human understandable commands greatly facilitates the design of naturalistic brain computer interfaces. To achieve improved imagined speech unit classification, this work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding… ▽ More Translation of imagined speech electroencephalogram(EEG) into human understandable commands greatly facilitates the design of naturalistic brain computer interfaces. To achieve improved imagined speech unit classification, this work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units. A bi-phase common representation learning module using neural networks is designed to model the correlation and reproducibility between an analysis phase and a support phase. The trained Correlation Network is then employed to extract discriminative features of the analysis phase. These features are further classified into five binary phonological categories using machine learning models such as Gaussian mixture based hidden Markov model and deep neural networks. The proposed approach further handles the non-availability of multi-phasal data during decoding. Topographic visualizations along with result-based inferences suggest that the multi-phasal correlation modelling approach proposed in the paper enhances imagined-speech EEG recognition performance. △ Less

Submitted 4 November, 2020; originally announced November 2020.

Journal ref: Interspeech SMM 2020

arXiv:2010.05497 [pdf, other]

The "Sound of Silence" in EEG -- Cognitive voice activity detection

Authors: Rini A Sharon, Hema A Murthy

Abstract: Speech cognition bears potential application as a brain computer interface that can improve the quality of life for the otherwise communication impaired people. While speech and resting state EEG are popularly studied, here we attempt to explore a "non-speech"(NS) state of brain activity corresponding to the silence regions of speech audio. Firstly, speech perception is studied to inspect the exis… ▽ More Speech cognition bears potential application as a brain computer interface that can improve the quality of life for the otherwise communication impaired people. While speech and resting state EEG are popularly studied, here we attempt to explore a "non-speech"(NS) state of brain activity corresponding to the silence regions of speech audio. Firstly, speech perception is studied to inspect the existence of such a state, followed by its identification in speech imagination. Analogous to how voice activity detection is employed to enhance the performance of speech recognition, the EEG state activity detection protocol implemented here is applied to boost the confidence of imagined speech EEG decoding. Classification of speech and NS state is done using two datasets collected from laboratory-based and commercial-based devices. The state sequential information thus obtained is further utilized to reduce the search space of imagined EEG unit recognition. Temporal signal structures and topographic maps of NS states are visualized across subjects and sessions. The recognition performance and the visual distinction observed demonstrates the existence of silence signatures in EEG. △ Less

Submitted 12 October, 2020; originally announced October 2020.

arXiv:2009.04983 [pdf, other]

Exploration of End-to-end Synthesisers forZero Resource Speech Challenge 2020

Authors: Karthik Pandia D S, Anusha Prakash, Mano Ranjith Kumar, Hema A Murthy

Abstract: A Spoken dialogue system for an unseen language is referred to as Zero resource speech. It is especially beneficial for develo** applications for languages that have low digital resources. Zero resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units,… ▽ More A Spoken dialogue system for an unseen language is referred to as Zero resource speech. It is especially beneficial for develo** applications for languages that have low digital resources. Zero resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training. Using the acoustic unit sequence, TTS models are trained. The main goal of this work is to improve the synthesis quality of zero resource TTS system. Four different systems are proposed. All the systems consist of three stages: unit discovery, followed by unit sequence to spectrogram map**, and finally spectrogram to speech inversion. Modifications are proposed to the spectrogram map** stage. These modifications include training the map** on voice data, using x-vectors to improve the map**, two-stage learning, and gender-specific modelling. Evaluation of the proposed systems in the Zerospeech 2020 challenge shows that quite good quality synthesis can be achieved. △ Less

Submitted 10 September, 2020; originally announced September 2020.

Comments: Accepted for publication in Interspeech 2020

arXiv:2007.13517 [pdf, other]

doi 10.1109/TIFS.2021.3067998

Evidence of Task-Independent Person-Specific Signatures in EEG using Subspace Techniques

Authors: Mari Ganesh Kumar, Shrikanth Narayanan, Mriganka Sur, Hema A Murthy

Abstract: Electroencephalography (EEG) signals are promising as alternatives to other biometrics owing to their protection against spoofing. Previous studies have focused on capturing individual variability by analyzing task/condition-specific EEG. This work attempts to model biometric signatures independent of task/condition by normalizing the associated variance. Toward this goal, the paper extends ideas… ▽ More Electroencephalography (EEG) signals are promising as alternatives to other biometrics owing to their protection against spoofing. Previous studies have focused on capturing individual variability by analyzing task/condition-specific EEG. This work attempts to model biometric signatures independent of task/condition by normalizing the associated variance. Toward this goal, the paper extends ideas from subspace-based text-independent speaker recognition and proposes novel modifications for modeling multi-channel EEG data. The proposed techniques assume that biometric information is present in the entire EEG signal and accumulate statistics across time in a high dimensional space. These high dimensional statistics are then projected to a lower dimensional space where the biometric information is preserved. The lower dimensional embeddings obtained using the proposed approach are shown to be task-independent. The best subspace system identifies individuals with accuracies of 86.4% and 35.9% on datasets with 30 and 920 subjects, respectively, using just nine EEG channels. The paper also provides insights into the subspace model's scalability to unseen tasks and individuals during training and the number of channels needed for subspace modeling. △ Less

Submitted 25 March, 2021; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: ©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: IEEE Transactions on Information Forensics and Security, 2021

arXiv:2006.04372 [pdf, ps, other]

doi 10.21437/Interspeech.2019-2336

Zero resource speech synthesis using transcripts derived from perceptual acoustic units

Authors: Karthik Pandia D S, Hema A Murthy

Abstract: Zerospeech synthesis is the task of building vocabulary independent speech synthesis systems, where transcriptions are not available for training data. It is, therefore, necessary to convert training data into a sequence of fundamental acoustic units that can be used for synthesis during the test. This paper attempts to discover, and model perceptual acoustic units consisting of steady-state, and… ▽ More Zerospeech synthesis is the task of building vocabulary independent speech synthesis systems, where transcriptions are not available for training data. It is, therefore, necessary to convert training data into a sequence of fundamental acoustic units that can be used for synthesis during the test. This paper attempts to discover, and model perceptual acoustic units consisting of steady-state, and transient regions in speech. The transients roughly correspond to CV, VC units, while the steady-state corresponds to sonorants and fricatives. The speech signal is first preprocessed by segmenting the same into CVC-like units using a short-term energy-like contour. These CVC segments are clustered using a connected components-based graph clustering technique. The clustered CVC segments are initialized such that the onset (CV) and decays (VC) correspond to transients, and the rhyme corresponds to steady-states. Following this initialization, the units are allowed to re-organise on the continuous speech into a final set of AUs in an HMM-GMM framework. AU sequences thus obtained are used to train synthesis models. The performance of the proposed approach is evaluated on the Zerospeech 2019 challenge database. Subjective and objective scores show that reasonably good quality synthesis with low bit rate encoding can be achieved using the proposed AUs. △ Less

Submitted 8 June, 2020; originally announced June 2020.

arXiv:2001.06657 [pdf, other]

Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval

Authors: Anubha Pandey, Ashish Mishra, Vinay Kumar Verma, Anurag Mittal, Hema A. Murthy

Abstract: Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previous… ▽ More Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes during the test. This paper proposes a generative approach based on the Stacked Adversarial Network (SAN) and the advantage of Siamese Network (SN) for ZS-SBIR. While SAN generates a high-quality sample, SN learns a better distance metric compared to that of the nearest neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to that of an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on TU-Berlin, and Sketchy database in both standard ZSL and generalized ZSL setting. The proposed method yields a significant improvement in standard ZSL as well as in a more challenging generalized ZSL setting (GZSL) for SBIR. △ Less

Submitted 18 January, 2020; originally announced January 2020.

Comments: Accepted in WACV'2020

arXiv:1904.07453 [pdf, other]

doi 10.1109/ASRU46091.2019.9003824

Spoof detection using time-delay shallow neural network and feature switching

Authors: Mari Ganesh Kumar, Suvidha Rupesh Kumar, Saranya M, B. Bharathi, Hema A. Murthy

Abstract: Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspired by the state-of-the-art \emph{x}-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for s… ▽ More Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspired by the state-of-the-art \emph{x}-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection for both logical and physical access. The novelty of the proposed TD-SNN system vis-a-vis conventional DNN systems is that it can handle variable length utterances during testing. Performance of the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is analyzed on the ASV-spoof-2019 dataset. The performance of the systems is measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, GMM surpasses TD-SNN systems for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on evaluation data with a relative improvement of 48.03\% and 49.47\% for both logical and physical access, respectively. △ Less

Submitted 23 January, 2020; v1 submitted 16 April, 2019; originally announced April 2019.

Journal ref: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1011--1017

arXiv:1903.10870 [pdf, other]

Algorithms and Improved bounds for online learning under finite hypothesis class

Authors: Ankit Sharma, Late C. A. Murthy

Abstract: Online learning is the process of answering a sequence of questions based on the correct answers to the previous questions. It is studied in many research areas such as game theory, information theory and machine learning. There are two main components of online learning framework. First, the learning algorithm also known as the learner and second, the hypothesis class which is essentially a set o… ▽ More Online learning is the process of answering a sequence of questions based on the correct answers to the previous questions. It is studied in many research areas such as game theory, information theory and machine learning. There are two main components of online learning framework. First, the learning algorithm also known as the learner and second, the hypothesis class which is essentially a set of functions which learner uses to predict answers to the questions. Sometimes, this class contains some functions which have the capability to provide correct answers to the entire sequence of questions. This case is called realizable case. And when hypothesis class does not contain such functions is called unrealizable case. The goal of the learner, in both the cases, is to make as few mistakes as that could have been made by most powerful functions in hypothesis class over the entire sequence of questions. Performance of the learners is analysed by theoretical bounds on the number of mistakes made by them. This paper proposes three algorithms to improve the mistakes bound in the unrealizable case. Proposed algorithms perform highly better than the existing ones in the long run when most of the input sequences presented to the learner are likely to be realizable. △ Less

Submitted 24 March, 2019; originally announced March 2019.

Comments: 17 pages, 2 figures, 9 tables

arXiv:1903.10672 [pdf, other]

Robustness of Neural Networks to Parameter Quantization

Authors: Abhishek Murthy, Himel Das, Md Ariful Islam

Abstract: Quantization, a commonly used technique to reduce the memory footprint of a neural network for edge computing, entails reducing the precision of the floating-point representation used for the parameters of the network. The impact of such rounding-off errors on the overall performance of the neural network is estimated using testing, which is not exhaustive and thus cannot be used to guarantee the… ▽ More Quantization, a commonly used technique to reduce the memory footprint of a neural network for edge computing, entails reducing the precision of the floating-point representation used for the parameters of the network. The impact of such rounding-off errors on the overall performance of the neural network is estimated using testing, which is not exhaustive and thus cannot be used to guarantee the safety of the model. We present a framework based on Satisfiability Modulo Theory (SMT) solvers to quantify the robustness of neural networks to parameter perturbation. To this end, we introduce notions of local and global robustness that capture the deviation in the confidence of class assignments due to parameter quantization. The robustness notions are then cast as instances of SMT problems and solved automatically using solvers, such as dReal. We demonstrate our framework on two simple Multi-Layer Perceptrons (MLP) that perform binary classification on a two-dimensional input. In addition to quantifying the robustness, we also show that Rectified Linear Unit activation results in higher robustness than linear activations for our MLPs. △ Less

Submitted 26 March, 2019; originally announced March 2019.

arXiv:1902.08051 [pdf, other]

doi 10.1109/ICASSP.2019.8683114

Incremental Transfer Learning in Two-pass Information Bottleneck based Speaker Diarization System for Meetings

Authors: Nauman Dawalatabad, Srikanth Madikeri, C Chandra Sekhar, Hema A Murthy

Abstract: The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. TPIB system does not consider previously learned speaker discriminative information while diarizing new conversations. Hence, the real time factor (RTF) of TPIB system is high owing to the training time required for the artificial neural network (ANN). This pap… ▽ More The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. TPIB system does not consider previously learned speaker discriminative information while diarizing new conversations. Hence, the real time factor (RTF) of TPIB system is high owing to the training time required for the artificial neural network (ANN). This paper attempts to improve the RTF of the TPIB system using an incremental transfer learning approach where the parameters learned by the ANN from other conversations are updated using current conversation rather than learning parameters from scratch. This reduces the RTF significantly. The effectiveness of the proposed approach compared to the baseline IB and the TPIB systems is demonstrated on standard NIST and AMI conversational meeting datasets. With a minor degradation in performance, the proposed system shows a significant improvement of 33.07% and 24.45% in RTF with respect to TPIB system on the NIST RT-04Eval and AMI-1 datasets, respectively. △ Less

Submitted 21 February, 2019; originally announced February 2019.

Comments: 5 pages, 2 figures, To appear in Proc. ICASSP 2019, May 12-17, 2019, Brighton, UK

arXiv:1811.04661 [pdf, ps, other]

RelDenClu: A Relative Density based Biclustering Method for identifying non-linear feature relations

Authors: Namita Jain, Susmita Ghosh, C. A. Murthy

Abstract: The existing biclustering algorithms for finding feature relation based biclusters often depend on assumptions like monotonicity or linearity. Though a few algorithms overcome this problem by using density-based methods, they tend to miss out many biclusters because they use global criteria for identifying dense regions. The proposed method, RelDenClu uses the local variations in marginal and join… ▽ More The existing biclustering algorithms for finding feature relation based biclusters often depend on assumptions like monotonicity or linearity. Though a few algorithms overcome this problem by using density-based methods, they tend to miss out many biclusters because they use global criteria for identifying dense regions. The proposed method, RelDenClu uses the local variations in marginal and joint densities for each pair of features to find the subset of observations, which forms the bases of the relation between them. It then finds the set of features connected by a common set of observations, resulting in a bicluster. To show the effectiveness of the proposed methodology, experimentation has been carried out on fifteen types of simulated datasets. Further, it has been applied to six real-life datasets. For three of these real-life datasets, the proposed method is used for unsupervised learning, while for other three real-life datasets it is used as an aid to supervised learning. For all the datasets the performance of the proposed method is compared with that of seven different state-of-the-art algorithms and the proposed algorithm is seen to produce better results. The efficacy of proposed algorithm is also seen by its use on COVID-19 dataset for identifying some features (genetic, demographics and others) that are likely to affect the spread of COVID-19. △ Less

Submitted 11 May, 2021; v1 submitted 12 November, 2018; originally announced November 2018.

arXiv:1709.00663 [pdf, other]

A Generative Model For Zero Shot Learning Using Conditional Variational Autoencoders

Authors: Ashish Mishra, M Shiva Krishna Reddy, Anurag Mittal, Hema A Murthy

Abstract: Zero shot learning in Image Classification refers to the setting where images from some novel classes are absent in the training data but other information such as natural language descriptions or attribute vectors of the classes are available. This setting is important in the real world since one may not be able to obtain images of all the possible classes at training. While previous approaches h… ▽ More Zero shot learning in Image Classification refers to the setting where images from some novel classes are absent in the training data but other information such as natural language descriptions or attribute vectors of the classes are available. This setting is important in the real world since one may not be able to obtain images of all the possible classes at training. While previous approaches have tried to model the relationship between the class attribute space and the image space via some kind of a transfer function in order to model the image space correspondingly to an unseen class, we take a different approach and try to generate the samples from the given attributes, using a conditional variational autoencoder, and use the generated samples for classification of the unseen classes. By extensive testing on four benchmark datasets, we show that our model outperforms the state of the art, particularly in the more realistic generalized setting, where the training classes can also appear at the test time along with the novel classes. △ Less

Submitted 27 January, 2018; v1 submitted 3 September, 2017; originally announced September 2017.

arXiv:1702.00787 [pdf, ps, other]

Distributed Approximation Algorithms for the Multiple Knapsack Problem

Authors: Ananth Murthy, Chandan Yeshwanth, Shrisha Rao

Abstract: We consider the distributed version of the Multiple Knapsack Problem (MKP), where $m$ items are to be distributed amongst $n$ processors, each with a knapsack. We propose different distributed approximation algorithms with a tradeoff between time and message complexities. The algorithms are based on the greedy approach of assigning the best item to the knapsack with the largest capacity. These alg… ▽ More We consider the distributed version of the Multiple Knapsack Problem (MKP), where $m$ items are to be distributed amongst $n$ processors, each with a knapsack. We propose different distributed approximation algorithms with a tradeoff between time and message complexities. The algorithms are based on the greedy approach of assigning the best item to the knapsack with the largest capacity. These algorithms obtain a solution with a bound of $\frac{1}{n+1}$ times the optimum solution, with either $\mathcal{O}\left(m\log n\right)$ time and $\mathcal{O}\left(m n\right)$ messages, or $\mathcal{O}\left(m\right)$ time and $\mathcal{O}\left(mn^{2}\right)$ messages. △ Less

Submitted 2 February, 2017; originally announced February 2017.

Comments: 18 pages

MSC Class: 68W15; 68W15 ACM Class: C.2.4; I.1.2

arXiv:1612.07074 [pdf]

Sparsity Measure of a Network Graph: Gini Index

Authors: Swati Goswami, C. A. Murthy, Asit K. Das

Abstract: This article examines the application of a popular measure of sparsity, Gini Index, on network graphs. A wide variety of network graphs happen to be sparse. But the index with which sparsity is commonly measured in network graphs is edge density, reflecting the proportion of the sum of the degrees of all nodes in the graph compared to the total possible degrees in the corresponding fully connected… ▽ More This article examines the application of a popular measure of sparsity, Gini Index, on network graphs. A wide variety of network graphs happen to be sparse. But the index with which sparsity is commonly measured in network graphs is edge density, reflecting the proportion of the sum of the degrees of all nodes in the graph compared to the total possible degrees in the corresponding fully connected graph. Thus edge density is a simple ratio and carries limitations, primarily in terms of the amount of information it takes into account in its definition. In this paper, we have provided a formulation for defining sparsity of a network graph by generalizing the concept of Gini Index and call it sparsity index. A majority of the six properties (viz., Robin Hood, Scaling, Rising Tide, Cloning, Bill Gates and Babies) with which sparsity measures are commonly compared are seen to be satisfied by the proposed index. A comparison between edge density and the sparsity index has been drawn with appropriate examples to highlight the efficacy of the proposed index. It has also been shown theoretically that the two measures follow similar trend for a changing graph, i.e., as the edge density of a graph increases its sparsity index decreases. Additionally, the paper draws a relationship, analytically, between the sparsity index and the exponent term of a power law distribution, a distribution which is known to approximate the degree distribution of a wide variety of network graphs. Finally, the article highlights how the proposed index together with Gini index can reveal important properties of a network graph. △ Less

Submitted 21 December, 2016; originally announced December 2016.

Comments: 15 pages with 6 figures

MSC Class: 68R10; 05C82 ACM Class: G.2.2; G.2.3

arXiv:1603.05435 [pdf, ps, other]

Modified Group Delay Based MultiPitch Estimation in Co-Channel Speech

Authors: Rajeev Rajan, Hema A. Murthy

Abstract: Phase processing has been replaced by group delay processing for the extraction of source and system parameters from speech. Group delay functions are ill-behaved when the transfer function has zeros that are close to unit circle in the z-domain. The modified group delay function addresses this problem and has been successfully used for formant and monopitch estimation. In this paper, modified gro… ▽ More Phase processing has been replaced by group delay processing for the extraction of source and system parameters from speech. Group delay functions are ill-behaved when the transfer function has zeros that are close to unit circle in the z-domain. The modified group delay function addresses this problem and has been successfully used for formant and monopitch estimation. In this paper, modified group delay functions are used for multipitch estimation in concurrent speech. The power spectrum of the speech is first flattened in order to annihilate the system characteristics, while retaining the source characteristics. Group delay analysis on this flattened spectrum picks the predominant pitch in the first pass and a comb filter is used to filter out the estimated pitch along with its harmonics. The residual spectrum is again analyzed for the next candidate pitch estimate in the second pass. The final pitch trajectories of the constituent speech utterances are formed using pitch grou** and post processing techniques. The performance of the proposed algorithm was evaluated on standard datasets using two metrics; pitch accuracy and standard deviation of fine pitch error. Our results show that the proposed algorithm is a promising pitch detection method in multipitch environment for real speech recordings. △ Less

Submitted 17 March, 2016; originally announced March 2016.

arXiv:1411.2883 [pdf, ps, other]

doi 10.1007/s13042-015-0418-6

A new estimate of mutual information based measure of dependence between two variables: properties and fast implementation

Authors: Namita Jain, C. A. Murthy

Abstract: This article proposes a new method to estimate an existing mutual information based dependence measure using histogram density estimates. Finding a suitable bin length for histogram is an open problem. We propose a new way of computing the bin length for histogram using a function of maximum separation between points. The chosen bin length leads to consistent density estimates for histogram method… ▽ More This article proposes a new method to estimate an existing mutual information based dependence measure using histogram density estimates. Finding a suitable bin length for histogram is an open problem. We propose a new way of computing the bin length for histogram using a function of maximum separation between points. The chosen bin length leads to consistent density estimates for histogram method. The values of density thus obtained are used to calculate an estimate of an existing dependence measure. The proposed estimate is named as Mutual Information Based Dependence Index (MIDI). Some important properties of MIDI have also been stated. The performance of the proposed method has been compared to generally accepted measures like Distance Correlation (dcor), Maximal Information Coefficient (MINE) in terms of accuracy and computational complexity with the help of several artificial data sets with different amounts of noise. The proposed method is able to detect many types of relationships between variables, without making any assumption about the functional form of the relationship. The power statistics of proposed method illustrate their effectiveness in detecting non linear relationship. Thus, it is able to achieve generality without a high rate of false positive cases. MIDI is found to work better on a real life data set than competing methods. The proposed method is found to overcome some of the limitations which occur with dcor and MINE. Computationally, MIDI is found to be better than dcor and MINE, in terms of time and memory, making it suitable for large data sets. △ Less

Submitted 13 September, 2015; v1 submitted 28 October, 2014; originally announced November 2014.

Comments: International Journal of Machine Learning and Cybernetics, Springer Berlin Heidelberg, 10-Sep-2015

Showing 1–26 of 26 results for author: Murthy, A