arXiv:2405.19426 [pdf, other]

Deep Learning for Assessment of Oral Reading Fluency

Authors: Mithilesh Vaidya, Binaya Kumar Sahoo, Preeti Rao

Abstract: Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource-intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalable solution. Multiple complex aspects such as accuracy, rate and expressiveness underlie human judgements of reading fluency. In this work, we investigate end-to-end modeling on a training dataset of children's audio recordings of story texts labeled by human experts. The pre-trained wav2vec2.0 model is adopted due to its potential to alleviate the challenges from the limited amount of labeled data. We report the performance of a number of system variations on the relevant measures, and also probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.

Submitted 1 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.09572 [pdf, other]

Deep Neural Operator Enabled Digital Twin Modeling for Additive Manufacturing

Authors: Ning Liu, Xuxiao Li, Manoj R. Rajanna, Edward W. Reutzel, Brady Sawyer, Prahalada Rao, Jim Lua, Nam Phan, Yue Yu

Abstract: A digital twin (DT), with the components of a physics-based model, a data-driven model, and a machine learning (ML) enabled efficient surrogate, behaves as a virtual twin of the real-world physical process. In terms of Laser Powder Bed Fusion (L-PBF) based additive manufacturing (AM), a DT can predict the current and future states of the melt pool and the resulting defects corresponding to the input laser parameters, evolve itself by assimilating in-situ sensor data, and optimize the laser parameters to mitigate defect formation. In this paper, we present a deep neural operator enabled computational framework of the DT for closed-loop feedback control of the L-PBF process. This is accomplished by building a high-fidelity computational model to accurately represent the melt pool states, an efficient surrogate model to approximate the melt pool solution field, followed by a physics-based procedure to extract information from the computed melt pool simulation that can further be correlated to the defect quantities of interest (e.g., surface roughness). In particular, we leverage the data generated from the high-fidelity physics-based model and train a series of Fourier neural operator (FNO) based ML models to effectively learn the relation between the input laser parameters and the corresponding full temperature field of the melt pool. Subsequently, a set of physics-informed variables such as the melt pool dimensions and the peak temperature can be extracted to compute the resulting defects. An optimization algorithm is then exercised to control laser input and minimize defects. On the other hand, the constructed DT can also evolve with the physical twin via offline finetuning and online material calibration. Finally, a probabilistic framework is adopted for uncertainty quantification. The developed DT is envisioned to guide the AM process and facilitate high-quality manufacturing.

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2306.09384 [pdf, other]

MobileASR: A resource-aware on-device learning framework for user voice personalization applications on mobile phones

Authors: Zitha Sasindran, Harsha Yelchuri, Pooja Rao, T. V. Prabhakar

Abstract: We describe a comprehensive methodology for developing user-voice personalized automatic speech recognition (ASR) models by effectively training models on mobile phones, allowing user data and models to be stored and used locally. To achieve this, we propose a resource-aware sub-model-based training approach that considers the RAM and battery capabilities of mobile phones. By considering the evaluation metric and resource constraints of the mobile phones, we are able to perform efficient training and halt the process accordingly. To simulate real users, we use speakers with various accents. The entire on-device training and evaluation framework was then tested on various mobile phones across brands. We show that fine-tuning the models and selecting the right hyperparameter values is a trade-off between the lowest achievable performance metric, on-device training time, and memory consumption. Overall, our methodology offers a comprehensive solution for developing personalized ASR models while leveraging the capabilities of mobile phones, and balancing the need for accuracy with resource constraints.

Submitted 9 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: Accepted in AIMLSystems 2023

arXiv:2209.00291 [pdf, other]

Generating Coherent Drum Accompaniment With Fills And Improvisations

Authors: Rishabh Dahale, Vaibhav Talwadker, Preeti Rao, Prateek Verma

Abstract: Creating a complex work of art like music necessitates profound creativity. With recent advancements in deep learning and powerful models such as transformers, there has been huge progress in automatic music generation. In an accompaniment generation context, creating a coherent drum pattern with apposite fills and improvisations at proper locations in a song is a challenging task even for an experienced drummer. Drum beats tend to follow a repetitive pattern through stanzas with fills or improvisation at section boundaries. In this work, we tackle the task of drum pattern generation conditioned on the accompanying music played by four melodic instruments: Piano, Guitar, Bass, and Strings. We use the transformer sequence-to-sequence model to generate a basic drum pattern conditioned on the melodic accompaniment and find that improvisation is largely absent, possibly owing to its expectedly low representation in the training data. We propose a novelty function to capture the extent of improvisation in a bar relative to its neighbors. We train a model to predict improvisation locations from the melodic accompaniment tracks. Finally, we use a novel BERT-inspired in-filling architecture to learn the structure of both the drums and melody to in-fill elements of improvised music.

Submitted 1 September, 2022; originally announced September 2022.

Comments: 8 pages, 7 figures, 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), Bengaluru, India

arXiv:2204.03166 [pdf]

Musical Information Extraction from the Singing Voice

Authors: Preeti Rao

Abstract: Music information retrieval is currently an active research area that addresses the extraction of musically important information from audio signals, and the applications of such information. The extracted information can be used for search and retrieval of music in recommendation systems, or to aid musicological studies or even in music learning. Sophisticated signal processing techniques are applied to convert low-level acoustic signal properties to musical attributes which are further embedded in a rule-based or statistical classification framework to link with high-level descriptions such as melody, genre, mood and artist type. Vocal music comprises a large and interesting category of music where the lead instrument is the singing voice. The singing voice is more versatile than many musical instruments and therefore poses interesting challenges to information retrieval systems. In this paper, we provide a brief overview of research in vocal music processing followed by a description of related work at IIT Bombay leading to the development of an interface for melody detection of singing voice in polyphony.

Submitted 6 April, 2022; originally announced April 2022.

arXiv:2203.06583 [pdf]

Bi-Sampling Approach to Classify Music Mood leveraging Raga-Rasa Association in Indian Classical Music

Authors: Mohan Rao B C, Vinayak Arkachaari, Harsha M N, Sushmitha M N, Gayathri Ramesh K K, Ullas M S, Pathi Mohan Rao, Sudha G, Narayana Darapaneni

Abstract: The impact of music on the mood or emotion of the listener is a well-researched area in human psychology and behavioral science. In Indian classical music, ragas are the melodic structures that define the various styles and forms of the music. Each raga has been found to evoke a specific emotion in the listener. With the advent of advanced capabilities of audio signal processing and the application of machine learning, the demand for intelligent music classifiers and recommenders has received increased attention, especially in 'Music as a service' cloud applications. This paper explores a novel framework to leverage the raga-rasa association in Indian classical music to build an intelligent classifier and its application in a music recommendation system based on the user's current mood and the mood they aspire to be in.

Submitted 13 March, 2022; originally announced March 2022.

arXiv:2112.03871 [pdf, ps, other]

Training end-to-end speech-to-text models on mobile phones

Authors: Zitha S, Raghavendra Rao Suresh, Pooja Rao, T. V. Prabhakar

Abstract: Training state-of-the-art speech-to-text (STT) models on mobile devices is challenging due to their limited resources relative to a server environment. In addition, these models are trained on generic datasets that are not exhaustive in capturing user-specific characteristics. Recently, on-device personalization techniques have been making strides in mitigating the problem. Although many current works have already explored the effectiveness of on-device personalization, the majority of their findings are limited to simulation settings or a specific smartphone. In this paper, we develop and provide a detailed explanation of our framework to train end-to-end models on mobile phones. To keep it simple, we considered a model based on connectionist temporal classification (CTC) loss. We evaluated the framework on various mobile phones from different brands and reported the results. We provide enough evidence that fine-tuning the models and choosing the right hyperparameter values is a trade-off between the lowest WER achievable, training time on-device, and memory consumption. Hence, this is vital for a successful deployment of on-device training onto a resource-limited environment like mobile phones. We use training sets from speakers with different accents and record a 7.6% decrease in average word error rate (WER). We also report the associated computational cost measurements with respect to time, memory usage, and CPU utilization on mobile phones in real time.

Submitted 7 December, 2021; originally announced December 2021.

arXiv:2112.00635 [pdf, other]

Predicting lexical skills from oral reading with acoustic measures

Authors: Charvi Vitthal, Shreeharsha B S, Kamini Sabu, Preeti Rao

Abstract: Literacy assessment is an important activity for education administrators across the globe. Typically achieved in a school setting by testing a child's oral reading, it is intensive in human resources. While automatic speech recognition (ASR) is a potential solution to the problem, it tends to be computationally expensive for hand-held devices apart from needing language and accent-specific speech for training. In this work, we propose a system to predict the word-decoding skills of a student based on simple acoustic features derived from the recording. We first identify a meaningful categorization of word-decoding skills by analyzing a manually transcribed data set of children's oral reading recordings. Next the automatic prediction of the category is attempted with the proposed acoustic features. Pause statistics, syllable rate and spectral and intensity dynamics are found to be reliable indicators of specific types of oral reading deficits, providing useful feedback by discriminating the different characteristics of beginning readers. This computationally simple and language-agnostic approach is found to provide a performance close to that obtained using a language dependent ASR that required considerable tuning of its parameters.

Submitted 1 December, 2021; originally announced December 2021.

arXiv:2110.14273 [pdf, other]

Deep Learning For Prominence Detection In Children's Read Speech

Authors: Mithilesh Vaidya, Kamini Sabu, Preeti Rao

Abstract: The detection of perceived prominence in speech has attracted approaches ranging from the design of linguistic knowledge-based acoustic features to the automatic feature learning from suprasegmental attributes such as pitch and intensity contours. We present here, in contrast, a system that operates directly on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment. The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters as the first convolutional layer. We further explore the benefits of the linguistic association between the prosodic events of phrase boundary and prominence with different multi-task architectures. Matching the previously reported performance on the same dataset of a random forest ensemble predictor trained on carefully chosen hand-crafted acoustic features, we evaluate further the possibly complementary information from hand-crafted acoustic and pre-trained lexical features.

Submitted 27 October, 2021; originally announced October 2021.

Comments: Under review at ICASSP 2022.

arXiv:2109.12434 [pdf, other]

Emergent behavior and neural dynamics in artificial agents tracking turbulent plumes

Authors: Satpreet Harcharan Singh, Floris van Breugel, Rajesh P. N. Rao, Bingni Wen Brunton

Abstract: Tracking a turbulent plume to locate its source is a complex control problem because it requires multi-sensory integration and must be robust to intermittent odors, changing wind direction, and variable plume statistics. This task is routinely performed by flying insects, often over long distances, in pursuit of food or mates. Several aspects of this remarkable behavior have been studied in detail in many experimental studies. Here, we take a complementary in silico approach, using artificial agents trained with reinforcement learning to develop an integrated understanding of the behaviors and neural computations that support plume tracking. Specifically, we use deep reinforcement learning (DRL) to train recurrent neural network (RNN) agents to locate the source of simulated turbulent plumes. Interestingly, the agents' emergent behaviors resemble those of flying insects, and the RNNs learn to represent task-relevant variables, such as head direction and time since last odor encounter. Our analyses suggest an intriguing experimentally testable hypothesis for tracking plumes in changing wind direction: that agents follow local plume shape rather than the current wind direction. While reflexive short-memory behaviors are sufficient for tracking plumes in constant wind, longer timescales of memory are essential for tracking plumes that switch direction. At the level of neural dynamics, the RNNs' population activity is low-dimensional and organized into distinct dynamical structures, with some correspondence to behavioral modules. Our in silico approach provides key intuitions for turbulent plume tracking strategies and motivates future targeted experimental and theoretical developments.

Submitted 17 December, 2021; v1 submitted 25 September, 2021; originally announced September 2021.

ACM Class: I.2.6; I.2.0; I.5.1

arXiv:2104.09064 [pdf, other]

Automatic Stroke Classification of Tabla Accompaniment in Hindustani Vocal Concert Audio

Authors: Rohit M. A., Preeti Rao

Abstract: The tabla is a unique percussion instrument due to the combined harmonic and percussive nature of its timbre, and the contrasting harmonic frequency ranges of its two drums. This allows a tabla player to uniquely emphasize parts of the rhythmic cycle (theka) in order to mark the salient positions. An analysis of the loudness dynamics and timing deviations at various cycle positions is an important part of musicological studies on the expressivity in tabla accompaniment. To achieve this at a corpus-level, and not restrict it to the few recordings that manual annotation can afford, it is helpful to have access to an automatic tabla transcription system. Although a few systems have been built by training models on labeled tabla strokes, the achieved accuracy does not necessarily carry over to unseen instruments. In this article, we report our work towards building an instrument-independent stroke classification system for accompaniment tabla based on the more easily available tabla solo audio tracks. We present acoustic features that capture the distinctive characteristics of tabla strokes and build an automatic system to predict the label as one of a reduced, but musicologically motivated, target set of four stroke categories. To address the lack of sufficient labeled training data, we turn to common data augmentation methods and find the use of pitch-shifting based augmentation to be most promising. We then analyse the important features and highlight the problem of their instrument-dependence while motivating the use of more task-specific data augmentation strategies to improve the diversity of training data.

Submitted 19 April, 2021; originally announced April 2021.

Comments: To appear in the JOURNAL OF ACOUSTICAL SOCIETY OF INDIA, April 2021

arXiv:2104.05488 [pdf, other]

CNN Encoding of Acoustic Parameters for Prominence Detection

Authors: Kamini Sabu, Mithilesh Vaidya, Preeti Rao

Abstract: Expressive reading, considered the defining attribute of oral reading fluency, comprises the prosodic realization of phrasing and prominence. In the context of evaluating oral reading, it helps to establish the speaker's comprehension of the text. We consider a labeled dataset of children's reading recordings for the speaker-independent detection of prominent words using acoustic-prosodic and lexico-syntactic features. A previous well-tuned random forest ensemble predictor is replaced by an RNN sequence classifier to exploit potential context dependency across the longer utterance. Further, deep learning is applied to obtain word-level features from low-level acoustic contours of fundamental frequency, intensity and spectral shape in an end-to-end fashion. Performance comparisons are presented across the different feature types and across different feature learning architectures for prominent word prediction to draw insights wherever possible.

Submitted 27 January, 2022; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures, 6 tables, Submitted to INTERSPEECH 2021

arXiv:2103.04346 [pdf, other]

An Optimized Signal Processing Pipeline for Syllable Detection and Speech Rate Estimation

Authors: Kamini Sabu, Syomantak Chaudhuri, Preeti Rao, Mahesh Patil

Abstract: Syllable detection is an important speech analysis task with applications in speech rate estimation, word segmentation, and automatic prosody detection. Based on the well understood acoustic correlates of speech articulation, it has been realized by local peak picking on a frequency-weighted energy contour that represents vowel sonority. While several of the analysis parameters are set based on known speech signal properties, the selection of the frequency-weighting coefficients and peak-picking threshold typically involves heuristics, raising the possibility of data-based optimisation. In this work, we consider the optimization of the parameters based on the direct minimization of naturally arising task-specific objective functions. The resulting non-convex cost function is minimized using a population-based search algorithm to achieve a performance that exceeds previously published performance results on the same corpus using a relatively low amount of labeled data. Further, the optimisation of system parameters on a different corpus is shown to result in an explainable change in the optimal values.

Submitted 7 March, 2021; originally announced March 2021.

Comments: 6 pages, 3 figures, accepted in National Conference on Communications (NCC) 2020

arXiv:2008.08405 [pdf, other]

HpRNet : Incorporating Residual Noise Modeling for Violin in a Variational Parametric Synthesizer

Authors: Krishna Subramani, Preeti Rao

Abstract: Generative Models for Audio Synthesis have been gaining momentum in the last few years. More recently, parametric representations of the audio signal have been incorporated to facilitate better musical control of the synthesized output. In this work, we investigate a parametric model for violin tones, in particular the generative modeling of the residual bow noise to make for more natural tone quality. To aid in our analysis, we introduce a dataset of Carnatic Violin Recordings where bow noise is an integral part of the playing style of higher pitched notes in specific gestural contexts. We obtain insights about each of the harmonic and residual components of the signal, as well as their interdependence, via observations on the latent space derived in the course of variational encoding of the spectral envelopes of the sustained sounds.

Submitted 19 August, 2020; originally announced August 2020.

Comments: https://github.com/SubramaniKrishna/HpRNet

arXiv:2008.00756 [pdf, other]

Structure and Automatic Segmentation of Dhrupad Vocal Bandish Audio

Authors: Rohit M. A., Preeti Rao

Abstract: A Dhrupad vocal concert comprises a composition section that is interspersed with improvised episodes of increased rhythmic activity involving the interaction between the vocals and the percussion. Tracking the changing rhythmic density, in relation to the underlying metric tempo of the piece, thus facilitates the detection and labeling of the improvised sections in the concert structure. This work concerns the automatic detection of the musically relevant rhythmic densities as they change in time across the bandish (composition) performance. An annotated dataset of Dhrupad bandish concert sections is presented. We investigate a CNN-based system, trained to detect local tempo relationships, and follow it with temporal smoothing. We also employ audio source separation as a pre-processing step to the detection of the individual surface densities of the vocals and the percussion. This helps us obtain the complete musical description of the concert sections in terms of capturing the changing rhythmic interaction of the two performers.

Submitted 3 August, 2020; originally announced August 2020.

Comments: Part of this work published in ISMIR 2020

arXiv:2004.00001 [pdf, other]

doi 10.1109/ICASSP40776.2020.9054181

VaPar Synth -- A Variational Parametric Model for Audio Synthesis

Authors: Krishna Subramani, Preeti Rao, Alexandre D'Hooge

Abstract: With the advent of data-driven statistical modeling and abundant computing power, researchers are turning increasingly to deep learning for audio synthesis. These methods try to model audio signals directly in the time or frequency domain. In the interest of more flexible control over the generated sound, it could be more useful to work with a parametric representation of the signal which corresponds more directly to the musical attributes such as pitch, dynamics and timbre. We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation. We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.

Submitted 30 March, 2020; originally announced April 2020.

Comments: https://github.com/SubramaniKrishna/VaPar-Synth , Accepted in ICASSP 2020

arXiv:2002.06595 [pdf, other]

Speech-to-Singing Conversion in an Encoder-Decoder Framework

Authors: Jayneel Parekh, Preeti Rao, Yi-Hsuan Yang

Abstract: In this paper our goal is to convert a set of spoken lines into sung ones. Unlike previous signal processing based methods, we take a learning based approach to the problem. This allows us to automatically model various aspects of this transformation, thus overcoming dependence on specific inputs such as high quality singing templates or phoneme-score synchronization information. Specifically, we propose an encoder-decoder framework for our task. Given time-frequency representations of speech and a target melody contour, we learn encodings that enable us to synthesize singing that preserves the linguistic content and timbre of the speaker while adhering to the target melody. We also propose a multi-task learning based objective to improve lyric intelligibility. We present a quantitative and qualitative analysis of our framework.

Submitted 16 February, 2020; originally announced February 2020.

Comments: Accepted at IEEE ICASSP 2020

arXiv:2001.08349 [pdf, other]

Investigating naturalistic hand movements by behavior mining in long-term video and neural recordings

Authors: Satpreet H. Singh, Steven M. Peterson, Rajesh P. N. Rao, Bingni W. Brunton

Abstract: Recent technological advances in brain recording and artificial intelligence are propelling a new paradigm in neuroscience beyond the traditional controlled experiment. Rather than focusing on cued, repeated trials, naturalistic neuroscience studies neural processes underlying spontaneous behaviors performed in unconstrained settings. However, analyzing such unstructured data lacking a priori experimental design remains a significant challenge, especially when the data is multi-modal and long-term. Here we describe an automated approach for analyzing simultaneously recorded long-term, naturalistic electrocorticography (ECoG) and naturalistic behavior video data. We take a behavior-first approach to analyzing the long-term recordings. Using a combination of computer vision, discrete latent-variable modeling, and string pattern-matching on the behavioral video data, we find and annotate spontaneous human upper-limb movement events. We show results from our approach applied to data collected for 12 human subjects over 7--9 days for each subject. Our pipeline discovers and annotates over 40,000 instances of naturalistic human upper-limb movement events in the behavioral videos. Analysis of the simultaneously recorded brain data reveals neural signatures of movement that corroborate prior findings from traditional controlled experiments. We also prototype a decoder for a movement initiation detection task to demonstrate the efficacy of our pipeline as a source of training data for brain-computer interfacing applications.
Our work addresses the unique data analysis challenges in studying naturalistic human behaviors, and contributes methods that may generalize to other neural recording modalities beyond ECoG. We publicly release our curated dataset, providing a resource to study naturalistic neural and behavioral variability at a scale not previously available.

Submitted 19 June, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

arXiv:1911.08335 [pdf, other]

Generative Audio Synthesis with a Parametric Model

Authors: Krishna Subramani, Alexandre D'Hooge, Preeti Rao

Abstract: Use a parametric representation of audio to train a generative model in the interest of obtaining more flexible control over the generated sound.

Submitted 15 November, 2019; originally announced November 2019.

Comments: ISMIR 2019 Late Breaking/Demo

arXiv:1906.08916 [pdf, other]

Understanding and Classifying Cultural Music Using Melodic Features Case Of Hindustani, Carnatic And Turkish Music

Authors: Amruta Vidwans, Prateek Verma, Preeti Rao

Abstract: We present a melody based classification of musical styles by exploiting the pitch and energy based characteristics derived from the audio signal. Three prominent musical styles were chosen which have improvisation as integral part with similar melodic principles, theme, and structure of concerts namely, Hindustani, Carnatic and Turkish music. Listeners of one or more of these genres can discriminate between these based on the melodic contour alone. Listening tests were carried out using melodic attributes alone, on similar melodic pieces with respect to raga/makam, and removing any instrumentation cue to validate our hypothesis that style distinction is evident in the melody. Our method is based on finding a set of highly discriminatory features, derived from musicology, to capture distinct characteristics of the melodic contour. Behavior in terms of transitions of the pitch contour, the presence of micro-tonal notes and the nature of variations in the vocal energy are exploited. The automatically classified style labels are found to correlate well with subjective listening judgments. This was verified by using statistical tests to compare the labels from subjective and objective judgments. The melody based features, when combined with timbre based features, were seen to improve the classification performance.

Submitted 20 June, 2019; originally announced June 2019.

Comments: The work appeared in the 3rd CompMusic Workshop for Developing Computational models for the Discovery of the World's Music held at IIT Madras at Chennai in 2013

arXiv:1904.03710 [pdf, other]

Planar Geometry and Image Recovery from Motion-Blur

Authors: Kuldeep Purohit, Subeesh Vasu, M. Purnachandra Rao, A. N. Rajagopalan

Abstract: Existing works on motion deblurring either ignore the effects of depth-dependent blur or work with the assumption of a multi-layered scene wherein each layer is modeled in the form of fronto-parallel plane. In this work, we consider the case of 3D scenes with piecewise planar structure i.e., a scene that can be modeled as a combination of multiple planes with arbitrary orientations. We first propose an approach for estimation of normal of a planar scene from a single motion blurred observation. We then develop an algorithm for automatic recovery of number of planes, the parameters corresponding to each plane, and camera motion from a single motion blurred image of a multiplanar 3D scene. Finally, we propose a first-of-its-kind approach to recover the planar geometry and latent image of the scene by adopting an alternating minimization framework built on our findings. Experiments on synthetic and real data reveal that our proposed method achieves state-of-the-art results.

Submitted 6 February, 2022; v1 submitted 7 April, 2019; originally announced April 2019.

arXiv:1807.11138 [pdf, other]

Audio segmentation based on melodic style with hand-crafted features and with convolutional neural networks

Authors: Amruta Vidwans, Nachiket Deo, Preeti Rao

Abstract: We investigate methods for the automatic labeling of the taan section, a prominent structural component of the Hindustani Khayal vocal concert. The taan contains improvised raga-based melody rendered in the highly distinctive style of rapid pitch and energy modulations of the voice. We propose computational features that capture these specific high-level characteristics of the singing voice in the polyphonic context. The extracted local features are used to achieve classification at the frame level via a trained multilayer perceptron (MLP) network, followed by grouping and segmentation based on novelty detection. We report high accuracies with reference to musician annotated taan sections across artists and concerts. We also compare the performance obtained by the compact specialized features with frame-level classification via a convolutional neural network (CNN) operating directly on audio spectrogram patches for the same task. While the relatively simple architecture we experiment with does not quite attain the classification accuracy of the hand-crafted features, it provides for a performance well above chance with interesting insights about the ability of the network to learn discriminative features effectively from labeled data.

Submitted 29 July, 2018; originally announced July 2018.

Comments: This work was done in 2015 at Indian Institute of Technology, Bombay, as a part of the ERC grant agreement 267583 (CompMusic) project

Showing 1–22 of 22 results for author: Rao, P