Search | arXiv e-print repository

Safe Force/Position Tracking Control via Control Barrier Functions for Floating Base Mobile Manipulator Systems

Authors: Maryam Sharifi, Shahab Heshmati-Alamdari

Abstract: This paper introduces a safe force/position tracking control strategy designed for Free-Floating Mobile Manipulator Systems (MMSs) engaging in compliant contact with planar surfaces. The strategy integrates the Control Barrier Function (CBF) to manage operational limitations and safety concerns. It addresses safety-critical aspects at both the kinematic and the dynamic level, such as manipulator joint limits, system velocity constraints, and inherent system dynamic uncertainties. The proposed strategy remains robust to uncertainties in the MMS dynamic model, external disturbances, and variations in the contact stiffness model. The proposed control method has low computational demand, which eases implementation on onboard computing systems and supports real-time operation. Simulation results verify the strategy's efficacy, reflecting enhanced system performance and safety.

Submitted 21 April, 2024; originally announced April 2024.

Comments: Accepted for presentation at the European Control Conference (ECC) 2024, Stockholm, Sweden
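
For readers unfamiliar with the term, the condition below is the generic control barrier function inequality as it usually appears in the CBF literature. It is a sketch for orientation only, not the formulation used in this paper, which the listing does not give. The control-affine model, the safe set $\mathcal{C}$, the function $h$, the input set $U$, and the class-$\mathcal{K}$ function $\alpha$ are all notation assumed here.

```latex
\begin{align*}
\dot{x} &= f(x) + g(x)\,u, \qquad \mathcal{C} = \{\, x : h(x) \ge 0 \,\}, \\
\sup_{u \in U} \big[\, L_f h(x) + L_g h(x)\,u \,\big] &\ge -\alpha\big(h(x)\big) \quad \text{for all } x \in \mathcal{C}.
\end{align*}
```

Under this condition, any controller that picks $u$ satisfying the inequality keeps trajectories that start in $\mathcal{C}$ inside $\mathcal{C}$. Constraints such as joint limits or velocity bounds would each be encoded as their own $h$.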

arXiv:2306.12925 [pdf, other]

AudioPaLM: A Large Language Model That Can Speak and Listen

Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

Submitted 22 June, 2023; originally announced June 2023.

Comments: Technical report

arXiv:2305.09636 [pdf, other]

SoundStorm: Efficient Parallel Audio Generation

Authors: Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

Abstract: We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.

Submitted 16 May, 2023; originally announced May 2023.

arXiv:2304.10892 [pdf, other]

doi 10.1145/3578356.3592578

Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

Authors: Mehran Salmani, Saeid Ghafouri, Alireza Sanaee, Kamran Razavi, Max Mühlhäuser, Joseph Doyle, Pooyan Jamshidi, Mohsen Sharifi

Abstract: The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of requests, imposing changes in their computing resources. Failing to right-size computing resources results in either violations of latency service level objectives (SLOs) or wasted computing resources. Adapting to dynamic workloads while considering all the pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose InfAdapter, which proactively selects a set of ML model variants with their resource allocations to meet the latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter decreases SLO violations and costs by up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).

Submitted 24 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

arXiv:2302.03540 [pdf, other]

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Authors: Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour

Abstract: We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.

Submitted 7 February, 2023; originally announced February 2023.

arXiv:2301.11325 [pdf, other]

MusicLM: Generating Music From Text

Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.

Submitted 26 January, 2023; originally announced January 2023.

Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

arXiv:2209.03143 [pdf, other]

AudioLM: a Language Modeling Approach to Audio Generation

Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

Abstract: We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.

Submitted 25 July, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

arXiv:2202.07273 [pdf, other]

SpeechPainter: Text-conditioned Speech Inpainting

Authors: Zalán Borsos, Matt Sharifi, Marco Tagliasacchi

Abstract: We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.

Submitted 30 March, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: Submitted to Interspeech 2022

arXiv:2109.13832 [pdf, other]

Compositional Construction of Abstractions for Infinite Networks of Switched Systems

Authors: Maryam Sharifi, Abdalla Swikir, Navid Noroozi, Majid Zamani

Abstract: We construct compositional continuous approximations for an interconnection of infinitely many discrete-time switched systems. An approximation (known as abstraction) is itself a continuous-space system, which can be used as a replacement of the original (known as concrete) system in a controller design process. Having synthesized a controller for the abstract system, the controller is refined to a more detailed controller for the concrete system. To quantify the mismatch between the output trajectory of the approximation and that of the original system, we use the notion of so-called simulation functions. In particular, each subsystem in the concrete network and its corresponding one in the abstract network are related through a local simulation function. We show that if the local simulation functions satisfy a certain small-gain type condition developed for a network of infinitely many subsystems, then the aggregation of the individual simulation functions provides an overall simulation function between the overall abstraction and the concrete network. For a network of linear switched systems, we systematically construct local abstractions and local simulation functions, where the required conditions are expressed in terms of linear matrix inequalities and can be efficiently computed. We illustrate the effectiveness of our approach through an application to frequency control in a power grid with a switched (i.e. time-varying) topology.

Submitted 28 September, 2021; originally announced September 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2101.08873

arXiv:2109.07121 [pdf, other]

Enhancing Data-Driven Reachability Analysis using Temporal Logic Side Information

Authors: Amr Alanwar, Frank J. Jiang, Maryam Sharifi, Dimos V. Dimarogonas, Karl H. Johansson

Abstract: This paper presents algorithms for performing data-driven reachability analysis under temporal logic side information. In certain scenarios, the data-driven reachable sets of a robot can be prohibitively conservative due to the inherent noise in the robot's historical measurement data. In the same scenarios, we often have side information about the robot's expected motion (e.g., limits on how much a robot can move in one time step) that could be useful for further specifying the reachability analysis. In this work, we show that if we can model this side information using a signal temporal logic (STL) fragment, we can constrain the data-driven reachability analysis and safely limit the conservatism of the computed reachable sets. Moreover, we provide formal guarantees that, even after incorporating side information, the computed reachable sets still properly over-approximate the robot's future states. Lastly, we empirically validate the practicality of the over-approximation by computing constrained, data-driven reachable sets for the Small-Vehicles-for-Autonomy (SVEA) hardware platform in two driving scenarios.

Submitted 30 March, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: Accepted at the IEEE International Conference on Robotics and Automation (ICRA 2022)

arXiv:2103.15604 [pdf, ps, other]

Higher Order Convergent Control Barrier Functions for Leader-Follower Multi-Agent Systems under STL Tasks

Authors: Maryam Sharifi, Dimos V. Dimarogonas

Abstract: This paper presents control strategies based on time-varying convergent higher order control barrier functions for a class of leader-follower multi-agent systems under signal temporal logic (STL) tasks. Each agent is assigned a local STL task which may be dependent on the behavior of agents involved in other tasks. The leader has knowledge of the associated tasks and controls the performance of the subgroup of involved agents. Robust solutions for the task satisfaction, based on the leader's access to the follower agents' states, are suggested. Our approach finds solutions to guarantee the satisfaction of STL tasks independent of the agents' initial conditions.

Submitted 9 October, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

arXiv:2103.00986 [pdf, ps, other]

Fixed-Time Convergent Control Barrier Functions for Coupled Multi-Agent Systems Under STL Tasks

Authors: Maryam Sharifi, Dimos V. Dimarogonas

Abstract: This paper presents a control strategy based on a new notion of time-varying fixed-time convergent control barrier functions (TFCBFs) for a class of coupled multi-agent systems under signal temporal logic (STL) tasks. In this framework, each agent is assigned a local STL task regardless of the tasks of other agents. Each task may be dependent on the behavior of other agents, which may cause conflicts in the satisfaction of all tasks. Our approach finds a robust solution to guarantee the fixed-time satisfaction of STL tasks in a least violating way and independent of the agents' initial conditions in the presence of undesired violation effects of the neighbor agents. In particular, the robust performance of the task satisfactions can be adjusted in a user-specified way.

Submitted 29 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: Accepted in ECC 2021

arXiv:2101.10627 [pdf, ps, other]

Robust Finite-Time Consensus Subject to Unknown Communication Time Delays Based on Delay-Dependent Criteria

Authors: Maryam Sharifi

Abstract: In this paper, robust finite-time consensus of a group of nonlinear multi-agent systems in the presence of communication time delays is considered. In particular, appropriate delay-dependent strategies which are less conservative are suggested. Sufficient conditions for finite-time consensus in the presence of deterministic and stochastic disturbances are presented. The communication delays don't need to be time invariant, uniform, symmetric, or even known. The only required condition is that all delays satisfy a known upper bound. The consensus algorithm is appropriate for agents with partial access to neighbor agents' signals. The Lyapunov-Razumikhin theorem for finite-time convergence is used to prove the results. Simulation results on a group of mobile robot manipulators as the agents of the system are presented.

Submitted 26 January, 2021; originally announced January 2021.

arXiv:2101.08873 [pdf, ps, other]

Compositional Construction of Abstractions for Infinite Networks of Discrete-Time Switched Systems

Authors: Maryam Sharifi, Abdalla Swikir, Navid Noroozi, Majid Zamani

Abstract: In this paper, we develop a compositional scheme for the construction of continuous approximations for interconnections of infinitely many discrete-time switched systems. An approximation (also known as abstraction) is itself a continuous-space system, which can be used as a replacement of the original (also known as concrete) system in a controller design process. Having designed a controller for the abstract system, it is refined to a more detailed one for the concrete system. We use the notion of so-called simulation functions to quantify the mismatch between the original system and its approximation. In particular, each subsystem in the concrete network and its corresponding one in the abstract network are related through a notion of local simulation functions. We show that if the local simulation functions satisfy certain small-gain type conditions developed for a network containing infinitely many subsystems, then the aggregation of the individual simulation functions provides an overall simulation function quantifying the error between the overall abstraction network and the concrete one. In addition, we show that our methodology results in a scale-free compositional approach for any finite-but-arbitrarily large networks obtained from truncation of an infinite network. We provide a systematic approach to construct local abstractions and simulation functions for networks of linear switched systems. The required conditions are expressed in terms of linear matrix inequalities that can be efficiently computed. We illustrate the effectiveness of our approach through an application to AC islanded microgrids.

Submitted 21 January, 2021; originally announced January 2021.

arXiv:2005.13101 [pdf, other]

State Estimation-Based Robust Optimal Control of Influenza Epidemics in an Interactive Human Society

Authors: Vahid Azimi, Mojtaba Sharifi, Seyed Fakoorian, Thang Tien Nguyen, Van Van Huynh

Abstract: This paper presents a state estimation-based robust optimal control strategy for influenza epidemics in an interactive human society in the presence of modeling uncertainties. Interactive society is influenced by the random entrance of individuals from other human societies whose effects can be modeled as a non-Gaussian noise. Since only the number of exposed and infected humans can be measured, states of the influenza epidemics are first estimated by an extended maximum correntropy Kalman filter (EMCKF) to provide a robust state estimation in the presence of the non-Gaussian noise. An online quadratic program (QP) optimization is then synthesized subject to a robust control Lyapunov function (RCLF) to minimize susceptible and infected humans, while minimizing and bounding the rates of vaccination and antiviral treatment. The joint QP-RCLF-EMCKF meets multiple design specifications such as state estimation, tracking, pointwise control optimality, and robustness to parameter uncertainty and state estimation errors that have not been achieved simultaneously in previous studies. The uniform ultimate boundedness (UUB)/convergence of error trajectories is guaranteed using a Lyapunov stability argument. The soundness of the proposed approach is validated on the influenza epidemics of an interactive human society with a population of 16000. Simulation results show that the QP-RCLF-EMCKF achieves appropriate tracking and state estimation performance. The robustness of the proposed controller is finally illustrated in the presence of modeling error and non-Gaussian noise.

Submitted 11 November, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

arXiv:2002.01322 [pdf, other]

Training Keyword Spotters with Limited and Synthesized Speech Data

Authors: James Lin, Kevin Kilgour, Dominik Roblek, Matthew Sharifi

Abstract: With the rise of low power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts in the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or low level features such as MFCCs, we use a pre-trained speech embedding model trained to extract useful features for keyword spotting models. Using this speech embedding, we show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples. We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.

Submitted 31 January, 2020; originally announced February 2020.

arXiv:1910.11664 [pdf, other]

doi 10.1109/TASLP.2020.2982285

SPICE: Self-supervised Pitch Estimation

Authors: Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, Mihajlo Velimirović

Abstract: We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. The key observation is that pitch shift maps to a simple translation when the audio signal is analysed through the lens of the constant-Q transform (CQT). We design a self-supervised task by feeding two shifted slices of the CQT to the same convolutional encoder, and require that the difference in the outputs is proportional to the corresponding difference in pitch. In addition, we introduce a small model head on top of the encoder, which is able to determine the confidence of the pitch estimate, so as to distinguish between voiced and unvoiced audio. Our results show that the proposed method is able to estimate pitch at a level of accuracy comparable to fully supervised models, both on clean and noisy audio samples, although it does not require access to large labeled datasets.

Submitted 4 September, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: Accepted to IEEE Transactions on Audio, Speech and Language Processing

Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118-1128, 2020
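
The abstract's key observation, that a pitch shift becomes a translation along the CQT frequency axis, follows from the CQT's logarithmic bin spacing. The sketch below is only a numerical illustration of that property, not the SPICE model or its training loss. The helper names, the lowest bin frequency `f_min`, and the resolution `bins_per_octave` are hypothetical values chosen for the example.

```python
import math

def cqt_bin(freq_hz, f_min=32.70, bins_per_octave=24):
    """Fractional CQT bin index of a frequency.

    Assumes geometrically spaced bins: f_k = f_min * 2 ** (k / bins_per_octave).
    f_min and bins_per_octave are illustrative values, not taken from the paper.
    """
    return bins_per_octave * math.log2(freq_hz / f_min)

def shift_pitch(freq_hz, semitones):
    """Frequency after shifting by a number of equal-tempered semitones."""
    return freq_hz * 2.0 ** (semitones / 12.0)

# Shift three unrelated starting pitches by the same interval (3 semitones)
# and measure how far each one moves along the CQT bin axis.
start_freqs = (110.0, 220.0, 523.25)
offsets = [cqt_bin(shift_pitch(f, 3)) - cqt_bin(f) for f in start_freqs]

# Every offset equals bins_per_octave * 3 / 12 bins, regardless of start pitch.
expected_offset = 24 * 3 / 12
```

Because the offset depends only on the interval and not on the starting pitch, two slices of the same CQT frame cut at different bin offsets behave like the same sound at two pitches whose difference is known. That is what lets the difference between the two encoder outputs be supervised without labels.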

arXiv:1812.08466 [pdf, other]

Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

Authors: Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, Matthew Sharifi

Abstract: We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fréchet Inception Distance (FID) metric used to evaluate generative image models to the audio domain. FAD is validated using a wide variety of artificial distortions and is compared to the signal based metrics signal to distortion ratio (SDR), cosine distance and magnitude L2 distance. We show that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than either SDR, cosine distance or magnitude L2 distance, with correlation coefficients of 0.39, -0.15 and -0.01 respectively.

Submitted 17 January, 2019; v1 submitted 20 December, 2018; originally announced December 2018.
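
As background for the FID adaptation mentioned in the abstract, the Fréchet distance between two multivariate Gaussians has a well-known closed form, sketched below. The subscripts $b$ (background or reference set) and $e$ (evaluated set) are notation assumed here, with $\mu$ and $\Sigma$ the mean and covariance of embedding vectors computed on each set. The listing does not say which embedding model the paper uses.

```latex
\[
\mathrm{FAD}
  = \lVert \mu_b - \mu_e \rVert_2^{2}
  + \operatorname{tr}\!\left( \Sigma_b + \Sigma_e - 2\,(\Sigma_b \Sigma_e)^{1/2} \right)
\]
```

The score is zero when the two Gaussians coincide and grows as the embedding statistics of the evaluated audio drift from the reference statistics. No per-clip reference signal is required, which is the sense in which the metric is reference-free.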

arXiv:1811.00006 [pdf, other]

Low-Dimensional Bottleneck Features for On-Device Continuous Speech Recognition

Authors: David B. Ramsay, Kevin Kilgour, Dominik Roblek, Matthew Sharifi

Abstract: Low power digital signal processors (DSPs) typically have a very limited amount of memory in which to cache data. In this paper we develop efficient bottleneck feature (BNF) extractors that can be run on a DSP, and retrain a baseline large-vocabulary continuous speech recognition (LVCSR) system to use these BNFs with only a minimal loss of accuracy. The small BNFs allow the DSP chip to cache more audio features while the main application processor is suspended, thereby reducing the overall battery usage. Our presented system is able to reduce the footprint of standard, fixed point DSP spectral features by a factor of 10 without any loss in word error rate (WER) and by a factor of 64 with only a 5.8% relative increase in WER.

Submitted 31 October, 2018; originally announced November 2018.

Comments: Submitted to ICASSP 2019

arXiv:1711.10958 [pdf, other]

Now Playing: Continuous low-power music recognition

Authors: Blaise Agüera y Arcas, Beat Gfeller, Ruiqi Guo, Kevin Kilgour, Sanjiv Kumar, James Lyon, Julian Odell, Marvin Ritter, Dominik Roblek, Matthew Sharifi, Mihajlo Velimirović

Abstract: Existing music recognition applications require a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile device and automatically recognizes music without user interaction. To reduce battery consumption, a small music detector runs continuously on the mobile device's DSP chip and wakes up the main application processor only when it is confident that music is present. Once woken, the recognizer on the application processor is provided with a few seconds of audio which is fingerprinted and compared to the stored fingerprints in the on-device fingerprint database of tens of thousands of songs. Our presented system, Now Playing, has a daily battery usage of less than 1% on average, respects user privacy by running entirely on-device and can passively recognize a wide range of music.

Submitted 29 November, 2017; originally announced November 2017.

Comments: Authors are listed in alphabetical order by last name

Showing 1–20 of 20 results for author: Sharifi, M