Showing 1–5 of 5 results for author: Perz, B
  1. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  2. arXiv:2309.15292  [pdf, other]

    cs.LG eess.SP

    Scaling Representation Learning from Ubiquitous ECG with State-Space Models

    Authors: Kleanthis Avramidis, Dominika Kunc, Bartosz Perz, Kranti Adsul, Tiantian Feng, Przemysław Kazienko, Stanisław Saganowski, Shrikanth Narayanan

    Abstract: Ubiquitous sensing from wearable devices in the wild holds promise for enhancing human well-being, from diagnosing clinical conditions and measuring stress to building adaptive health promoting scaffolds. But the large volumes of data therein across heterogeneous contexts pose challenges for conventional supervised learning approaches. Representation Learning from biological signals is an emerging…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: Pre-print, currently under review

  3. arXiv:2202.08164  [pdf, other]

    eess.AS cs.CL cs.LG

    Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

    Authors: Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba

    Abstract: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filt…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted at ICASSP 2022

  4. arXiv:2101.05695  [pdf, other]

    eess.AS cs.SD

    EmoCat: Language-agnostic Emotional Voice Conversion

    Authors: Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba

    Abstract: Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with les…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: Submitted to IEEE ICASSP 2021

  5. arXiv:2011.05707  [pdf, other]

    eess.AS cs.CL cs.SD

    Low-resource expressive text-to-speech using data augmentation

    Authors: Goeric Huybrechts, Thomas Merritt, Giulia Comini, Bartek Perz, Raahil Shah, Jaime Lorenzo-Trueba

    Abstract: While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of su…

    Submitted 1 June, 2021; v1 submitted 11 November, 2020; originally announced November 2020.