Showing 1–5 of 5 results for author: Caceres, J

Searching in archive eess.
  1. arXiv:2403.10493  [pdf, other]

    cs.SD eess.AS eess.SP

    MusicHiFi: Fast High-Fidelity Stereo Vocoding

    Authors: Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan

    Abstract: Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fide…

    Submitted 20 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

  2. arXiv:2303.06475  [pdf, other]

    eess.AS cs.CL

    Transcription free filler word detection with Neural semi-CRFs

    Authors: Ge Zhu, Yujia Yan, Juan-Pablo Caceres, Zhiyao Duan

    Abstract: Non-linguistic filler words, such as "uh" or "um", are prevalent in spontaneous speech and serve as indicators for expressing hesitation or uncertainty. Previous works for detecting certain non-linguistic filler words are highly dependent on transcriptions from a well-established commercial automatic speech recognition (ASR) system. However, certain ASR systems are not universally accessible from…

    Submitted 11 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  3. arXiv:2203.15135  [pdf, other]

    cs.CL cs.SD eess.AS

    Filler Word Detection and Classification: A Dataset and Benchmark

    Authors: Ge Zhu, Juan-Pablo Caceres, Justin Salamon

    Abstract: Filler words such as 'uh' or 'um' are sounds or words people use to signal they are pausing to think. Finding and removing filler words from recordings is a common and tedious task in media editing. Automatically detecting and classifying filler words could greatly aid in this task, but few studies have been published on this problem to date. A key reason is the absence of a dataset with annotated…

    Submitted 1 July, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: To appear at Interspeech 2022

  4. arXiv:2110.02360  [pdf, other]

    eess.AS cs.SD

    Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet

    Authors: Max Morrison, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo

    Abstract: Modifying the pitch and timing of an audio signal are fundamental audio editing operations with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis. Thus far, methods for pitch-shifting and time-stretching that use digital signal processing (DSP) have been favored over deep learning approaches due to their speed and relatively higher quality.…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  5. arXiv:2102.08328  [pdf, other]

    eess.AS cs.LG cs.SD

    Context-Aware Prosody Correction for Text-Based Speech Editing

    Authors: Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo

    Abstract: Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural sounding text-bas…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: To appear in proceedings of ICASSP 2021