Showing 1–7 of 7 results for author: Katsouros, V

  1. arXiv:2406.15284  [pdf, other]

    cs.CL cs.SD eess.AS

    The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

    Authors: Georgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros

    Abstract: The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus o…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: To be presented at Interspeech 2024

  2. arXiv:2309.12242  [pdf, other]

    cs.SD cs.LG eess.AS

    Weakly-supervised Automated Audio Captioning via text only training

    Authors: Theodoros Kouzelis, Vassilis Katsouros

    Abstract: In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: DCASE Workshop 2023

  3. arXiv:2309.11140  [pdf, other]

    cs.SD cs.LG eess.AS

    Investigating Personalization Methods in Text to Music Generation

    Authors: Manos Plitsis, Theodoros Kouzelis, Georgios Paraskevopoulos, Vassilis Katsouros, Yannis Panagakis

    Abstract: In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on the overall system performance and a…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024, Examples at https://zelaki.github.io/

  4. arXiv:2306.00996  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

    Authors: Theodoros Kouzelis, Georgios Paraskevopoulos, Athanasios Katsamanis, Vassilis Katsouros

    Abstract: The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of alignment graph construction of CTC-based models using Weighted Finite State Transducers. The proposed w…

    Submitted 30 May, 2023; originally announced June 2023.

    Comments: Interspeech 2023

  5. arXiv:2301.00304  [pdf, other]

    cs.CL cs.SD eess.AS

    Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A case study for Modern Greek

    Authors: Georgios Paraskevopoulos, Theodoros Kouzelis, Georgios Rouvalis, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos

    Abstract: Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where diversity of training data is limited. In this work we propose M2DS2, a simple and sample-efficient finetuning strategy for large pretrained speech models, based on mixed source and target domain self-supervis…

    Submitted 31 December, 2022; originally announced January 2023.

  6. arXiv:2204.13437  [pdf, other]

    cs.SD cs.LG eess.AS

    Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

    Authors: Efthymios Georgiou, Kosmas Kritsis, Georgios Paraskevopoulos, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos

    Abstract: Recent deep learning Text-to-Speech (TTS) systems have achieved impressive performance by generating speech close to human parity. However, they suffer from training stability issues as well as incorrect alignment of the intermediate acoustic representation with the input text sequence. In this work, we introduce Regotron, a regularized version of Tacotron2 which aims to alleviate the training iss…

    Submitted 14 July, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

  7. arXiv:2204.00448  [pdf, other]

    cs.LG cs.AI cs.CL

    Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech Recognition

    Authors: Gerasimos Chatzoudis, Manos Plitsis, Spyridoula Stamouli, Athanasia-Lida Dimou, Athanasios Katsamanis, Vassilis Katsouros

    Abstract: Aphasia is a common speech and language disorder, typically caused by a brain injury or a stroke, that affects millions of people worldwide. Detecting and assessing Aphasia in patients is a difficult, time-consuming process, and numerous attempts to automate it have been made, the most successful using machine learning models trained on aphasic speech data. Like in many medical applications, aphas…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: 5 pages, 1 figure, submitted to INTERSPEECH 2022