RoDia: A New Dataset for Romanian Dialect Identification from Speech
Authors:
Codrut Rotaru,
Nicolae-Catalin Ristea,
Radu Tudor Ionescu
Abstract:
We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
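The abstract reports both a macro F1 and a micro F1 score. As a minimal sketch of how the two differ for a single-label, multi-class task such as dialect identification: macro F1 averages the per-class F1 scores with equal weight per class, while micro F1 pools the counts over all classes first. The function name `f1_scores` and the toy labels below are hypothetical illustrations, not taken from the RoDia dataset or the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p gains a false positive
            fn[t] += 1          # true class t gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of per-class F1 (every class counts equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 over counts pooled across classes (every sample counts equally).
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels; "A", "B", "C" are placeholders, not RoDia's regions.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "C", "C", "A"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.4f}, micro F1 = {micro_f1:.4f}")
```

In this single-label setting the micro F1 coincides with plain accuracy, so a macro score below the micro score, as in the abstract, suggests that some classes are recognized less well than others.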
Submitted 20 March, 2024; v1 submitted 6 September, 2023; originally announced September 2023.