HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: manyfoot
  • failed: pgf-pie

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY-SA 4.0
arXiv:2309.03378v3 [cs.CL] 20 Mar 2024

RoDia: A New Dataset for Romanian Dialect Identification from Speech

Codruţ Rotaru11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT    Nicolae Cătălin Ristea1,212{}^{1,2}start_FLOATSUPERSCRIPT 1 , 2 end_FLOATSUPERSCRIPT    Radu Tudor Ionescu11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT
11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTUniversity of Bucharest, Romania
22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTNational University of Science and Technology Politehnica Bucharest, Romania
Corresponding author: [email protected]
Abstract

We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score of 59.83%percent59.8359.83\%59.83 % and a micro F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score of 62.08%percent62.0862.08\%62.08 %, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.

RoDia: A New Dataset for Romanian Dialect Identification from Speech


Codruţ Rotaru11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTand Nicolae Cătălin Ristea1,212{}^{1,2}start_FLOATSUPERSCRIPT 1 , 2 end_FLOATSUPERSCRIPTand Radu Tudor Ionescu11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT 11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTUniversity of Bucharest, Romania 22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTNational University of Science and Technology Politehnica Bucharest, Romania Corresponding author: [email protected]

1 Introduction

Spoken dialect identification emerged as a challenging task aiming to achieve a fine-grained distinction between varieties of a certain language, having similar implications to spoken language identification Barnard et al. (2014); Kimanuka et al. (2024); Ma et al. (2007). Despite being a more delicate task, spoken dialect identification received comparatively lower attention, most of it being devoted to dialect identification for widely spoken languages, such as English Weinberger and Kunath (2011), Chinese Zhang et al. (2022), and Arabic Ali et al. (2017, 2019); Shon et al. (2020). Spoken dialect identification for low-resource languages, such as Swiss German Dogan-Schönberger et al. (2021); Plüss et al. (2023) and Finnish Hämäläinen et al. (2021), has remained relatively underexplored Ranathunga et al. (2023); Barnard et al. (2014); Hämäläinen et al. (2021). Different from prior studies, we focus on spoken language identification in Romanian, a low-resource language characterized by its intricate dialectal variations within the country of Romania. Romanian, a Romance language with Latin roots, boasts a rich linguistic landscape shaped by historical, geographical, and sociocultural factors Barbu-Mititelu et al. (2018). However, despite its linguistic complexity, Romanian remains a low-resource language, with limited studies dedicated to understanding its regional linguistic diversity. This scarcity of resources is not unique to Romanian, numerous other languages around the world having similar challenges due to their lower visibility on the global linguistic stage Ranathunga et al. (2023); Barnard et al. (2014); Hämäläinen et al. (2021). Notably, the VarDial workshop is one of the main drivers for growing the interest around language variety and dialect identification, through the organization of multiple shared tasks each year Aepli et al. (2022); Chakravarthi et al. (2021); Gaman et al. (2020); Zampieri et al. (2019).

Due to the success of deep learning frameworks in speech processing Mehrish et al. (2023), researchers started to employ such methods in the area of low-resource languages Chan and Lane (2015); Al-Ghezi et al. (2023). This has led to a growing need for resources on low-resource languages. Considering dialect identification datasets across different languages, we can distinguish between two types of resources: text-based datasets Bouamor et al. (2018); Butnaru and Ionescu (2019); Francom et al. (2014); Găman et al. (2023); Găman and Ionescu (2022) and speech-based datasets Ali et al. (2017, 2019); Shon et al. (2020); Dogan-Schönberger et al. (2021); Plüss et al. (2023); Hämäläinen et al. (2021). While various languages have benefited from text-based resources that leverage written materials capturing linguistic variations, the auditory dimension of dialects adds an intricate layer of complexity. Text data, although valuable, might not fully encapsulate the nuanced phonetic and prosodic characteristics that are pivotal in dialect differentiation. In contrast to text datasets, audio datasets Ali et al. (2017, 2019); Shon et al. (2020); Dogan-Schönberger et al. (2021); Plüss et al. (2023) offer a more holistic representation, capturing not only the lexical disparities, but also the subtle intonations and accents inherent in speech.

Refer to caption

Figure 1: The administrative regions of Romania and the dominant dialect spoken within each region. RoDia is the first benchmark to contain samples representing these five Romanian dialects.

To the best of our knowledge, RoDia is the first dataset to tackle spoken dialect identification in the Romanian landscape in accordance with historical, geographical, and sociocultural factors, encouraging the research in this low-resource language. Although there are two text datasets addressing Romanian dialect identification, MOROCO Butnaru and Ionescu (2019) and MOROCO-Tweets Găman and Ionescu (2022), these cover only two dialects: Romanian (equivalent to the Muntenesc dialect) and Moldavian (Moldovenesc). In contrast, our dataset is focused on speech and covers five Romanian dialects, as shown in Figure 1. We underline that the extra dialects, namely Ardelenesc, Bănăţean, and Oltenesc, are very well characterized by phonetic differences captured only in speech. This explains why MOROCO Butnaru and Ionescu (2019) and MOROCO-Tweets Găman and Ionescu (2022) only contain text samples from the other two dialects.

The number of audio datasets available in Romanian is rather low Avram et al. (2022); Georgescu et al. (2020), confirming that Romanian is indeed a low-resource language. We underline that existing datasets comprising Romanian speech samples are mainly focused on automatic speech recognition, ignoring the diversity of dialects within the region. In contrast, our dataset is specifically designed to represent five distinct dialects spoken in Romania: Muntenesc (accepted as the official language of Romania), Ardelenesc, Moldovenesc, Oltenesc, and Bănăţean111We refer to the original (untranslated) dialect names, since most of them have no translation in English..

2 Dataset

Data collection and annotation. We collected the vast majority of the audio samples by gathering interviews and shows from local TV channels from all five regions considered in the dataset (see Figure 1). To obtain a clean dataset, we employed a rigorous selection process. First, we manually cropped the gathered audio files with respect to each speaker, e.g. we split an interview into multiple samples, such that each sample contains a single speaker. Next, we discard samples with interfering speakers and with a low perceived intelligibility. To make sure the label assignment is robust, we submitted all samples gathered from the TV channels to local annotators to validate the assigned labels. The manual validation process eliminates samples with a questionable dialect. In addition, a small proportion of data was acquired by recording citizens native to the five regions, who were asked to read some random texts from the Romanian Wikipedia, in their own dialect. We cropped the recorded samples to minimize the amount of silence at the start and the end of each audio sample. Upon curating the gathered data, we are left with a clean dataset containing 2,76827682,\!7682 , 768 audio samples, each having between 2.52.52.52.5 and 5.05.05.05.0 seconds of speech. The sample rate of all samples is 44.144.144.144.1 kHz.

We divide the dataset into 2,16421642,\!1642 , 164 samples for training and 604604604604 samples for testing, such that there is no overlap between speakers in training and test. Without separating the speakers between training and test, a model that overfits to certain speaker-specific features that are not related to dialect (e.g. pitch, loudness, rate, etc.) will reach high scores on the test set. However, these scores are unlikely to represent the actual performance of the model in a real-world scenario, where the audio samples come from unknown speakers. We thus consider that a more realistic evaluation is to separate the speakers. In the proposed setting, models that learn patterns related to speakers will not be able to capitalize on features unrelated to dialect identification.

Class #speakers Train Test
#samples SNR SRR #samples SNR SRR
Ardelenesc 47 427 28.8 36.4 119 30.5 38.5
Bănăţean 67 424 23.1 34.6 99 25.0 37.7
Moldovenesc 47 384 25.6 32.4 206 25.7 27.8
Muntenesc 64 603 29.0 35.3 106 26.7 37.8
Oltenesc 31 326 26.6 31.2 74 26.5 33.7
Overall 256 2164 26.9 34.2 604 26.8 34.0
Table 1: The number of training and test samples for each class in our dataset. For reference, we include the average SNR and SRR values for each category. The number of speakers per dialect is also provided.

There are five local annotators (one per region), who annotated all samples. The inter-rater Quadratic Weighted Kappa score is 0.83, indicating that the collected labels exhibit a substantial agreement among human evaluators. The average accuracy of our raters is 86%, indicating that the task is fairly easy for humans. Note that all raters are native Romanian speakers who speak the literary language, as well as at least one of the five dialects.

The collected samples comprise interviews and read speech found on YouTube. The read speech is actually gathered from videos where various speakers read from different books (without any influence or preparation from our side). The percentage of read speech is 21%.

Aside from dialect labels, our annotators also labeled each audio sample with the gender and age of each speaker. More precisely, the age and gender of each speaker is estimated by two annotators who had to analyze both video and audio modalities. The age annotation consists in classifying each speaker into a 10-year age group, after watching the video available for the respective speaker. In summary, our audio samples come with dialect, age and gender labels, enabling the study of additional tasks such as gender prediction or age estimation from speech.

Dataset statistics. For a more comprehensive view, we present both demographic information and audio quality statistics for our new dataset. In Table 1, we report the number of samples for each dialect in both train and test splits, as well as the signal-to-noise ratio (SNR) and signal-to-reverberation ratio (SRR). Regarding data quality, we note that the SNR and SRR values are consistently higher than 23232323 dB, highlighting that the audio samples have relatively low noise and reverberation. The Muntenesc dialect has the largest number of audio samples. This dialect was easy to collect, since it represents the literary language, which is often borrowed by speakers native to other regions. On the opposite side, the Oltenesc dialect is least represented, having only 400 audio samples. However, the distribution gap between the five classes is not high enough to pose significant challenges to machine learning models.

We present demographic information in Figure 2. In terms of demographic insights, RoDia exhibits a relatively balanced gender distribution, having 59.5%percent59.559.5\%59.5 % male and 40.5%percent40.540.5\%40.5 % female speakers. Aside from separating the speakers between training and test, we also made sure to have similar demographics for the train and test splits, reducing unnecessary distribution gaps. In summary, we consider that RoDia is a suitable resource for spoken dialect identification.

\pie\pie
Figure 2: Age and gender statistics for the RoDia dataset.
Model Ardelenesc Bănăţean Moldovenesc Muntenesc Oltenesc Overall F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
P𝑃Pitalic_P R𝑅Ritalic_R F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT P𝑃Pitalic_P R𝑅Ritalic_R F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT P𝑃Pitalic_P R𝑅Ritalic_R F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT P𝑃Pitalic_P R𝑅Ritalic_R F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT P𝑃Pitalic_P R𝑅Ritalic_R F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT Micro Macro
ResNet-18 59.45 73.94 65.91 47.54 58.58 52.48 73.98 62.13 67.54 44.11 56.60 49.58 88.00 29.72 44.44 58.94 55.99
AST 66.92 73.10 69.87 43.62 65.65 52.41 77.92 58.25 66.66 48.96 66.98 56.57 92.30 32.43 48.00 60.76 58.70
SepTr 67.71 72.26 69.91 41.31 69.69 51.87 81.75 58.73 68.36 53.33 67.92 59.75 85.18 31.08 45.54 61.42 59.08
wav2vec 2.0 68.00 71.42 69.67 43.13 66.66 52.38 77.63 60.67 68.10 54.34 70.75 61.47 88.88 32.24 47.52 62.08 59.83
Table 2: Spoken dialect identification results of ResNet-18 He et al. (2016), AST Gong et al. (2021), SepTr Ristea et al. (2022) and wav2vec 2.0 Baevski et al. (2020) on the RoDia test set. For a comprehensive evaluation, we report both per dialect and overall results. The best score on each column is highlighted in blue.

3 Experiments

Baseline methods. We compile a lineup of four state-of-the-art neural architectures for speech processing to form a set of competitive dialect identification baselines for our novel dataset. We consider both convolutional He et al. (2016) and transformer-based neural networks Gong et al. (2021); Ristea et al. (2022), as well as a hybrid architecture Baevski et al. (2020). We employ the ResNet-18 He et al. (2016) convolutional network, as it was previously used for audio classification tasks Ristea and Ionescu (2020). Additionally, we explore two transformer-based architectures, namely the Audio Spectrogram Transformer (AST) Gong et al. (2021) and the Separable Transformer (SepTr) Ristea et al. (2022), due to their high performance in audio classification. We also employ the wav2vec 2.0 model Baevski et al. (2020), which uses a hybrid architecture combining the advantages of both convolutional and transformer blocks. All models are trained in the multi-class setting, since the ground-truth labels are constructed in a similar manner: one audio sample belongs to only one dialect.

Preprocessing. For models operating in the time-frequency domain He et al. (2016); Gong et al. (2021); Ristea et al. (2022), we apply the Short-Time Fourier Transform with a window size of 512 and a hop size of 256. Then, we compute the square root of the magnitude, obtaining the spectrogram map. The other steps and parameters are exactly as described in the original papers. For wav2vec 2.0 Baevski et al. (2020), we apply the preprocessing steps described by the authors, which are mainly used for normalization. In all our experiments, we use the following data augmentation methods: noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment Park et al. (2019).

Evaluation metrics. We report the precision (P𝑃Pitalic_P), recall (R𝑅Ritalic_R), and F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT scores computed for each dialect. These metrics provide insights into the ability of models to correctly classify instances within each class. To quantify the overall performance, we aggregate the individual scores via the micro and macro F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT measures. The micro F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score combines the performance metrics across all examples, while the macro F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score offers a balanced average of the F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT scores across all classes.

Training environment. All models are optimized on an Nvidia GeForce GTX 3090 GPU with 24 GB of VRAM.

Hyperparameter tuning. For each model, we employed grid search to find the optimal learning rate (between 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT and 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT) and the optimal batch size (between 8888 and 128128128128 samples). We take the wav2vec 2.0 Baevski et al. (2020) model, which is pretrained on English data, and fine-tune it for 10101010 epochs on RoDia using a learning rate of 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT and mini-batches of 16161616 samples. We train ResNet-18, AST Gong et al. (2021) and SepTr Ristea et al. (2022) from scratch. The models are trained for 50505050 epochs with early stop**, using a learning rate of 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT and mini-batches of 32323232 samples. All models are optimized with the Adam optimizer Kingma and Ba (2015).

Refer to caption

Figure 3: Confusion matrix on the test set for the wav2vec 2.0 Baevski et al. (2020) model.

Dialect identification results. In Table 2, we present the spoken dialect identification results of the baseline models on the RoDia test set. The convolutional network, ResNet-18, obtains the lowest overall performance. Still, ResNet-18 reaches competitive F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT scores for the Bănăţean and Moldovenesc dialects. Unlike the other baselines, the ResNet-18 model struggles with the Muntenesc and Ardelenesc dialects, which explains its low overall performance. The transformer-based models, AST Gong et al. (2021) and SepTr Ristea et al. (2022), yield superior results, with a slight upper hand from the SepTr model. In terms of the overall F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT scores, the best model appears to be wav2vec 2.0 Baevski et al. (2020). However, the F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT scores per dialect seem to tell a slightly different story, since the wav2vec 2.0 is outperformed by at least one of the other models on four dialects: Ardelenesc, Bănăţean, Moldovenesc and Oltenesc. The competitive edge of wav2vec 2.0 lies in its ability to better identify the Muntenesc dialect. We underline that the audio samples were recorded in uncontrolled scenarios, so the reported results directly reflect the capability of systems in the real-world case.

To further assess the behavior of the best baseline model, namely wav2vec 2.0, we consider its confusion matrix illustrated in Figure 3. The confusion matrix reveals some interesting patterns. While the model tends to mislabel samples from the Oltenesc dialect, we observe that most of these mistakes are caused by a high confusion with the Bănăţean dialect. Since Banat and Oltenia are neighboring regions, there are several similarities between these two dialects. For instance, both dialects are characterized by the frequent use of the perfect simple tense, which is hardly encountered in the other Romanian dialects. Another noticeable problem is with the Moldovenesc dialect, which is often wrongly identified as Bănăţean and Muntenesc. The confusion between the Moldovenesc and Bănăţean dialects is caused by the fact these two dialects kept the form of words such as câne (dog), pâne (bread) and mâne (tomorrow), from the old Romanian. In the literary language, as well as the other dialects, these words are pronounced with an ‘i’ before the consonant ‘n’, as follows: câine, pâine and mâine. The confusion with the Muntenesc dialect can be attributed to the fact that some residents of the southern part of Moldavia lost some of the dialectal features, e.g. they prefer to use the word pantofi to refer to shoes, and the word papuci to refer to slippers, just as the residents of Muntenia. In contrast, the residents of the northern side of Moldavia regularly use papuci to refer to shoes, and şlapi to refer to slippers. In summary, the confusion matrix shows that Romanian dialect identification is not an easy task, requiring researchers to address specific issues in order to come up with more accurate models in the future.

The confusion matrix illustrated in Figure 3 also shows that the training data distribution does not affect wav2vec 2.0. For example, the Moldovenesc dialect is the second-least popular dialect in our dataset, but wav2vec places many of the test samples into the Moldovenesc class. Overall, the confusion matrix of wav2vec reflects the test data distribution, although the model was trained on a slightly different class distribution. This confirms that the imbalance is not high enough to bias models.

Dialect WER
Ardelenesc 31.5%
Bănăţean 30.2%
Moldovenesc 32.8%
Muntenesc 24.1%
Oltenesc 29.7%
Table 3: Word error rates (WER) of the Whisper-Large model Radford et al. (2023) on the five dialects from the RoDia dataset. The Whisper-Large model is trained on the literary Romanian language. The ASR transcripts are compared with manual transcripts to establish the performance of the Whisper-Large model.

Speech recognition results. We manually transcribed our data samples to test the performance of a state-of-the-art automatic speech recognition (ASR) system on RoDia. Then, we applied the open source Whisper-Large model Radford et al. (2023) on our test set and obtained the word error rates reported in Table 3. The Muntenesc dialect is almost identical to the literary language, explaining why it exhibits the lowest WER. The WER obtained by the Whisper-Large model for the Romanian language on the Common Voice dataset is 19.8%. The difference between the WER for the Muntenesc dialect and the WER reported on Common Voice can be attributed to the distribution gap between the RoDia and the Common Voice datasets. Considering the generally higher error rates for the other Romanian dialects shown in Table 3, we conclude that ASR for dialectal speech is more difficult. This justifies the utility of our novel dataset for ASR.

4 Conclusion

In this paper, we introduced RoDia, the first dataset for Romanian dialect identification from speech. Our dataset contains 2,76827682,\!7682 , 768 speech samples representing five Romanian dialects. The audio samples were manually annotated with dialect, age and gender labels, enabling the study of spoken dialect identification in a realistic scenario, where the speakers in the training and test splits are disjoint. We conducted experiments with four state-of-the-art speech processing models, establishing a range of baseline performance levels for future research.

5 Limitations

This work is focused on spoken Romanian dialect identification, but the performance levels of the considered approaches might be different on other languages. Due to our specific focus on the Romanian language, we did not evaluate the performance of the considered models across other languages. However, we consider that the evaluation on other languages is beyond the scope of the current study.

Another limitation of our work is the slightly limited number of samples with manual labels included in our corpus. This limitation is caused by scarcity of resources available online. Most of the Romanian video or audio samples available online use the literary language. Local news and content creators commonly use the literary language taught in schools. Hence, dialects are mostly used by rural residents, which often have no Internet access. This situation significantly limits the dialectal resources that are publicly available.

6 Ethics Statement

The manual labeling was carried out by volunteers who agreed to annotate the audio samples for free. Prior to the annotation, they also agreed to let us publish their labels along with the dataset. Our data is collected from YouTube, which resides in the public web domain. We note that the European regulations222https://eur-lex.europa.eu/eli/dir/2019/790/oj allow researchers to use data in the public web domain for non-commercial research purposes. Thus, we release our data and code under the CC BY-NC-SA 4.0 license333https://creativecommons.org/licenses/by-nc-sa/4.0/.

During data collection, we made sure the audio samples do not contain information that names or uniquely identifies individual people.

References

Appendix A Appendix

Peculiarities of Romanian dialects. As per Wikipedia444https://en.wikipedia.org/wiki/Romanian_dialects, the Romanian dialects are not easy to classify, and their classification is still highly debated by experts, who proposed various classifications, ranging from 2 to even 20 dialects. Since there is no standard classification, we underline that the dialects from RoDia are not unanimously considered as dialects by experts. In Romanian, these are called “grai”, which is translated (perhaps abusively) as “idiom” or “dialect” in English. Aside from phonetic differences, we note that there are a few hundred words that are specific to each such “grai”. For example, lists of such words, called regionalisms, are available online555http://regionalisme.ro. The dialects included in our dataset have lists comprising between 300 and 800 regionalisms. In summary, the Romanian dialects have several distinctive features, such as:

  • Phonetic differences, e.g. the word “ce” (what) is pronounced “ci” in the Moldovenesc dialect and “ce” in other dialects.

  • Regionalisms, e.g. the word “melon” is translated as “pepene” in the Muntenesc dialect, “harbuz” in the Moldovenesc dialect, “lubeniţă” in the Bănăţean and Oltenesc dialects, and “curcubete” or “lebeniţă” in the Ardelenesc dialect.

  • The addition of unnecessary dialect-specific words, e.g. “What time is it?” is normally translated as “Cât este ceasul?”, but in the Ardelenesc dialect, people commonly use “Oare cât este ceasul?” (which can be translated as “I wonder what time is it”). In general, it is common to use “Oare” when addressing a question to another person in the Ardelenesc dialect.

  • Preference for using a different past tense in the Oltenesc and Bănăţean dialects than in other Romanian dialects, e.g. for the phrase “I was”, speakers of the Oltenesc and Bănăţean dialects dialects say “fusei”, but the speakers of other Romanian dialects use “am fost”.

Dialect Population
Ardeal (without Banat) 5.5M
Moldova 4.2M
Muntenia 3.3M
Oltenia 2M
Banat 1.25M
Crişana 1.2M
Maramureş 0.46M
Table 4: Population size for seven of the largest regions in Romania. Our dataset includes representative dialects for the top five regions.

Criteria for choosing the five dialects. To establish the set of dialects for RoDia, we used two criteria. On the one hand, we aimed to include as many dialects as possible. On the other hand, we were limited by the low number of audio samples available for dialects spoken in small regions (by low populations). We thus selected the top five most popular dialects, which are representative for the regions depicted in Figure 1 from our paper. In Table 4, we provide the size of the population in each region of Romania corresponding to one of the top seven Romanian dialects. Dialects that are not included in RoDia correspond to smaller sub-regions, e.g. Crişana and Maramureş, for which it is even harder to collect sufficient audio samples.