Towards small and accurate convolutional neural networks for acoustic biodiversity monitoring

Authors: Serge Zaugg, Mike van der Schaar, Florence Erbs, Antonio Sanchez, Joan V. Castell, Emiliano Ramallo, Michel André

Abstract: Automated classification of animal sounds is a prerequisite for large-scale monitoring of biodiversity. Convolutional Neural Networks (CNNs) are among the most promising algorithms, but they are slow, often achieve poor classification in the field, and typically require large training data sets. Our objective was to design CNNs that are fast at inference time and achieve good classification performance while learning from moderate-sized data. Recordings from a rainforest ecosystem were used. The start and end points of sounds from 20 bird species were manually annotated. Spectrograms from 10-second segments were used as CNN input. We designed simple CNNs with a frequency unwrapping layer (SIMP-FU models) such that any output unit was connected to all spectrogram frequencies but only to a sub-region of time, the Receptive Field (RF). Our models allowed experimentation with different RF durations. Models either used the time-indexed labels that encode the start and end points of sounds or simpler segment-level labels. Models learning from time-indexed labels performed considerably better than their segment-level counterparts. The best classification performance was achieved by models with an intermediate RF duration of 1.5 seconds. The best SIMP-FU models achieved AUCs over 0.95 in 18 of 20 classes on the test set. On compact low-cost hardware the best SIMP-FU models evaluated up to seven times faster than real-time data acquisition. RF duration was a major driver of classification performance. The optimum of 1.5 s was in the same range as the duration of the sounds.
Our models achieved good classification performance while learning from moderate-sized training data. This is explained by the use of time-indexed labels during training and an adequately sized RF. Results confirm the feasibility of deploying small CNNs with good classification performance on compact low-cost devices.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2207.01547 [pdf, other]

Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)

Authors: Ariadna Sanchez, Alessio Falai, Ziyao Zhang, Orazio Angelini, Kayoko Yanagisawa

Abstract: An essential design decision for multilingual Neural Text-To-Speech (NTTS) systems is how to represent input linguistic features within the model. Looking at the wide variety of approaches in the literature, two main paradigms emerge: unified and separate representations. The former uses a shared set of phonetic tokens across languages, whereas the latter uses unique phonetic tokens for each language. In this paper, we conduct a comprehensive study comparing multilingual NTTS models trained with both representations. Our results reveal that the unified approach consistently achieves better cross-lingual synthesis with respect to both naturalness and accent. Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity. For this reason, we carry out an ablation study to understand the interaction of the representation type with the size of the token embedding. We find that the difference between the two paradigms only emerges above a certain threshold embedding size. This study provides strong evidence that unified representations should be the preferred paradigm when building multilingual NTTS systems.

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2207.01507 [pdf, other]

Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)

Authors: Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko Yanagisawa

Abstract: Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual corpora has emerged as a popular way of building voice-cloning-based Polyglot NTTS systems. In order to train these models, it is essential to understand how the composition of the training corpora affects the quality of multilingual speech synthesis. In this context, it is common to hear questions such as "Would including more Spanish data help my Italian synthesis, given the closeness of both languages?". Unfortunately, we found the existing literature on the topic incomplete in this regard. In the present work, we conduct an extensive ablation study aimed at understanding how various factors of the training corpora, such as language family affiliation, gender composition, and the number of speakers, contribute to the quality of Polyglot synthesis. Our findings include the observation that female speaker data are preferred in most scenarios, and that it is not always beneficial to have more speakers from the target language variant in the training corpus. The findings herein are informative for the process of data procurement and corpora building.

Submitted 4 July, 2022; originally announced July 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2011.11719 [pdf, other]

Explainable-by-design Semi-Supervised Representation Learning for COVID-19 Diagnosis from CT Imaging

Authors: Abel Díaz Berenguer, Hichem Sahli, Boris Joukovsky, Maryna Kvasnytsia, Ine Dirks, Mitchel Alioscha-Perez, Nikos Deligiannis, Panagiotis Gonidakis, Sebastián Amador Sánchez, Redona Brahimetaj, Evgenia Papavasileiou, Jonathan Cheung-Wai Chana, Fei Li, Shangzhen Song, Yixin Yang, Sofie Tilborghs, Siri Willems, Tom Eelbode, Jeroen Bertels, Dirk Vandermeulen, Frederik Maes, Paul Suetens, Lucas Fidon, Tom Vercauteren, David Robben, et al. (15 additional authors not shown)

Abstract: Our motivating application is a real-world problem: COVID-19 classification from CT imaging, for which we present an explainable Deep Learning approach based on a semi-supervised classification pipeline that employs variational autoencoders to extract efficient feature embeddings. We have optimized the architecture of two different networks for CT images: (i) a novel conditional variational autoencoder (CVAE) with a specific architecture that integrates the class labels inside the encoder layers and uses side information with shared attention layers for the encoder, which make the most of the contextual clues for representation learning, and (ii) a downstream convolutional neural network for supervised classification using the encoder structure of the CVAE. With the explainable classification results, the proposed diagnosis system is very effective for COVID-19 classification. Based on the promising results obtained qualitatively and quantitatively, we envisage a wide deployment of our developed technique in large-scale clinical studies. Code is available at https://git.etrovub.be/AVSP/ct-based-covid-19-diagnostic-tool.git.

Submitted 2 September, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

arXiv:2010.13731 [pdf, other]

doi 10.1007/978-3-030-61705-9_54

Dyslexia detection from EEG signals using SSA component correlation and Convolutional Neural Networks

Authors: Andrés Ortiz, Francisco J. Martinez-Murcia, Marco A. Formoso, Juan Luis Luque, Auxiliadora Sánchez

Abstract: Objective dyslexia diagnosis is not a straightforward task, since it is traditionally performed by means of the interpretation of different behavioural tests. Moreover, these tests are only applicable to readers. Early diagnosis therefore requires the use of specific tasks not only related to reading. Thus, the use of Electroencephalography (EEG) constitutes an alternative for an objective and early diagnosis that can be used with pre-readers. In this context, the extraction of relevant features from EEG signals is crucial for classification. However, the identification of the most relevant features is not straightforward, and predefined statistics in the time or frequency domain are not always discriminant enough. On the other hand, classical processing of EEG signals based on extracting EEG band frequency descriptors usually makes some assumptions about the raw signals that could cause information loss. In this work we propose an alternative for analysis in the frequency domain based on Singular Spectrum Analysis (SSA) to split the raw signal into components representing different oscillatory modes. Moreover, correlation matrices obtained for each component among EEG channels are classified using a Convolutional Neural Network.

Submitted 26 October, 2020; originally announced October 2020.

Comments: 11 pages, 7 figures. Submitted to conference

Journal ref: HAIS 2020: Hybrid Artificial Intelligent Systems. LNCS 12344. pp 655-664

arXiv:2007.15546 [pdf, other]

Comparative study of deep learning methods for the automatic segmentation of lung, lesion and lesion type in CT scans of COVID-19 patients

Authors: Sofie Tilborghs, Ine Dirks, Lucas Fidon, Siri Willems, Tom Eelbode, Jeroen Bertels, Bart Ilsen, Arne Brys, Adriana Dubbeldam, Nico Buls, Panagiotis Gonidakis, Sebastián Amador Sánchez, Annemiek Snoeckx, Paul M. Parizel, Johan de Mey, Dirk Vandermeulen, Tom Vercauteren, David Robben, Dirk Smeets, Frederik Maes, Jef Vandemeulebroucke, Paul Suetens

Abstract: Recent research on COVID-19 suggests that CT imaging provides useful information to assess disease progression and assist diagnosis, in addition to helping to understand the disease. There is an increasing number of studies that propose to use deep learning to provide fast and accurate quantification of COVID-19 using chest CT scans. The main tasks of interest are the automatic segmentation of lung and lung lesions in chest CT scans of confirmed or suspected COVID-19 patients. In this study, we compare twelve deep learning algorithms using a multi-center dataset, including both open-source and in-house developed algorithms. Results show that ensembling different methods can boost the overall test set performance for lung segmentation, binary lesion segmentation and multiclass lesion segmentation, resulting in mean Dice scores of 0.982, 0.724 and 0.469, respectively. The resulting binary lesions were segmented with a mean absolute volume error of 91.3 ml. In general, the task of distinguishing different lesion types was more difficult, with a mean absolute volume difference of 152 ml and mean Dice scores of 0.369 and 0.523 for consolidation and ground glass opacity, respectively. All methods perform binary lesion segmentation with an average volume error that is better than visual assessment by human raters, suggesting these methods are mature enough for a large-scale evaluation for use in clinical practice.

Submitted 10 January, 2022; v1 submitted 29 July, 2020; originally announced July 2020.

Comments: Updated acknowledgments

Showing 1–6 of 6 results for author: Sánchez, A