-
Generative artificial intelligence in ophthalmology: multimodal retinal images for the diagnosis of Alzheimer's disease with convolutional neural networks
Authors:
I. R. Slootweg,
M. Thach,
K. R. Curro-Tafili,
F. D. Verbraak,
F. H. Bouwman,
Y. A. L. Pijnenburg,
J. F. Boer,
J. H. P. de Kwisthout,
L. Bagheriye,
P. J. González
Abstract:
Background/Aim. This study aims to predict Amyloid Positron Emission Tomography (AmyloidPET) status with multimodal retinal imaging and convolutional neural networks (CNNs) and to improve the performance through pretraining with synthetic data. Methods. Fundus autofluorescence, optical coherence tomography (OCT), and OCT angiography images from 328 eyes of 59 AmyloidPET positive subjects and 108 A…
▽ More
Background/Aim. This study aims to predict Amyloid Positron Emission Tomography (AmyloidPET) status with multimodal retinal imaging and convolutional neural networks (CNNs) and to improve the performance through pretraining with synthetic data. Methods. Fundus autofluorescence, optical coherence tomography (OCT), and OCT angiography images from 328 eyes of 59 AmyloidPET positive subjects and 108 AmyloidPET negative subjects were used for classification. Denoising Diffusion Probabilistic Models (DDPMs) were trained to generate synthetic images and unimodal CNNs were pretrained on synthetic data and finetuned on real data or trained solely on real data. Multimodal classifiers were developed to combine predictions of the four unimodal CNNs with patient metadata. Class activation maps of the unimodal classifiers provided insight into the network's attention to inputs. Results. DDPMs generated diverse, realistic images without memorization. Pretraining unimodal CNNs with synthetic data improved AUPR at most from 0.350 to 0.579. Integration of metadata in multimodal CNNs improved AUPR from 0.486 to 0.634, which was the best overall best classifier. Class activation maps highlighted relevant retinal regions which correlated with AD. Conclusion. Our method for generating and leveraging synthetic data has the potential to improve AmyloidPET prediction from multimodal retinal imaging. A DDPM can generate realistic and unique multimodal synthetic retinal images. Our best performing unimodal and multimodal classifiers were not pretrained on synthetic data, however pretraining with synthetic data slightly improved classification performance for two out of the four modalities.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems
Authors:
Philippe Gonzalez,
Zheng-Hua Tan,
Jan Østergaard,
Jesper Jensen,
Tommy Sonne Alstrøm,
Tobias May
Abstract:
The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but…
▽ More
The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown. Moreover, studies that investigated the effect of training dataset size did not control for the data diversity. It is thus unclear whether the performance improvement was due to the increased dataset size or diversity. Therefore, we systematically investigate the effect of training dataset size on the performance of popular state-of-the-art discriminative and diffusion-based speech enhancement systems. We control for the data diversity by using a fixed set of speech utterances, noise segments and binaural room impulse responses to generate datasets of different sizes. We find that the diffusion-based systems do not benefit from increasing the training dataset size as much as the discriminative systems. They perform the best relative to the discriminative systems with datasets of 10 h or less, but they are outperformed by the discriminative systems with datasets of 100 h or more.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Investigating the Design Space of Diffusion Models for Speech Enhancement
Authors:
Philippe Gonzalez,
Zheng-Hua Tan,
Jan Østergaard,
Jesper Jensen,
Tommy Sonne Alstrøm,
Tobias May
Abstract:
Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals…
▽ More
Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics while using fewer sampling steps, thus reducing the computational cost by a factor of four.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler
Authors:
Philippe Gonzalez,
Zheng-Hua Tan,
Jan Østergaard,
Jesper Jensen,
Tommy Sonne Alstrøm,
Tobias May
Abstract:
Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on t…
▽ More
Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule or the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not been applied to speech enhancement before. We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves superior performance at a smaller computational cost compared to a sampler commonly used for speech enhancement.
△ Less
Submitted 16 January, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Fuzzy Information Seeded Region Growing for Automated Lesions After Stroke Segmentation in MR Brain Images
Authors:
Mario Pascual González
Abstract:
In the realm of medical imaging, precise segmentation of stroke lesions from brain MRI images stands as a critical challenge with significant implications for patient diagnosis and treatment. Addressing this, our study introduces an innovative approach using a Fuzzy Information Seeded Region Growing (FISRG) algorithm. Designed to effectively delineate the complex and irregular boundaries of stroke…
▽ More
In the realm of medical imaging, precise segmentation of stroke lesions from brain MRI images stands as a critical challenge with significant implications for patient diagnosis and treatment. Addressing this, our study introduces an innovative approach using a Fuzzy Information Seeded Region Growing (FISRG) algorithm. Designed to effectively delineate the complex and irregular boundaries of stroke lesions, the FISRG algorithm combines fuzzy logic with Seeded Region Growing (SRG) techniques, aiming to enhance segmentation accuracy.
The research involved three experiments to optimize the FISRG algorithm's performance, each focusing on different parameters to improve the accuracy of stroke lesion segmentation. The highest Dice score achieved in these experiments was 94.2\%, indicating a high degree of similarity between the algorithm's output and the expert-validated ground truth. Notably, the best average Dice score, amounting to 88.1\%, was recorded in the third experiment, highlighting the efficacy of the algorithm in consistently segmenting stroke lesions across various slices.
Our findings reveal the FISRG algorithm's strengths in handling the heterogeneity of stroke lesions. However, challenges remain in areas of abrupt lesion topology changes and in distinguishing lesions from similar intensity brain regions. The results underscore the potential of the FISRG algorithm in contributing significantly to advancements in medical imaging analysis for stroke diagnosis and treatment.
△ Less
Submitted 20 November, 2023;
originally announced November 2023.
-
Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments
Authors:
Philippe Gonzalez,
Tommy Sonne Alstrøm,
Tobias May
Abstract:
The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing…
▽ More
The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can be used as a proxy for the difficulty of the test condition. This allows to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, the performance degrades the most in speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
△ Less
Submitted 8 November, 2023; v1 submitted 12 September, 2023;
originally announced September 2023.
-
On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems
Authors:
Philippe Gonzalez,
Tommy Sonne Alstrøm,
Tobias May
Abstract:
The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different duration, a batching strategy is required to handle variable size inputs during trainin…
▽ More
The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different duration, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive for a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these strategies on resource utilization and more importantly network performance is not well documented. This paper systematically investigates the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
△ Less
Submitted 31 March, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Modeling and Control of Combustion Phasing in Dual-Fuel Compression Ignition Engines
Authors:
Wenbo Sui,
Jorge Pulpeiro González,
Carrie M. Hall
Abstract:
Dual fuel engines can achieve high efficiencies and low emissions but also can encounter high cylinder-to-cylinder variations on multi-cylinder engines. In order to avoid these variations, they require a more complex method for combustion phasing control such as model-based control. Since the combustion process in these engines is complex, typical models of the system are complex as well and there…
▽ More
Dual fuel engines can achieve high efficiencies and low emissions but also can encounter high cylinder-to-cylinder variations on multi-cylinder engines. In order to avoid these variations, they require a more complex method for combustion phasing control such as model-based control. Since the combustion process in these engines is complex, typical models of the system are complex as well and there is a need for simpler, computationally efficient, control-oriented models of the dual fuel combustion process. In this paper, a mean-value combustion phasing model is designed and calibrated and two control strategies are proposed. Combustion phasing is predicted using a knock integral model, burn duration model and a Wiebe function and this model is used in both an adaptive closed loop controller and an open loop controller. These two control methodologies are tested and compared in simulations. Both control strategies are able to reach steady state in 5 cycles after a transient and have steady state errors in CA50 that are less than 0.1 crank angle degree (CAD) with the adaptive control strategy and less than 1.5 CAD with the model-based feedforward control method.
△ Less
Submitted 13 June, 2019;
originally announced June 2019.