-
sfislands: An R Package for Accommodating Islands and Disjoint Zones in Areal Spatial Modelling
Authors:
Kevin Horan,
Katarina Domijan,
Chris Brunsdon
Abstract:
Fitting areal models which use a spatial weights matrix to represent relationships between geographical units can be a cumbersome task, particularly when these units are not well-behaved. The two chief aims of sfislands are to simplify the process of creating an appropriate neighbourhood matrix, and to quickly visualise the predictions of subsequent models. The package uses visual aids in the form of easily-generated maps to help this process. This paper demonstrates how sfislands could be useful to researchers. It begins by describing the package's functions in the context of a proposed workflow. It then presents three worked examples showing a selection of potential use-cases. These range from earthquakes in Indonesia, to river crossings in London, and hierarchical models of output areas in Liverpool. We aim to show how the sfislands package streamlines much of the human workflow involved in creating and examining such models.
Submitted 15 April, 2024;
originally announced April 2024.
-
What is to be gained by ensemble models in analysis of spectroscopic data?
Authors:
Katarina Domijan
Abstract:
An empirical study was carried out to compare different implementations of ensemble models aimed at improving prediction in spectroscopic data. A wide range of candidate models were fitted to benchmark datasets from regression and classification settings. A statistical analysis using a linear mixed model was carried out on prediction performance criteria resulting from model fits over random splits of the data. The results showed that the ensemble classifiers were able to consistently outperform candidate models in our application.
Submitted 2 April, 2024;
originally announced April 2024.
-
Subjective assessment of the impact of a content adaptive optimiser for compressing 4K HDR content with AV1
Authors:
Vibhoothi,
Angeliki Katsenou,
François Pitié,
Katarina Domijan,
Anil Kokaram
Abstract:
Since 2015 video dimensionality has expanded to higher spatial and temporal resolutions and a wider colour gamut. This High Dynamic Range (HDR) content has gained traction in the consumer space as it delivers an enhanced quality of experience. At the same time, the complexity of codecs is growing. This has driven the development of tools for content-adaptive optimisation that achieve optimal rate-distortion performance for HDR video at 4K resolution. While improvements of just a few percentage points in BD-Rate (1-5%) are significant for the streaming media industry, the impact on subjective quality has been less studied, especially for HDR/AV1. In this paper, we conduct a subjective quality assessment (42 subjects) of 4K HDR content with a per-clip optimisation strategy. We correlate these subjective scores with existing popular objective metrics used in standard development and show that some perceptual metrics correlate surprisingly well even though they are not tuned for HDR. We find that the DSCQS protocol is too insensitive to categorically compare the methods, but the data allows us to make recommendations about the use of experts vs non-experts in HDR studies, and explain the subjective impact of film grain in HDR content under compression.
Submitted 26 June, 2023;
originally announced June 2023.
-
Classification of cow diet based on milk mid infrared spectra: a data analysis competition at the "International workshop of spectroscopy and chemometrics 2022"
Authors:
Maria Frizzarin,
Giulio Visentin,
Alessandro Ferragina,
Elena Hayes,
Antonio Bevilacqua,
Bhaskar Dhariyal,
Katarina Domijan,
Hussain Khan,
Georgiana Ifrim,
Thach Le Nguyen,
Joe Meagher,
Laura Menchetti,
Ashish Singh,
Suzy Whoriskey,
Robert Williamson,
Martina Zappaterra,
Alessandro Casa
Abstract:
In April 2022, the Vistamilk SFI Research Centre organized the second edition of the "International Workshop on Spectroscopy and Chemometrics - Applications in Food and Agriculture". Within this event, a data challenge was organized among participants of the workshop. The competition aimed at developing a prediction model to discriminate dairy cows' diet based on milk spectral information collected in the mid-infrared region. The development of an accurate and reliable discriminant model for dairy cows' diet can provide important authentication tools for dairy processors to guarantee product origin for dairy food manufacturers from grass-fed animals. Different statistical and machine learning modelling approaches were employed during the workshop, with different pre-processing steps involved and different degrees of complexity. The present paper aims to describe the statistical methods adopted by participants to develop such a classification model.
Submitted 10 October, 2022;
originally announced October 2022.
-
Hierarchical Embedded Bayesian Additive Regression Trees
Authors:
Bruna Wundervald,
Andrew Parnell,
Katarina Domijan
Abstract:
We propose a simple yet powerful extension of Bayesian Additive Regression Trees which we name Hierarchical Embedded BART (HE-BART). The model allows for random effects to be included at the terminal node level of a set of regression trees, making HE-BART a non-parametric alternative to mixed effects models which avoids the need for the user to specify the structure of the random effects in the model, whilst maintaining the prediction and uncertainty calibration properties of standard BART. Using simulated and real-world examples, we demonstrate that this new extension yields superior predictions for many of the standard mixed effects models' example data sets, and yet still provides consistent estimates of the random effect variances. In a future version of this paper, we will outline its use in larger, more advanced data sets and structures.
Submitted 24 April, 2023; v1 submitted 14 April, 2022;
originally announced April 2022.
-
Mid infrared spectroscopy and milk quality traits: a data analysis competition at the "International Workshop on Spectroscopy and Chemometrics 2021"
Authors:
Maria Frizzarin,
Antonio Bevilacqua,
Bhaskar Dhariyal,
Katarina Domijan,
Federico Ferraccioli,
Elena Hayes,
Georgiana Ifrim,
Agnieszka Konkolewska,
Thach Le Nguyen,
Uche Mbaka,
Giovanna Ranzato,
Ashish Singh,
Marco Stefanucci,
Alessandro Casa
Abstract:
A chemometric data analysis challenge was arranged during the first edition of the "International Workshop on Spectroscopy and Chemometrics", organized by the Vistamilk SFI Research Centre and held online in April 2021. The aim of the competition was to build a calibration model to predict milk quality traits using only the information contained in mid-infrared spectra. Three different traits were provided, presenting heterogeneous degrees of prediction complexity, thus possibly requiring trait-specific modelling choices. In this paper the different approaches adopted by the participants are outlined and the insights obtained from the analyses are critically discussed.
Submitted 19 September, 2022; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Interactive slice visualization for exploring machine learning models
Authors:
Catherine B. Hurley,
Mark O'Connell,
Katarina Domijan
Abstract:
Machine learning models fit complex algorithms to arbitrarily large datasets. These algorithms are well-known to be high on performance and low on interpretability. We use interactive visualization of slices of predictor space to address the interpretability deficit; in effect opening up the black-box of machine learning algorithms, for the purpose of interrogating, explaining, validating and comparing model fits. Slices are specified directly through interaction, or using various touring algorithms designed to visit high-occupancy sections or regions where the model fits have interesting properties. The methods presented here are implemented in the R package condvis2.
Submitted 7 September, 2021; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Generalizing Gain Penalization for Feature Selection in Tree-based Models
Authors:
Bruna Wundervald,
Andrew Parnell,
Katarina Domijan
Abstract:
We develop a new approach for feature selection via gain penalization in tree-based models. First, we show that previous methods do not perform sufficient regularization and often exhibit sub-optimal out-of-sample performance, especially when correlated features are present. Instead, we develop a new gain penalization idea that exhibits a general local-global regularization for tree-based models. The new method allows for more flexibility in the choice of feature-specific importance weights. We validate our method on both simulated and real data and implement it as an extension of the popular R package ranger.
Submitted 12 June, 2020;
originally announced June 2020.
-
Comparison of Machine Learning Models in Food Authentication Studies
Authors:
Manokamna Singh,
Katarina Domijan
Abstract:
The underlying objective of food authentication studies is to determine whether unknown food samples have been correctly labelled. In this paper we study three near infrared (NIR) spectroscopic datasets from food samples of different types: meat samples (labelled by species), olive oil samples (labelled by their geographic origin) and honey samples (labelled as pure or adulterated by different adulterants). We apply and compare a large number of classification, dimension reduction and variable selection approaches to these datasets. NIR data pose specific challenges to classification and variable selection: the datasets are high-dimensional, with the number of cases ($n$) much smaller than the number of features ($p$), i.e. $n \ll p$, and the recorded features are highly serially correlated. In this paper we carry out a comparative analysis of different approaches and find that partial least squares, a classic tool employed for these types of data, outperforms all the other approaches considered.
Submitted 17 May, 2019;
originally announced May 2019.
-
Solar flare forecasting from magnetic feature properties generated by Solar Monitor Active Region Tracker
Authors:
Katarina Domijan,
D. Shaun Bloomfield,
Francois Pitie
Abstract:
We study the predictive capabilities of magnetic feature properties (MF) generated by Solar Monitor Active Region Tracker (SMART) for solar flare forecasting from two datasets: the full dataset of SMART detections from 1996 to 2010 that has been previously studied by Ahmed et al. (2011) and a subset of that dataset which only includes detections that are NOAA active regions (ARs). Main contributions: we use marginal relevance as a filter feature selection method to identify the most useful SMART MF properties for separating flaring from non-flaring detections and logistic regression to derive classification rules to predict future observations. For comparison, we employ a Random Forest, Support Vector Machine and a set of Deep Neural Network models, as well as Lasso for feature selection. Using the linear model with three features we obtain significantly better results (TSS=0.84) than those reported by Ahmed et al. (2011) for the full dataset of SMART detections. The same model produced competitive results (TSS=0.67) for the dataset of SMART detections that are NOAA ARs, which can be compared to a broader section of flare forecasting literature. We show that more complex models are not required for this data.
Submitted 6 December, 2018;
originally announced December 2018.
-
Conditional Visualization for Statistical Models: An Introduction to the condvis Package in R
Authors:
Mark O'Connell,
Catherine B. Hurley,
Katarina Domijan
Abstract:
The condvis package is for interactive visualization of sections in data space, showing fitted models on the section, and observed data near the section. The primary goal is the interpretation of complex models, and showing how the observed data support the fitted model. There is a video accompaniment to this paper available at https://www.youtube.com/watch?v=rKFq7xwgdX0. This is a preprint version of an article to appear in the Journal of Statistical Software.
Submitted 2 October, 2016;
originally announced October 2016.