-
Approximation of bivariate densities with compositional splines
Authors:
Stanislav Škorňa,
Jitka Machalová,
Jana Burkotová,
Karel Hron,
Sonja Greven
Abstract:
Reliable estimation and approximation of probability density functions are fundamental for their further processing. However, their specific properties, i.e. scale invariance and relative scale, prevent the use of standard methods of spline approximation and have to be considered when building a suitable spline basis. Bayes Hilbert space methodology makes it possible to account for these properties of densities and enables their conversion to a standard Lebesgue space of square integrable functions using the centered log-ratio transformation. As the transformed densities fulfill a zero integral constraint, the constraint should likewise be respected by any spline basis used. Bayes Hilbert space methodology also makes it possible to decompose bivariate densities into their interactive and independent parts with univariate marginals. As this yields a useful framework for studying the dependence structure between random variables, a spline basis should ideally admit a corresponding decomposition. This paper proposes a new spline basis for (transformed) bivariate densities respecting the desired zero integral property. We show that there is a one-to-one correspondence of this basis to a corresponding basis in the Bayes Hilbert space of bivariate densities using tools of this methodology. Furthermore, the spline representation and the resulting decomposition into interactive and independent parts are derived. Finally, this novel spline representation is evaluated in a simulation study and applied to empirical geochemical data.
Submitted 19 May, 2024;
originally announced May 2024.
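The decomposition into independent and interactive parts mentioned in the abstract above can be illustrated on a discretized density. The following is a minimal sketch, not the spline representation proposed in the paper: it assumes a strictly positive density evaluated on a uniform rectangular grid and replaces integrals by plain grid averages. The function name `decompose_clr` and the test densities are hypothetical.

```python
import numpy as np

def decompose_clr(density):
    """Split the clr of a gridded bivariate density into independent and interactive parts.

    Assumption: uniform grid, so integrals are approximated by simple means.
    """
    log_f = np.log(density)
    c = log_f - log_f.mean()               # discrete clr: zero overall mean
    row = c.mean(axis=1, keepdims=True)    # clr of the marginal in x (averaged over y)
    col = c.mean(axis=0, keepdims=True)    # clr of the marginal in y (averaged over x)
    independent = row + col                # part explained by the two marginals
    interaction = c - independent          # what remains: the dependence structure
    return c, independent, interaction

x = np.linspace(-2.0, 2.0, 60)
y = np.linspace(-2.0, 2.0, 50)
X, Y = np.meshgrid(x, y, indexing="ij")

product = np.exp(-0.5 * X**2) * np.exp(-0.5 * Y**2)   # independent case
dependent = product * np.exp(0.6 * X * Y)             # adds an x-y interaction

_, _, inter_product = decompose_clr(product)
c, indep, inter = decompose_clr(dependent)
```

In this discrete setting the interactive part vanishes for a product density, and the two parts are orthogonal with respect to the ordinary sum of products, which mirrors the orthogonal decomposition described in the abstract.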
-
Efficient spline orthogonal basis for representation of density functions
Authors:
Jana Burkotová,
Ivana Pavlů,
Hiba Nassar,
Jitka Machalová,
Karel Hron
Abstract:
Probability density functions form a specific class of functional data objects with intrinsic properties of scale invariance and relative scale characterized by the unit integral constraint. The Bayes spaces methodology respects their specific nature, and the centred log-ratio transformation enables processing such functional data in the standard Lebesgue space of square-integrable functions. As the data representing densities are frequently observed in their discrete form, the focus has been on their spline representation. Therefore, the crucial step in the approximation is to construct a proper spline basis reflecting their specific properties. Since centred log-ratio transformed densities form a subspace of functions with a zero integral constraint, the standard $B$-spline basis is no longer suitable. Recently, a new spline basis incorporating this zero integral property, called $Z\!B$-splines, was developed. However, this basis does not possess the orthogonal property, which is beneficial from a computational and application point of view. In this paper, we describe an efficient method for constructing an orthogonal $Z\!B$-splines basis, called $Z\!B$-splinets. The advantages of the $Z\!B$-splinet approach are foremost computational efficiency and locality of basis supports, which is desirable for data interpretability, e.g. in the context of functional principal component analysis. The proposed approach is demonstrated on an empirical demographic dataset.
Submitted 3 May, 2024;
originally announced May 2024.
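The zero integral constraint referred to in the abstract above follows directly from the definition of the centred log-ratio (clr) transformation on a bounded interval: the mean value of log f is subtracted from log f. A minimal numerical sketch, assuming a strictly positive density sampled on a grid and a trapezoidal approximation of the integral (the helper names `trapezoid` and `clr` are illustrative, not taken from the paper):

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule, written out to stay independent of the NumPy version."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def clr(density, grid):
    """clr on an interval: log f minus the mean value of log f over that interval."""
    log_f = np.log(density)
    length = grid[-1] - grid[0]
    return log_f - trapezoid(log_f, grid) / length

grid = np.linspace(0.0, 1.0, 201)
f = np.exp(-((grid - 0.3) ** 2) / 0.02) + 0.05   # positive, deliberately not normalised

f_clr = clr(f, grid)
clr_integral = trapezoid(f_clr, grid)                           # numerically zero
scale_gap = float(np.max(np.abs(clr(5.0 * f, grid) - f_clr)))   # numerically zero
```

The second check reflects scale invariance: multiplying the density by a positive constant leaves its clr representation unchanged, which is why any basis for clr-transformed densities has to live in the zero-integral subspace.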
-
Identifying Important Pairwise Logratios in Compositional Data with Sparse Principal Component Analysis
Authors:
Viktorie Nesrstová,
Ines Wilms,
Karel Hron,
Peter Filzmoser
Abstract:
Compositional data are characterized by the fact that their elemental information is contained in simple pairwise logratios of the parts that constitute the composition. While pairwise logratios are typically easy to interpret, the number of possible pairs to consider quickly becomes (too) large even for medium-sized compositions, which might hinder interpretability in further multivariate analyses. Sparse methods can therefore be useful to identify few, important pairwise logratios (respectively parts contained in them) from the total candidate set. To this end, we propose a procedure based on the construction of all possible pairwise logratios and employ sparse principal component analysis to identify important pairwise logratios. The performance of the procedure is demonstrated both with simulated and real-world data. In our empirical analyses, we propose three visual tools showing (i) the balance between sparsity and explained variability, (ii) stability of the pairwise logratios, and (iii) importance of the original compositional parts to aid practitioners with their model interpretation.
Submitted 23 November, 2023;
originally announced November 2023.
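To make the size of the candidate set in the abstract above concrete: a composition with D parts has D(D-1)/2 pairwise logratios. The sketch below only builds that candidate matrix; the sparse principal component analysis step is not shown, and the function name `pairwise_logratios` and the simulated data are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

def pairwise_logratios(X, names):
    """Return all ln(x_i / x_j), i < j, for each row of a strictly positive matrix X."""
    pairs = list(combinations(range(X.shape[1]), 2))
    log_X = np.log(X)
    values = np.column_stack([log_X[:, i] - log_X[:, j] for i, j in pairs])
    labels = [f"ln({names[i]}/{names[j]})" for i, j in pairs]
    return values, labels

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(5, 4))    # 5 hypothetical compositions with 4 parts
names = ["a", "b", "c", "d"]

values, labels = pairwise_logratios(X, names)

# Rescaling each row (e.g. closing to proportions) does not change any logratio.
closed = X / X.sum(axis=1, keepdims=True)
values_closed, _ = pairwise_logratios(closed, names)
```

With 4 parts there are 6 columns; with 50 parts there would already be 1225, which is the growth that motivates a sparse selection step.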
-
Exploratory functional data analysis of multivariate densities for the identification of agricultural soil contamination by risk elements
Authors:
Tomáš Matys Grygar,
Una Radojičić,
Ivana Pavlů,
Sonja Greven,
Johanna Genest Nešlehová,
Štěpánka Tůmová,
Karel Hron
Abstract:
Geochemical mapping of risk element concentrations in soils is performed in countries around the world. It results in large datasets of high analytical quality, which can be used to identify soils that violate individual legislative limits for safe food production. However, there is a lack of advanced data mining tools that would be suitable for sensitive exploratory data analysis of big data while respecting the natural variability of soil composition. To distinguish anthropogenic contamination from natural variation, the analysis of the entire data distributions for smaller sub-areas is key. In this article, we propose a new data mining method for geochemical mapping data based on functional data analysis of probability density functions in the framework of Bayes spaces after post-stratification of a big dataset to smaller districts. Proposed tools allow us to analyse the entire distribution, going beyond a superficial detection of extreme concentration anomalies. We illustrate the proposed methodology on a dataset gathered according to the Czech national legislation (1990-2009). Taking into account specific properties of probability density functions and recent results for orthogonal decomposition of multivariate densities enabled us to reveal real contamination patterns that were so far only suspected in Czech agricultural soils. We process the above Czech soil composition dataset by first compartmentalising it into spatial units, in particular the districts, and by subsequently clustering these districts according to diagnostic features of their uni- and multivariate distributions at high concentration ends. Comparison between compartments is key to the reliable distinction of diffuse contamination. In this work, we used soil contamination by Cu-bearing pesticides as an example for empirical testing of the proposed data mining approach.
Submitted 6 November, 2023; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Principal Balances of Compositional Data for Regression and Classification using Partial Least Squares
Authors:
V. Nesrstová,
I. Wilms,
J. Palarea-Albaladejo,
P. Filzmoser,
J. A. Martín-Fernández,
D. Friedecký,
K. Hron
Abstract:
High-dimensional compositional data are commonplace in the modern omics sciences amongst others. Analysis of compositional data requires a proper choice of orthonormal coordinate representation as their relative nature is not compatible with the direct use of standard statistical methods. Principal balances, a specific class of log-ratio coordinates, are well suited to this context since they are constructed in such a way that the first few coordinates capture most of the variability in the original data. Focusing on regression and classification problems in high dimensions, we propose a novel Partial Least Squares (PLS) based procedure to construct principal balances that maximize explained variability of the response variable and notably facilitates interpretability when compared to the ordinary PLS formulation. The proposed PLS principal balance approach can be understood as a generalized version of common logcontrast models, since multiple orthonormal (instead of one) logcontrasts are estimated simultaneously. We demonstrate the performance of the method using both simulated and real data sets.
Submitted 3 November, 2022;
originally announced November 2022.
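For readers unfamiliar with balances, the abstract above relies on the standard definition: a balance between two disjoint groups of parts is a normalised log-ratio of their geometric means, and it is a logcontrast whose coefficients sum to zero and have unit norm. The sketch below shows only this building block, not the PLS-based construction of principal balances; the function names and the example composition are hypothetical.

```python
import numpy as np

def balance_coefficients(D, numerator, denominator):
    """Logcontrast coefficients of the balance between two disjoint groups of parts."""
    r, s = len(numerator), len(denominator)
    coef = np.zeros(D)
    coef[list(numerator)] = np.sqrt(s / (r * (r + s)))
    coef[list(denominator)] = -np.sqrt(r / (s * (r + s)))
    return coef

def balance(x, numerator, denominator):
    """Balance value: sqrt(r*s/(r+s)) * ln(gm(numerator parts) / gm(denominator parts))."""
    x = np.asarray(x, dtype=float)
    return float(balance_coefficients(x.size, numerator, denominator) @ np.log(x))

x = np.array([0.30, 0.20, 0.25, 0.15, 0.10])   # hypothetical 5-part composition
num, den = [0, 1], [2, 3, 4]

coef = balance_coefficients(x.size, num, den)
b = balance(x, num, den)

# Direct evaluation from geometric means, for comparison.
gm_num = np.exp(np.mean(np.log(x[num])))
gm_den = np.exp(np.mean(np.log(x[den])))
b_direct = float(np.sqrt(2 * 3 / (2 + 3)) * np.log(gm_num / gm_den))
```

Because the coefficients sum to zero, the balance is unchanged when the composition is rescaled, so it depends only on the relative information in the parts.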
-
Compositional Cubes: A New Concept for Multi-factorial Compositions
Authors:
Kamila Fačevicová,
Peter Filzmoser,
Karel Hron
Abstract:
Compositional data are commonly known as multivariate observations carrying relative information. Even though the case of vector or even two-factorial compositional data (compositional tables) is already well described in the literature, there is still a need for a comprehensive approach to the analysis of multi-factorial relative-valued data. Therefore, this contribution builds on the current knowledge about compositional data to develop a general theory for working with k-factorial compositional data. The main finding is that, as in the case of compositional tables, multi-factorial structures can be orthogonally decomposed into an independent part and several interactive parts and, moreover, that a coordinate representation allowing for their separate analysis by standard analytical methods can be constructed. For the sake of simplicity, these features are explained in detail for the case of three-factorial compositions (compositional cubes), followed by an outline covering the general case. The three-dimensional structure is analysed in depth in two practical examples, dealing with systems of spatial and time dependent compositional cubes. The methodology is implemented in the R package robCompositions.
Submitted 25 January, 2022;
originally announced January 2022.
-
Robust Principal Component Analysis for Compositional Tables
Authors:
Julie Rendlová,
Karel Hron,
Kamila Fačevicová,
Peter Filzmoser
Abstract:
A data table which is arranged according to two factors can often be considered as a compositional table. An example is the number of unemployed people, split according to gender and age classes. Analyzed as compositions, the relevant information would consist of ratios between different cells of such a table. This is particularly useful when analyzing several compositional tables jointly, where the absolute numbers are in very different ranges, e.g. if unemployment data are considered from different countries. Within the framework of the logratio methodology, compositional tables can be decomposed into independent and interactive parts, and orthonormal coordinates can be assigned to these parts. However, these coordinates usually require some prior knowledge about the data, and they are not easy to handle for exploring the relationships between the given factors.
Here we propose a special choice of coordinates with a direct relation to centered logratio (clr) coefficients, which are particularly useful for an interpretation in terms of the original cells of the tables. With these coordinates, robust principal component analysis (PCA) is performed for dimension reduction, which allows the relationships between the factors to be investigated. The link between orthonormal coordinates and clr coefficients makes it possible to apply robust PCA, which would otherwise suffer from the singularity of clr coefficients.
Submitted 11 April, 2019;
originally announced April 2019.
-
Interpretation of Compositional Regression with Application to Time Budget Analysis
Authors:
Ivo Muller,
Karel Hron,
Eva Fiserova,
Jan Smahaj,
Panajotis Cakirpaloglu,
Jana Vancakova
Abstract:
Regression with compositional response or covariates, or even regression between parts of a composition, is frequently employed in social sciences. Among other possible applications, it may help to reveal interesting features in time allocation analysis. As individual activities represent relative contributions to the total amount of time, statistical processing of raw data (frequently represented directly as proportions or percentages) using standard methods may lead to biased results. Specific geometrical features of time budget variables are captured by the logratio methodology of compositional data, whose aim is to build (preferably orthonormal) coordinates to be applied with popular statistical methods. The aim of this paper is to present recent tools of regression analysis within the logratio methodology and apply them to reveal potential relationships among psychometric indicators in a real-world data set. In particular, orthogonal logratio coordinates have been introduced to enhance the interpretability of coefficients in regression models.
Submitted 26 September, 2016;
originally announced September 2016.
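The abstract above refers to orthonormal logratio coordinates without naming a specific system. One common choice, used here purely as an assumption for illustration, is pivot coordinates, where the first coordinate contrasts one part (for example one activity in a time budget) with the geometric mean of all remaining parts. The function names and the example day are hypothetical.

```python
import numpy as np

def pivot_coordinates(x):
    """Orthonormal (pivot) logratio coordinates of a strictly positive composition.

    z_i = sqrt(m / (m + 1)) * ln(x_i / gm(x_{i+1}, ..., x_D)), with m = number of later parts.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    D = log_x.size
    z = np.empty(D - 1)
    for i in range(D - 1):
        m = D - i - 1                      # parts that come after part i
        z[i] = np.sqrt(m / (m + 1)) * (log_x[i] - log_x[i + 1:].mean())
    return z

def clr(x):
    """Centred logratio coefficients: log parts minus their mean."""
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x - log_x.mean()

day_hours = np.array([8.0, 7.5, 2.0, 6.5])   # hypothetical: sleep, work, sport, other
z = pivot_coordinates(day_hours)
z_proportions = pivot_coordinates(day_hours / day_hours.sum())
```

The coordinates are identical whether the day is expressed in hours or in proportions, and their Euclidean norm matches that of the clr coefficients, which is what allows standard regression tools to be applied to them.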
-
Preprocessing of centred logratio transformed density functions using smoothing splines
Authors:
Jitka Machalova,
Karel Hron,
Gianna Serafina Monti
Abstract:
With large-scale database systems, statistical analysis of data formed by probability distributions becomes an important task in explorative data analysis. Nevertheless, due to specific properties of density functions, their proper statistical treatment still represents a challenging task in functional data analysis. Namely, the usual L2 metric does not fully account for the relative character of information carried by density functions; instead, their geometrical features are captured by Bayes spaces of measures. The easiest way of expressing density functions in L2 space is to use the centred logratio transformation; however, it results in functional data with a constant integral constraint that needs to be taken into account in further analysis. While the theoretical background for reasonable analysis of density functions is already provided comprehensively by Bayes spaces themselves, preprocessing issues still need to be developed. The aim of this paper is to introduce optimal smoothing splines for centred logratio transformed density functions that take all their specific features into account and provide a concise methodology for reasonable preprocessing of raw (discretized) distributional observations. Theoretical developments are illustrated with a real-world data set from official statistics.
Submitted 28 January, 2015;
originally announced January 2015.