-
Dealing with zero-inflated data: achieving SOTA with a two-fold machine learning approach
Authors:
Jože M. Rožanec,
Gašper Petelin,
João Costa,
Blaž Bertalanič,
Gregor Cerar,
Marko Guček,
Gregor Papa,
Dunja Mladenić
Abstract:
In many cases, a machine learning model must learn to correctly predict a few data points with particular values of interest in a broader range of data where many target values are zero. Zero-inflated data can be found in diverse scenarios, such as lumpy and intermittent demands, power consumption for home appliances being turned on and off, impurity measurement in distillation processes, and even airport shuttle demand prediction. The presence of zeroes affects the models' learning and may result in poor performance. Furthermore, zeroes also distort the metrics used to compute the model's prediction quality. This paper showcases two real-world use cases (home appliances classification and airport shuttle demand prediction) where a hierarchical model applied in the context of zero-inflated data leads to excellent results. In particular, for home appliances classification, the weighted average of Precision, Recall, F1, and AUC ROC was increased by 27%, 34%, 49%, and 27%, respectively. Furthermore, it is estimated that the proposed approach is also four times more energy efficient than the SOTA approach against which it was compared. Two-fold models performed best in all cases when predicting airport shuttle demand, and the difference against other models has been proven to be statistically significant.
Submitted 12 October, 2023;
originally announced October 2023.
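The two-fold (hierarchical) idea in the abstract above can be illustrated with a minimal sketch: a first stage decides whether the target is zero, and a second stage, trained only on non-zero targets, predicts the magnitude. The class name `TwoFoldModel`, the nearest-centroid classifier, and the one-feature least-squares regressor are all assumptions chosen for brevity; the abstract does not say which models the authors used.

```python
from statistics import mean


class TwoFoldModel:
    """Hypothetical two-stage model for zero-inflated targets (illustration only)."""

    def fit(self, xs, ys):
        zero_xs = [x for x, y in zip(xs, ys) if y == 0]
        nz = [(x, y) for x, y in zip(xs, ys) if y != 0]
        nz_xs = [x for x, _ in nz]
        nz_ys = [y for _, y in nz]

        # Stage 1: nearest-centroid classifier, zero vs. non-zero target.
        self.zero_center = mean(zero_xs)
        self.nonzero_center = mean(nz_xs)

        # Stage 2: least-squares line fitted on non-zero targets only,
        # so the many zeroes cannot pull the regression toward zero.
        mx, my = mean(nz_xs), mean(nz_ys)
        sxx = sum((x - mx) ** 2 for x in nz_xs)
        sxy = sum((x - mx) * (y - my) for x, y in nz)
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = my - self.slope * mx
        return self

    def predict(self, x):
        # Run the regressor only when stage 1 predicts a non-zero target.
        is_nonzero = abs(x - self.nonzero_center) < abs(x - self.zero_center)
        return self.slope * x + self.intercept if is_nonzero else 0.0


# Toy data: half of the targets are exactly zero.
model = TwoFoldModel().fit(
    [0, 1, 2, 3, 10, 11, 12, 13],
    [0, 0, 0, 0, 20, 22, 24, 26],
)
```

In practice either stage could be swapped for any classifier or regressor; the point of the sketch is only the split between "is it zero?" and "how large is it?".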
-
Four algorithms to solve symmetric multi-type non-negative matrix tri-factorization problem
Authors:
Rok Hribar,
Timotej Hrga,
Gregor Papa,
Gašper Petelin,
Janez Povh,
Nataša Pržulj,
Vida Vukašinović
Abstract:
In this paper, we consider the symmetric multi-type non-negative matrix tri-factorization problem (SNMTF), which attempts to factorize several symmetric non-negative matrices simultaneously. This can be considered as a generalization of the classical non-negative matrix tri-factorization problem and includes a non-convex objective function which is a multivariate sixth degree polynomial and has a convex feasibility set. It has a special importance in data science, since it serves as a mathematical model for the fusion of different data sources in data clustering.
We develop four methods to solve the SNMTF. They are based on four theoretical approaches known from the literature: the fixed point method (FPM), the block-coordinate descent with projected gradient (BCD), the gradient method with exact line search (GM-ELS) and the adaptive moment estimation method (ADAM). For each of these methods we offer a software implementation: for the former two methods we use Matlab and for the latter two Python with the TensorFlow library.
We test these methods on three data-sets: one is a synthetic data-set we generated, while the other two represent real-life similarities between different objects.
Extensive numerical results show that with sufficient computing time all four methods perform satisfactorily and ADAM most often yields the best mean square error ($\mathrm{MSE}$). However, if the computation time is limited, FPM gives the best $\mathrm{MSE}$ because it shows the fastest convergence at the beginning.
All data-sets and codes are publicly available on our GitLab profile.
Submitted 10 December, 2020;
originally announced December 2020.
-
Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning
Authors:
Robin Vogel,
Aurélien Bellet,
Stephan Clémençon,
Ons Jelassi,
Guillaume Papa
Abstract:
The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points, such as metric learning, clustering or ranking, do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance between statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps rules the trade-off between accuracy and runtime. We then present some theoretical results highlighting the benefits brought by the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
Submitted 21 June, 2019;
originally announced June 2019.
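The repartitioning strategy described in the abstract above can be sketched for a pairwise (degree-two) U-statistic. Each simulated worker only sees pairs inside its own block; reshuffling between rounds exposes new pairs, and averaging over rounds is where the variance reduction comes from. The function names, the round-robin split, and the variance kernel are assumptions made for illustration, not the authors' code.

```python
import random
from itertools import combinations


def local_u_stat(points, kernel):
    """Average of the kernel over all pairs available on one worker."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("each worker needs at least two points")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def repartitioned_u_stat(data, kernel, n_workers, n_rounds, seed=0):
    """Average of block-wise U-statistics over several random repartitions.

    More rounds mean more pairs are seen (lower variance) at the cost of
    more shuffles, which is the accuracy/runtime trade-off in the abstract.
    """
    rng = random.Random(seed)
    data = list(data)
    round_estimates = []
    for _ in range(n_rounds):
        rng.shuffle(data)  # simulate repartitioning across workers
        blocks = [data[i::n_workers] for i in range(n_workers)]
        per_worker = [local_u_stat(block, kernel) for block in blocks]
        round_estimates.append(sum(per_worker) / n_workers)
    return sum(round_estimates) / n_rounds


def half_squared_diff(a, b):
    """Kernel whose U-statistic is the unbiased sample variance."""
    return (a - b) ** 2 / 2


estimate = repartitioned_u_stat(range(1, 9), half_squared_diff, n_workers=2, n_rounds=20)
```

With a single worker the estimator reduces to the full U-statistic, which for this kernel is the usual unbiased sample variance.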
-
Learning from Survey Training Samples: Rate Bounds for Horvitz-Thompson Risk Minimizers
Authors:
Clémençon Stephan,
Patrice Bertail,
Guillaume Papa
Abstract:
The generalization ability of minimizers of the empirical risk in the context of binary classification has been investigated under a wide variety of complexity assumptions for the collection of classifiers over which optimization is performed. In contrast, the vast majority of the works dedicated to this issue stipulate that the training dataset used to compute the empirical risk functional is composed of i.i.d. observations. Beyond the cases where training data are drawn uniformly without replacement among a large i.i.d. sample or modelled as a realization of a weakly dependent sequence of r.v.'s, statistical guarantees when the data used to train a classifier are drawn by means of a more general sampling/survey scheme and exhibit a complex dependence structure have not been documented yet. It is the main purpose of this paper to show that the theory of empirical risk minimization can be extended to situations where statistical learning is based on survey samples and knowledge of the related inclusion probabilities. Precisely, we prove that minimizing a weighted version of the empirical risk, referred to as the Horvitz-Thompson risk (HT risk), over a class of controlled complexity leads to a rate for the excess risk of the order $O_{\mathbb{P}}((κ_N (\log N)/n)^{1/2})$ with $κ_N=(n/N)/\min_{i\leq N}π_i$, when data are sampled by means of a rejective scheme of (deterministic) size $n$ within a statistical population of cardinality $N\geq n$, a generalization of basic sampling without replacement with unequal probability weights $π_i>0$. Extensions to other sampling schemes are then established by a coupling argument. Beyond theoretical results, numerical experiments are displayed in order to show the relevance of HT risk minimization and that ignoring the sampling scheme used to generate the training dataset may completely jeopardize the learning procedure.
Submitted 18 January, 2019; v1 submitted 11 October, 2016;
originally announced October 2016.
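The Horvitz-Thompson risk mentioned in the abstract above weights each sampled observation by the inverse of its inclusion probability $π_i$ and normalizes by the population size $N$. The following is a minimal sketch for the 0-1 loss with labels in {-1, +1}; the function names and the family of one-dimensional threshold classifiers are assumptions made for illustration, and the minimization is a plain search over a finite list of candidates.

```python
def ht_risk(classifier, sample, inclusion_probs, population_size):
    """Horvitz-Thompson estimate of the 0-1 risk.

    sample: list of (x, y) pairs with y in {-1, +1}
    inclusion_probs: pi_i > 0 for each sampled unit, in the same order
    """
    total = 0.0
    for (x, y), pi in zip(sample, inclusion_probs):
        if classifier(x) != y:
            total += 1.0 / pi  # units that were unlikely to be drawn count more
    return total / population_size


def threshold_classifier(t):
    """Hypothetical classifier family: predict +1 when x exceeds t."""
    return lambda x: 1 if x > t else -1


def minimize_ht_risk(thresholds, sample, inclusion_probs, population_size):
    """Return the candidate threshold with the smallest HT risk."""
    return min(
        thresholds,
        key=lambda t: ht_risk(
            threshold_classifier(t), sample, inclusion_probs, population_size
        ),
    )
```

When every inclusion probability equals $n/N$, the weights cancel and the HT risk coincides with the ordinary empirical risk over the sample.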
-
Survey schemes for stochastic gradient descent with applications to M-estimation
Authors:
Stéphan Clémençon,
Patrice Bertail,
Emilie Chautru,
Guillaume Papa
Abstract:
In certain situations that shall be undoubtedly more and more common in the Big Data era, the datasets available are so massive that computing statistics over the full sample is hardly feasible, if not infeasible. A natural approach in this context consists in using survey schemes and substituting the "full data" statistics with their counterparts based on the resulting random samples, of manageable size. It is the main purpose of this paper to investigate the impact of survey sampling with unequal inclusion probabilities on stochastic gradient descent-based M-estimation methods in large-scale statistical and machine-learning problems. Precisely, we prove that, in the presence of some a priori information, one may significantly increase asymptotic accuracy when choosing appropriate first order inclusion probabilities, without affecting complexity. These striking results are described here by limit theorems and are also illustrated by numerical experiments.
Submitted 9 January, 2015;
originally announced January 2015.
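A survey-sampled gradient step of the kind the abstract above studies can be sketched as follows. At each iteration a random subsample is drawn with unequal inclusion probabilities, and each sampled gradient is divided by its inclusion probability so that the estimate stays unbiased for the full-data average gradient. Poisson sampling (independent inclusion of each unit), the scalar parameter, the constant step size, and the name `ht_sgd` are all assumptions made for this sketch; the abstract does not fix these choices.

```python
import random


def ht_sgd(data, inclusion_probs, grad, theta0, step, n_steps, seed=0):
    """Gradient descent driven by Horvitz-Thompson gradient estimates.

    grad(theta, z) returns the gradient of the per-observation loss at theta.
    """
    rng = random.Random(seed)
    n_pop = len(data)
    theta = theta0
    for _ in range(n_steps):
        g = 0.0
        for z, pi in zip(data, inclusion_probs):
            # Poisson sampling: unit i enters the sample with probability pi.
            if rng.random() < pi:
                g += grad(theta, z) / pi  # inverse-probability weighting
        theta -= step * g / n_pop
    return theta


def squared_loss_grad(theta, z):
    """Gradient of (theta - z)^2 / 2, whose M-estimator is the sample mean."""
    return theta - z


observations = [1.0, 2.0, 3.0, 6.0]
# Sample the largest observation more often, as prior information might suggest.
theta_hat = ht_sgd(observations, [0.3, 0.3, 0.3, 0.9], squared_loss_grad, 0.0, 0.05, 2000)
```

Setting every inclusion probability to 1 recovers deterministic full-batch gradient descent, which is a convenient sanity check for the weighting.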
-
One-bit Decentralized Detection with a Rao Test for Multisensor Fusion
Authors:
D. Ciuonzo,
G. Papa,
G. Romano,
P. Salvo Rossi,
P. K. Willett
Abstract:
In this letter we propose the Rao test as a simpler alternative to the generalized likelihood ratio test (GLRT) for multisensor fusion. We consider sensors observing an unknown deterministic parameter with symmetric and unimodal noise. A decision fusion center (DFC) receives quantized sensor observations through error-prone binary symmetric channels and makes a global decision. We analyze the optimal quantizer thresholds and we study the performance of the Rao test in comparison to the GLRT. Also, a theoretical comparison is made and asymptotic performance is derived in a scenario with homogeneous sensors. All the results are confirmed through simulations.
Submitted 26 June, 2013;
originally announced June 2013.