-
RecSys Challenge 2023: From data preparation to prediction, a simple, efficient, robust and scalable solution
Authors:
Maxime Manderlier,
Fabian Lecron
Abstract:
The RecSys Challenge 2023, presented by ShareChat, consists to predict if an user will install an application on his smartphone after having seen advertising impressions in ShareChat & Moj apps. This paper presents the solution of 'Team UMONS' to this challenge, giving accurate results (our best score is 6.622686) with a relatively small model that can be easily implemented in different production…
▽ More
The RecSys Challenge 2023, presented by ShareChat, consists to predict if an user will install an application on his smartphone after having seen advertising impressions in ShareChat & Moj apps. This paper presents the solution of 'Team UMONS' to this challenge, giving accurate results (our best score is 6.622686) with a relatively small model that can be easily implemented in different production configurations. Our solution scales well when increasing the dataset size and can be used with datasets containing missing values.
△ Less
Submitted 12 January, 2024;
originally announced January 2024.
-
Bounded Simplex-Structured Matrix Factorization: Algorithms, Identifiability and Applications
Authors:
Olivier Vu Thanh,
Nicolas Gillis,
Fabian Lecron
Abstract:
In this paper, we propose a new low-rank matrix factorization model dubbed bounded simplex-structured matrix factorization (BSSMF). Given an input matrix $X$ and a factorization rank $r$, BSSMF looks for a matrix $W$ with $r$ columns and a matrix $H$ with $r$ rows such that $X \approx WH$ where the entries in each column of $W$ are bounded, that is, they belong to given intervals, and the columns…
▽ More
In this paper, we propose a new low-rank matrix factorization model dubbed bounded simplex-structured matrix factorization (BSSMF). Given an input matrix $X$ and a factorization rank $r$, BSSMF looks for a matrix $W$ with $r$ columns and a matrix $H$ with $r$ rows such that $X \approx WH$ where the entries in each column of $W$ are bounded, that is, they belong to given intervals, and the columns of $H$ belong to the probability simplex, that is, $H$ is column stochastic. BSSMF generalizes nonnegative matrix factorization (NMF), and simplex-structured matrix factorization (SSMF). BSSMF is particularly well suited when the entries of the input matrix $X$ belong to a given interval; for example when the rows of $X$ represent images, or $X$ is a rating matrix such as in the Netflix and MovieLens datasets where the entries of $X$ belong to the interval $[1,5]$. The simplex-structured matrix $H$ not only leads to an easily understandable decomposition providing a soft clustering of the columns of $X$, but implies that the entries of each column of $WH$ belong to the same intervals as the columns of $W$. In this paper, we first propose a fast algorithm for BSSMF, even in the presence of missing data in $X$. Then we provide identifiability conditions for BSSMF, that is, we provide conditions under which BSSMF admits a unique decomposition, up to trivial ambiguities. Finally, we illustrate the effectiveness of BSSMF on two applications: extraction of features in a set of images, and the matrix completion problem for recommender systems.
△ Less
Submitted 31 March, 2023; v1 submitted 26 September, 2022;
originally announced September 2022.
-
A One-Class Classification Decision Tree Based on Kernel Density Estimation
Authors:
Sarah Itani,
Fabian Lecron,
Philippe Fortemps
Abstract:
One-class Classification (OCC) is an area of machine learning which addresses prediction based on unbalanced datasets. Basically, OCC algorithms achieve training by means of a single class sample, with potentially some additional counter-examples. The current OCC models give satisfaction in terms of performance, but there is an increasing need for the development of interpretable models. In the pr…
▽ More
One-class Classification (OCC) is an area of machine learning which addresses prediction based on unbalanced datasets. Basically, OCC algorithms achieve training by means of a single class sample, with potentially some additional counter-examples. The current OCC models give satisfaction in terms of performance, but there is an increasing need for the development of interpretable models. In the present work, we propose a one-class model which addresses concerns of both performance and interpretability. Our hybrid OCC method relies on density estimation as part of a tree-based learning algorithm, called One-Class decision Tree (OC-Tree). Within a greedy and recursive approach, our proposal rests on kernel density estimation to split a data subset on the basis of one or several intervals of interest. Thus, the OC-Tree encloses data within hyper-rectangles of interest which can be described by a set of rules. Against state-of-the-art methods such as Cluster Support Vector Data Description (ClusterSVDD), One-Class Support Vector Machine (OCSVM) and isolation Forest (iForest), the OC-Tree performs favorably on a range of benchmark datasets. Furthermore, we propose a real medical application for which the OC-Tree has demonstrated its effectiveness, through the ability to tackle interpretable diagnosis aid based on unbalanced datasets.
△ Less
Submitted 20 March, 2020; v1 submitted 14 May, 2018;
originally announced May 2018.