-
The 2021 RecSys Challenge Dataset: Fairness is not optional
Authors:
Luca Belli,
Alykhan Tejani,
Frank Portman,
Alexandre Lung-Yut-Fong,
Ben Chamberlain,
Yuanpu Xie,
Kristian Lum,
Jonathan Hunt,
Michael Bronstein,
Vito Walter Anelli,
Saikishore Kalloori,
Bruce Ferwerda,
Wenzhe Shi
Abstract:
After the success the RecSys 2020 Challenge, we are describing a novel and bigger dataset that was released in conjunction with the ACM RecSys Challenge 2021. This year's dataset is not only bigger (~ 1B data points, a 5 fold increase), but for the first time it take into consideration fairness aspects of the challenge. Unlike many static datsets, a lot of effort went into making sure that the dat…
▽ More
After the success the RecSys 2020 Challenge, we are describing a novel and bigger dataset that was released in conjunction with the ACM RecSys Challenge 2021. This year's dataset is not only bigger (~ 1B data points, a 5 fold increase), but for the first time it take into consideration fairness aspects of the challenge. Unlike many static datsets, a lot of effort went into making sure that the dataset was synced with the Twitter platform: if a user deleted their content, the same content would be promptly removed from the dataset too. In this paper, we introduce the dataset and challenge, highlighting some of the issues that arise when creating recommender systems at Twitter scale.
△ Less
Submitted 21 September, 2021; v1 submitted 16 September, 2021;
originally announced September 2021.
-
Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline
Authors:
Luca Belli,
Sofia Ira Ktena,
Alykhan Tejani,
Alexandre Lung-Yut-Fong,
Frank Portman,
Xiao Zhu,
Yuanpu Xie,
Akshay Gupta,
Michael Bronstein,
Amra Delić,
Gabriele Sottocornola,
Walter Anelli,
Nazareno Andrade,
Jessie Smith,
Wenzhe Shi
Abstract:
Recommender systems constitute the core engine of most social network platforms nowadays, aiming to maximize user satisfaction along with other key business objectives. Twitter is no exception. Despite the fact that Twitter data has been extensively used to understand socioeconomic and political phenomena and user behaviour, the implicit feedback provided by users on Tweets through their engagemen…
▽ More
Recommender systems constitute the core engine of most social network platforms nowadays, aiming to maximize user satisfaction along with other key business objectives. Twitter is no exception. Despite the fact that Twitter data has been extensively used to understand socioeconomic and political phenomena and user behaviour, the implicit feedback provided by users on Tweets through their engagements on the Home Timeline has only been explored to a limited extent. At the same time, there is a lack of large-scale public social network datasets that would enable the scientific community to both benchmark and build more powerful and comprehensive models that tailor content to user interests. By releasing an original dataset of 160 million Tweets along with engagement information, Twitter aims to address exactly that. During this release, special attention is drawn on maintaining compliance with existing privacy laws. Apart from user privacy, this paper touches on the key challenges faced by researchers and professionals striving to predict user engagements. It further describes the key aspects of the RecSys 2020 Challenge that was organized by ACM RecSys in partnership with Twitter using this dataset.
△ Less
Submitted 7 October, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Homogeneity and change-point detection tests for multivariate data using rank statistics
Authors:
Alexandre Lung-Yut-Fong,
Céline Lévy-Leduc,
Olivier Cappé
Abstract:
Detecting and locating changes in highly multivariate data is a major concern in several current statistical applications. In this context, the first contribution of the paper is a novel non-parametric two-sample homogeneity test for multivariate data based on the well-known Wilcoxon rank statistic. The proposed two-sample homogeneity test statistic can be extended to deal with ordinal or censored…
▽ More
Detecting and locating changes in highly multivariate data is a major concern in several current statistical applications. In this context, the first contribution of the paper is a novel non-parametric two-sample homogeneity test for multivariate data based on the well-known Wilcoxon rank statistic. The proposed two-sample homogeneity test statistic can be extended to deal with ordinal or censored data as well as to test for the homogeneity of more than two samples. The second contribution of the paper concerns the use of the proposed test statistic to perform retrospective change-point analysis. It is first shown that the approach is computationally feasible even when looking for a large number of change-points thanks to the use of dynamic programming. Computable asymptotic $p$-values for the test are then provided in the case where a single potential change-point is to be detected. Compared to available alternatives, the proposed approach appears to be very reliable and robust. This is particularly true in situations where the data is contaminated by outliers or corrupted by noise and where the potential changes only affect subsets of the coordinates of the data.
△ Less
Submitted 9 February, 2012; v1 submitted 11 July, 2011;
originally announced July 2011.
-
Robust Retrospective Multiple Change-point Estimation for Multivariate Data
Authors:
Alexandre Lung-Yut-Fong,
Céline Lévy-Leduc,
Olivier Cappé
Abstract:
We propose a non-parametric statistical procedure for detecting multiple change-points in multidimensional signals. The method is based on a test statistic that generalizes the well-known Kruskal-Wallis procedure to the multivariate setting. The proposed approach does not require any knowledge about the distribution of the observations and is parameter-free. It is computationally efficient thanks…
▽ More
We propose a non-parametric statistical procedure for detecting multiple change-points in multidimensional signals. The method is based on a test statistic that generalizes the well-known Kruskal-Wallis procedure to the multivariate setting. The proposed approach does not require any knowledge about the distribution of the observations and is parameter-free. It is computationally efficient thanks to the use of dynamic programming and can also be applied when the number of change-points is unknown. The method is shown through simulations to be more robust than alternatives, particularly when faced with atypical distributions (e.g., with outliers), high noise levels and/or high-dimensional data.
△ Less
Submitted 10 February, 2011; v1 submitted 9 February, 2011;
originally announced February 2011.
-
Distributed detection/localization of change-points in high-dimensional network traffic data
Authors:
Alexandre Lung-Yut-Fong,
Céline Lévy-Leduc,
Olivier Cappé
Abstract:
We propose a novel approach for distributed statistical detection of change-points in high-volume network traffic. We consider more specifically the task of detecting and identifying the targets of Distributed Denial of Service (DDoS) attacks. The proposed algorithm, called DTopRank, performs distributed network anomaly detection by aggregating the partial information gathered in a set of network…
▽ More
We propose a novel approach for distributed statistical detection of change-points in high-volume network traffic. We consider more specifically the task of detecting and identifying the targets of Distributed Denial of Service (DDoS) attacks. The proposed algorithm, called DTopRank, performs distributed network anomaly detection by aggregating the partial information gathered in a set of network monitors. In order to address massive data while limiting the communication overhead within the network, the approach combines record filtering at the monitor level and a nonparametric rank test for doubly censored time series at the central decision site. The performance of the DTopRank algorithm is illustrated both on synthetic data as well as from a traffic trace provided by a major Internet service provider.
△ Less
Submitted 20 September, 2011; v1 submitted 30 September, 2009;
originally announced September 2009.