-
GLINKX: A Scalable Unified Framework For Homophilous and Heterophilous Graphs
Authors:
Marios Papachristou,
Rishab Goel,
Frank Portman,
Matthew Miller,
Rong **
Abstract:
In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures: On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well in heterophilous graphs. In this work, we propose a novel scalable shallow method -- GLINKX -- that can work both on homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations, (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds and justify the components of GLINKX. Experimentally, we show its effectiveness on several homophilous and heterophilous datasets.
Submitted 18 November, 2022; v1 submitted 1 November, 2022;
originally announced November 2022.
-
MiCRO: Multi-interest Candidate Retrieval Online
Authors:
Frank Portman,
Stephen Ragain,
Ahmed El-Kishky
Abstract:
Providing personalized recommendations in an environment where items exhibit ephemerality and temporal relevancy (e.g. in social media) presents a few unique challenges: (1) inductively understanding ephemeral appeal for items in a setting where new items are created frequently, (2) adapting to trends within engagement patterns where items may undergo temporal shifts in relevance, (3) accurately modeling user preferences over this item space where users may express multiple interests. In this work we introduce MiCRO, a generative statistical framework that models multi-interest user preferences and temporal multi-interest item representations. Our framework is specifically formulated to adapt to both new items and temporal patterns of engagement. MiCRO demonstrates strong empirical performance on candidate retrieval experiments performed on two large scale user-item datasets: (1) an open-source temporal dataset of (User, User) follow interactions and (2) a temporal dataset of (User, Tweet) favorite interactions which we will open-source as an additional contribution to the community.
Submitted 28 October, 2022;
originally announced October 2022.
-
kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval
Authors:
Ahmed El-Kishky,
Thomas Markovich,
Kenny Leung,
Frank Portman,
Aria Haghighi,
Ying Xiao
Abstract:
Candidate retrieval is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. As the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach is to leverage approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates reflective of the user's multiple interests. To this end, we introduce kNN-Embed, a general approach to improving diversity in dense ANN-based retrieval. kNN-Embed represents each user as a smoothed mixture over learned item clusters that represent distinct "interests" of the user. By querying each of a user's mixture components in proportion to their mixture weights, we retrieve a high-diversity set of candidates reflecting elements from each of a user's interests. We experimentally compare kNN-Embed to standard ANN candidate retrieval, and show significant improvements in overall recall and improved diversity across three datasets. Accompanying this work, we open-source a large Twitter follow-graph dataset (https://huggingface.co/datasets/Twitter/TwitterFollowGraph), to spur further research in graph-mining and representation learning for recommender systems.
Submitted 5 August, 2023; v1 submitted 12 May, 2022;
originally announced May 2022.
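The kNN-Embed abstract above describes querying each mixture component in proportion to its weight. A minimal sketch of that idea follows. It is not the authors' implementation: the function names (`allocate_budget`, `retrieve`), the largest-remainder rounding, the one-query-vector-per-component layout, and the brute-force dot-product search standing in for ANN are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def allocate_budget(weights: Sequence[float], k: int) -> List[int]:
    """Split k retrieval slots across components in proportion to their weights.

    Uses largest-remainder rounding (an assumption) so the counts sum to k.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    shares = [k * w / total for w in weights]
    counts = [int(s) for s in shares]
    # Give leftover slots to the components with the largest fractional parts.
    leftovers = k - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    weights: Sequence[float],
    queries: Sequence[Sequence[float]],
    items: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Retrieve up to k item ids, querying once per mixture component.

    Hypothetical layout: queries[i] is the query vector for component i.
    Exhaustive scoring is used here in place of an ANN index.
    """
    counts = allocate_budget(weights, k)
    seen = set()
    results: List[str] = []
    for query, n in zip(queries, counts):
        ranked = sorted(items, key=lambda item_id: dot(query, items[item_id]), reverse=True)
        taken = 0
        for item_id in ranked:
            if taken == n:
                break
            if item_id not in seen:  # skip items already returned by another component
                seen.add(item_id)
                results.append(item_id)
                taken += 1
    return results


if __name__ == "__main__":
    toy_items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
    # One query returns the two nearest (near-duplicate) items.
    print(retrieve([1.0], [[1.0, 0.0]], toy_items, k=2))
    # Two equally weighted components return one item per interest.
    print(retrieve([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], toy_items, k=2))
```

In this toy setup, a single query vector retrieves two items from the same region of the embedding space, while splitting the same budget over two components returns one item from each region, which is the diversity effect the abstract refers to.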
-
TwHIN: Embedding the Twitter Heterogeneous Information Network for Personalized Recommendation
Authors:
Ahmed El-Kishky,
Thomas Markovich,
Serim Park,
Chetan Verma,
Baek** Kim,
Ramy Eskander,
Yury Malkov,
Frank Portman,
Sofía Samaniego,
Ying Xiao,
Aria Haghighi
Abstract:
Social networks, such as Twitter, form a heterogeneous information network (HIN) where nodes represent domain entities (e.g., user, content, advertiser, etc.) and edges represent one of many entity interactions (e.g., a user re-sharing content or "following" another). Interactions from multiple relation types can encode valuable information about social network entities not fully captured by a single relation; for instance, a user's preference for accounts to follow may depend on both user-content engagement interactions and the other users they follow. In this work, we investigate knowledge-graph embeddings for entities in the Twitter HIN (TwHIN); we show that these pretrained representations yield significant offline and online improvements for a diverse range of downstream recommendation and classification tasks: personalized ads rankings, account follow-recommendation, offensive content detection, and search ranking. We discuss design choices and practical challenges of deploying industry-scale HIN embeddings, including compressing them to reduce end-to-end model latency and handling parameter drift across versions.
Submitted 5 September, 2022; v1 submitted 10 February, 2022;
originally announced February 2022.
-
The 2021 RecSys Challenge Dataset: Fairness is not optional
Authors:
Luca Belli,
Alykhan Tejani,
Frank Portman,
Alexandre Lung-Yut-Fong,
Ben Chamberlain,
Yuanpu Xie,
Kristian Lum,
Jonathan Hunt,
Michael Bronstein,
Vito Walter Anelli,
Saikishore Kalloori,
Bruce Ferwerda,
Wenzhe Shi
Abstract:
Following the success of the RecSys 2020 Challenge, we describe a novel, larger dataset that was released in conjunction with the ACM RecSys Challenge 2021. This year's dataset is not only bigger (~1B data points, a 5-fold increase), but for the first time it takes fairness aspects of the challenge into consideration. Unlike many static datasets, it was kept in sync with the Twitter platform through considerable effort: if a user deleted their content, the same content would be promptly removed from the dataset too. In this paper, we introduce the dataset and challenge, highlighting some of the issues that arise when creating recommender systems at Twitter scale.
Submitted 21 September, 2021; v1 submitted 16 September, 2021;
originally announced September 2021.
-
Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline
Authors:
Luca Belli,
Sofia Ira Ktena,
Alykhan Tejani,
Alexandre Lung-Yut-Fong,
Frank Portman,
Xiao Zhu,
Yuanpu Xie,
Akshay Gupta,
Michael Bronstein,
Amra Delić,
Gabriele Sottocornola,
Walter Anelli,
Nazareno Andrade,
Jessie Smith,
Wenzhe Shi
Abstract:
Recommender systems constitute the core engine of most social network platforms nowadays, aiming to maximize user satisfaction along with other key business objectives. Twitter is no exception. Although Twitter data has been extensively used to understand socioeconomic and political phenomena and user behaviour, the implicit feedback provided by users on Tweets through their engagements on the Home Timeline has only been explored to a limited extent. At the same time, there is a lack of large-scale public social network datasets that would enable the scientific community to both benchmark and build more powerful and comprehensive models that tailor content to user interests. By releasing an original dataset of 160 million Tweets along with engagement information, Twitter aims to address exactly that. In this release, special attention is paid to maintaining compliance with existing privacy laws. Apart from user privacy, this paper touches on the key challenges faced by researchers and professionals striving to predict user engagements. It further describes the key aspects of the RecSys 2020 Challenge that was organized by ACM RecSys in partnership with Twitter using this dataset.
Submitted 7 October, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.