Search | arXiv e-print repository

arXiv:2404.19591

doi 10.1145/3650203.3663327

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Authors: Stefan Grafberger, Paul Groth, Sebastian Schelter

Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision to apply incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.

Submitted 30 April, 2024; originally announced April 2024.

ACM Class: H.2; H.2.8; H.4; D.2.6; I.2

arXiv:2310.12809

Hierarchical Forecasting at Scale

Authors: Olivier Sprangers, Wander Wadman, Sebastian Schelter, Maarten de Rijke

Abstract: Existing hierarchical forecasting techniques scale poorly when the number of time series increases. We propose to learn a coherent forecast for millions of time series with a single bottom-level forecast model by using a sparse loss function that directly optimizes the hierarchical product and/or temporal structure. The benefit of our sparse hierarchical loss function is that it provides practitioners a method of producing bottom-level forecasts that are coherent to any chosen cross-sectional or temporal hierarchy. In addition, removing the need for a post-processing step as required in traditional hierarchical forecasting techniques reduces the computational cost of the prediction phase in the forecasting pipeline. On the public M5 dataset, our sparse hierarchical loss function performs up to 10% (RMSE) better compared to the baseline loss function. We implement our sparse hierarchical loss function within an existing forecasting model at bol, a large European e-commerce platform, resulting in an improved forecasting performance of 2% at the product level. Finally, we found an increase in forecasting performance of about 5-10% when evaluating the forecasting performance across the cross-sectional hierarchies that we defined. These results demonstrate the usefulness of our sparse hierarchical loss applied to a production forecasting system at a major e-commerce platform.

Submitted 26 February, 2024; v1 submitted 19 October, 2023; originally announced October 2023.
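
The idea of obtaining coherent forecasts from a single bottom-level model, as described in the abstract above, can be illustrated with a minimal sketch. It is not the paper's sparse loss function, whose exact form the abstract does not give; it only shows that aggregates computed by summing bottom-level forecasts are coherent by construction, and that an error can then be measured across all levels of the hierarchy. The series names, the hierarchy, and the numbers are hypothetical.

```python
from collections import defaultdict


def aggregate(bottom, hierarchy):
    """Sum each bottom-level value into every aggregate it belongs to."""
    totals = defaultdict(float)
    for series, value in bottom.items():
        totals[series] += value
        for parent in hierarchy[series]:
            totals[parent] += value
    return dict(totals)


def hierarchical_squared_error(forecast, actual, hierarchy):
    """Mean squared error over bottom-level series and all their aggregates."""
    f = aggregate(forecast, hierarchy)
    a = aggregate(actual, hierarchy)
    return sum((f[key] - a[key]) ** 2 for key in a) / len(a)


# Hypothetical two-level product hierarchy: SKU -> category -> total.
hierarchy = {
    "sku_a": ["category_1", "total"],
    "sku_b": ["category_1", "total"],
    "sku_c": ["category_2", "total"],
}
forecast = {"sku_a": 10.0, "sku_b": 5.0, "sku_c": 2.0}
actual = {"sku_a": 8.0, "sku_b": 6.0, "sku_c": 2.0}

levels = aggregate(forecast, hierarchy)
loss = hierarchical_squared_error(forecast, actual, hierarchy)
print(levels["category_1"], levels["total"], loss)
```

Because every aggregate is derived from the bottom-level values, no separate reconciliation step is needed in this toy setting, which is the property the abstract highlights.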

arXiv:2307.03027

Improving Retrieval-Augmented Large Language Models via Data Importance Learning

Authors: Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang

Abstract: Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further propose an even more efficient (ε, δ)-approximation algorithm. Our experimental results illustrate that we can enhance the performance of large language models by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2305.00857

doi 10.1145/3539618.3591745

On the Impact of Outlier Bias on User Clicks

Authors: Fatemeh Sarvi, Ali Vardasbi, Mohammad Aliannejadi, Sebastian Schelter, Maarten de Rijke

Abstract: User interaction data is an important source of supervision in counterfactual learning to rank (CLTR). Such data suffers from presentation bias. Much work in unbiased learning to rank (ULTR) focuses on position bias, i.e., items at higher ranks are more likely to be examined and clicked. Inter-item dependencies also influence examination probabilities, with outlier items in a ranking as an important example. Outliers are defined as items that observably deviate from the rest and therefore stand out in the ranking. In this paper, we identify and introduce the bias brought about by outlier items: users tend to click more on outlier items and their close neighbors. To this end, we first conduct a controlled experiment to study the effect of outliers on user clicks. Next, to examine whether the findings from our controlled experiment generalize to naturalistic situations, we explore real-world click logs from an e-commerce platform. We show that, in both scenarios, users tend to click significantly more on outlier items than on non-outlier items in the same rankings. We show that this tendency holds for all positions, i.e., for any specific position, an item receives more interactions when presented as an outlier as opposed to a non-outlier item. We conclude from our analysis that the effect of outliers on clicks is a type of bias that should be addressed in ULTR. We therefore propose an outlier-aware click model that accounts for both outlier and position bias, called outlier-aware position-based model (OPBM). We estimate click propensities based on OPBM; through extensive experiments performed on both real-world e-commerce data and semi-synthetic data, we verify the effectiveness of our outlier-aware click model. Our results show the superiority of OPBM against baselines in terms of ranking performance and true relevance estimation.

Submitted 1 May, 2023; originally announced May 2023.

Comments: Accepted at SIGIR'23, Full Paper Track
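
For readers unfamiliar with position bias, the standard position-based click model that the abstract above builds on can be sketched as follows. This is only the plain position-based model, in which the click probability is the product of a rank-dependent examination probability and an item's relevance; the outlier-aware extension (OPBM) is not reproduced here because the abstract does not specify its form. The examination probabilities and click counts are hypothetical.

```python
def click_probability(examination, relevance):
    """Position-based model: a click requires examination and relevance."""
    return examination * relevance


def estimate_relevance(clicks, impressions, examination):
    """Inverse-propensity estimate: divide the click rate by the examination
    probability of the rank at which the item was shown."""
    return clicks / (impressions * examination)


# Hypothetical examination probabilities per rank (not taken from the paper).
examination_by_rank = {1: 1.0, 2: 0.5, 3: 0.25}

# An item shown 1000 times at rank 2 and clicked 50 times.
estimate = estimate_relevance(50, 1000, examination_by_rank[2])
predicted = click_probability(examination_by_rank[2], estimate)
print(estimate, predicted)
```

The raw click rate here is 0.05, but correcting for the halved examination probability at rank 2 gives a relevance estimate of 0.1. A model that also accounts for outliers would adjust the examination term further.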

arXiv:2204.11131

Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines

Authors: Bojan Karlaš, David Dao, Matteo Interlandi, Bo Li, Sebastian Schelter, Wentao Wu, Ce Zhang

Abstract: Developing modern machine learning (ML) applications is data-centric, of which one fundamental challenge is to understand the influence of data quality to ML training -- "Which training examples are 'guilty' in making the trained ML model predictions inaccurate or unfair?" Modeling data influence for ML training has attracted intensive interest over the last decade, and one popular framework is to compute the Shapley value of each training example with respect to utilities such as validation accuracy and fairness of the trained ML model. Unfortunately, despite recent intensive interest and research, existing methods only consider a single ML model "in isolation" and do not consider an end-to-end ML pipeline that consists of data transformations, feature extractors, and ML training. We present DataScope (ease.ml/datascope), the first system that efficiently computes Shapley values of training examples over an end-to-end ML pipeline, and illustrate its applications in data debugging for ML training. To this end, we first develop a novel algorithmic framework that computes Shapley value over a specific family of ML pipelines that we call canonical pipelines: a positive relational algebra query followed by a K-nearest-neighbor (KNN) classifier. We show that, for many subfamilies of canonical pipelines, computing Shapley value is in PTIME, contrasting the exponential complexity of computing Shapley value in general. We then put this to practice -- given an sklearn pipeline, we approximate it with a canonical pipeline to use as a proxy. We conduct extensive experiments illustrating different use cases and utilities. Our results show that DataScope is up to four orders of magnitude faster over state-of-the-art Monte Carlo-based methods, while being comparably, and often even more, effective in data debugging.

Submitted 26 April, 2022; v1 submitted 23 April, 2022; originally announced April 2022.
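
To make the notion of a training example's Shapley value concrete, the following sketch computes it by brute force for a tiny 1-nearest-neighbor classifier, with validation accuracy as the utility. This is the exponential-time definition that the abstract above contrasts with, not DataScope's polynomial-time algorithm, and it involves no preprocessing pipeline. The data points are hypothetical; the third training example is deliberately mislabeled.

```python
from itertools import permutations


def one_nn_accuracy(train, validation):
    """Validation accuracy of a 1-nearest-neighbor classifier on 1-D points."""
    if not train:
        return 0.0
    correct = 0
    for x, y in validation:
        nearest = min(train, key=lambda point: abs(point[0] - x))
        correct += int(nearest[1] == y)
    return correct / len(validation)


def shapley_values(train, validation):
    """Average marginal contribution of each training example over all
    orderings. Enumerates n! permutations, so it only works for tiny n."""
    n = len(train)
    values = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        subset = []
        previous = one_nn_accuracy(subset, validation)
        for i in order:
            subset.append(train[i])
            current = one_nn_accuracy(subset, validation)
            values[i] += current - previous
            previous = current
    return [v / len(orders) for v in values]


# Hypothetical data as (feature, label) pairs; the last training example is
# mislabeled and sits closest to the second validation point.
train = [(0.0, 0), (1.0, 1), (0.9, 0)]
validation = [(0.1, 0), (0.92, 1)]

values = shapley_values(train, validation)
print(values)
```

The values sum to the utility of the full training set minus that of the empty set, and the mislabeled example receives the lowest value, which is the signal a data-debugging workflow would use to flag it.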

arXiv:2201.13313

Efficiently Maintaining Next Basket Recommendations under Additions and Deletions of Baskets and Items

Authors: Benjamin Longxiang Wang, Sebastian Schelter

Abstract: Recommender systems play an important role in helping people find information and make decisions in today's increasingly digitalized societies. However, the wide adoption of such machine learning applications also causes concerns in terms of data privacy. These concerns are addressed by the recent "General Data Protection Regulation" (GDPR) in Europe, which requires companies to delete personal user data upon request when users enforce their "right to be forgotten". Many researchers argue that this deletion obligation does not only apply to the data stored in primary data stores such as relational databases but also requires an update of machine learning models whose training set included the personal data to delete. We explore this direction in the context of a sequential recommendation task called Next Basket Recommendation (NBR), where the goal is to recommend a set of items based on a user's purchase history. We design efficient algorithms for incrementally and decrementally updating a state-of-the-art next basket recommendation model in response to additions and deletions of user baskets and items. Furthermore, we discuss an efficient, data-parallel implementation of our method in the Spark Structured Streaming system. We evaluate our implementation on a variety of real-world datasets, where we investigate the impact of our update techniques on several ranking metrics and measure the time to perform model updates. Our results show that our method provides constant update time efficiency with respect to an additional user basket in the incremental case, and linear efficiency in the decremental case where we delete existing baskets. With modest computational resources, we are able to update models with a latency of around 0.2 milliseconds regardless of the history size in the incremental case, and less than one millisecond in the decremental case.

Submitted 27 January, 2022; originally announced January 2022.

Comments: ORSUM Workshop at Recommender Systems Conference 2021, Amsterdam, the Netherlands

Report number: ORSUM/2021/05

arXiv:2112.11251

doi 10.1145/3488560.3498441

Understanding and Mitigating the Effect of Outliers in Fair Ranking

Authors: Fatemeh Sarvi, Maria Heuss, Mohammad Aliannejadi, Sebastian Schelter, Maarten de Rijke

Abstract: Traditional ranking systems are expected to sort items in the order of their relevance and thereby maximize their utility. In fair ranking, utility is complemented with fairness as an optimization goal. Recent work on fair ranking focuses on developing algorithms to optimize for fairness, given position-based exposure. In contrast, we identify the potential of outliers in a ranking to influence exposure and thereby negatively impact fairness. An outlier in a list of items can alter the examination probabilities, which can lead to different distributions of attention, compared to position-based exposure. We formalize outlierness in a ranking, show that outliers are present in realistic datasets, and present the results of an eye-tracking study, showing that users' scanning order and the exposure of items are influenced by the presence of outliers. We then introduce OMIT, a method for fair ranking in the presence of outliers. Given an outlier detection method, OMIT improves fair allocation of exposure by suppressing outliers in the top-k ranking. Using an academic search dataset, we show that outlierness optimization leads to a fairer policy that displays fewer outliers in the top-k, while maintaining a reasonable trade-off between fairness and utility.

Submitted 3 January, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: 8 pages, accepted at WSDM'22, full paper track

arXiv:2112.02905

Parameter Efficient Deep Probabilistic Forecasting

Authors: Olivier Sprangers, Sebastian Schelter, Maarten de Rijke

Abstract: Probabilistic time series forecasting is crucial in many application domains such as retail, ecommerce, finance, or biology. With the increasing availability of large volumes of data, a number of neural architectures have been proposed for this problem. In particular, Transformer-based methods achieve state-of-the-art performance on real-world benchmarks. However, these methods require a large number of parameters to be learned, which imposes high memory requirements on the computational resources for training such models. To address this problem, we introduce a novel Bidirectional Temporal Convolutional Network (BiTCN), which requires an order of magnitude fewer parameters than a common Transformer-based approach. Our model combines two Temporal Convolutional Networks (TCNs): the first network encodes future covariates of the time series, whereas the second network encodes past observations and covariates. We jointly estimate the parameters of an output distribution via these two networks. Experiments on four real-world datasets show that our method performs on par with four state-of-the-art probabilistic forecasting methods, including a Transformer-based approach and WaveNet, on two point metrics (sMAPE, NRMSE) as well as on a set of range metrics (quantile loss percentiles) in the majority of cases. Secondly, we demonstrate that our method requires significantly fewer parameters than Transformer-based methods, which means the model can be trained faster with significantly lower memory requirements, which as a consequence reduces the infrastructure cost for deploying these models.

Submitted 14 December, 2021; v1 submitted 6 December, 2021; originally announced December 2021.

Comments: Accepted as journal paper to the International Journal of Forecasting

arXiv:2106.01682

doi 10.1145/3447548.3467278

Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression

Authors: Olivier Sprangers, Sebastian Schelter, Maarten de Rijke

Abstract: Gradient Boosting Machines (GBM) are hugely popular for solving tabular data problems. However, practitioners are not only interested in point predictions, but also in probabilistic predictions in order to quantify the uncertainty of the predictions. Creating such probabilistic predictions is difficult with existing GBM-based solutions: they either require training multiple models or they become too computationally expensive to be useful for large-scale settings. We propose Probabilistic Gradient Boosting Machines (PGBM), a method to create probabilistic predictions with a single ensemble of decision trees in a computationally efficient manner. PGBM approximates the leaf weights in a decision tree as a random variable, and approximates the mean and variance of each sample in a dataset via stochastic tree ensemble update equations. These learned moments allow us to subsequently sample from a specified distribution after training. We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods: (i) PGBM enables probabilistic estimates without compromising on point performance in a single model, (ii) PGBM learns probabilistic estimates via a single model only (and without requiring multi-parameter boosting), and thereby offers a speedup of up to several orders of magnitude over existing state-of-the-art methods on large datasets, and (iii) PGBM achieves accurate probabilistic estimates in tasks with complex differentiable loss functions, such as hierarchical time series problems, where we observed up to 10% improvement in point forecasting performance and up to 300% improvement in probabilistic forecasting performance.

Submitted 6 June, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

ACM Class: I.2

arXiv:2012.08777

Analyzing and Predicting Purchase Intent in E-commerce: Anonymous vs. Identified Customers

Authors: Mariya Hendriksen, Ernst Kuiper, Pim Nauts, Sebastian Schelter, Maarten de Rijke

Abstract: The popularity of e-commerce platforms continues to grow. Being able to understand and predict customer behavior is essential for customizing the user experience through personalized result presentations, recommendations, and special offers. Previous work has considered a broad range of prediction models as well as features inferred from clickstream data to record session characteristics, and features inferred from user data to record customer characteristics. So far, most previous work in the area of purchase prediction has focused on known customers, largely ignoring anonymous sessions, i.e., sessions initiated by a non-logged-in or unrecognized customer. However, in the de-identified data from a large European e-commerce platform available to us, more than 50% of the sessions start as anonymous sessions. In this paper, we focus on purchase prediction for both anonymous and identified sessions on an e-commerce platform. We start with a descriptive analysis of purchase vs. non-purchase sessions. This analysis informs the definition of a feature-based model for purchase prediction for anonymous sessions and identified sessions; our models consider a range of session-based features for anonymous sessions, such as the channel type, the number of visited pages, and the device type. For identified user sessions, our analysis points to customer history data as a valuable discriminator between purchase and non-purchase sessions. Based on our analysis, we build two types of predictors: (1) a predictor for anonymous sessions that beats a production-ready predictor by over 17.54% F1; and (2) a predictor for identified customers that uses session data as well as customer history and achieves an F1 of 96.20%. Finally, we discuss the broader practical implications of our findings.

Submitted 16 December, 2020; originally announced December 2020.

Comments: 10 pages, accepted at SIGIR eCommerce 2020

arXiv:2007.10296

A Comparison of Supervised Learning to Match Methods for Product Search

Authors: Fatemeh Sarvi, Nikos Voskarides, Lois Mooiman, Sebastian Schelter, Maarten de Rijke

Abstract: The vocabulary gap is a core challenge in information retrieval (IR). In e-commerce applications like product search, the vocabulary gap is reported to be a bigger challenge than in more traditional application areas in IR, such as news search or web search. As recent learning to match methods have made important advances in bridging the vocabulary gap for these traditional IR areas, we investigate their potential in the context of product search. In this paper we provide insights into using recent learning to match methods for product search. We compare both effectiveness and efficiency of these methods in a product search setting and analyze their performance on two product search datasets, with 50,000 queries each. One is an open dataset made available as part of a community benchmark activity at CIKM 2016. The other is a proprietary query log obtained from a European e-commerce platform. This comparison is conducted towards a better understanding of trade-offs in choosing a preferred model for this task. We find that (1) models that have been specifically designed for short text matching, like MV-LSTM and DRMMTKS, are consistently among the top three methods in all experiments; however, taking efficiency and accuracy into account at the same time, ARC-I is the preferred model for real-world use cases; and (2) the performance from a state-of-the-art BERT-based model is mediocre, which we attribute to the fact that the text BERT is pre-trained on is very different from the text we have in product search. We also provide insights into factors that can influence model behavior for different types of query, such as the length of the retrieved list and query complexity, and discuss the implications of our findings for e-commerce practitioners with respect to choosing a well-performing method.

Submitted 20 July, 2020; originally announced July 2020.

Comments: 10 pages, 5 figures, Accepted at SIGIR Workshop on eCommerce 2020

arXiv:1911.12587

FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions

Authors: Sebastian Schelter, Yuxuan He, Jatin Khilnani, Julia Stoyanovich

Abstract: The importance of incorporating ethics and legal compliance into machine-assisted decision-making is broadly recognized. Further, several lines of recent work have argued that critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream from model training and deployment. Yet, very little has been done to date to provide system-level support to data scientists who wish to develop and deploy responsible machine learning methods. We aim to fill this gap and present FairPrep, a design and evaluation framework for fairness-enhancing interventions. FairPrep is based on a developer-centered design, and helps data scientists follow best practices in software engineering and machine learning. As part of our contribution, we identify shortcomings in existing empirical studies for analyzing fairness-enhancing interventions. We then show how FairPrep can be used to measure the impact of sound best practices, such as hyperparameter tuning and feature scaling. In particular, our results suggest that the high variability of the outcomes of fairness-enhancing interventions observed in previous studies is often an artifact of a lack of hyperparameter tuning. Further, we show that the choice of a data cleaning method can impact the effectiveness of fairness-enhancing interventions.

Submitted 28 November, 2019; originally announced November 2019.

arXiv:1707.07594

'Dark Germany': Hidden Patterns of Participation in Online Far-Right Protests Against Refugee Housing

Authors: Sebastian Schelter, Jérôme Kunegis

Abstract: The political discourse in Western European countries such as Germany has recently seen a resurgence of the topic of refugees, fueled by an influx of refugees from various Middle Eastern and African countries. Even though the topic of refugees evidently plays a large role in online and offline politics of the affected countries, the fact that protests against refugees stem from the right-wing political spectrum has led to corresponding media being shared in a decentralized fashion, making an analysis of the underlying social and mediatic networks difficult. In order to contribute to the analysis of these processes, we present a quantitative study of the social media activities of a contemporary nationwide protest movement against local refugee housing in Germany, which organizes itself via dedicated Facebook pages per city. We analyse data from 136 such protest pages in 2015, containing more than 46,000 posts and more than one million interactions by more than 200,000 users. In order to learn about the patterns of communication and interaction among users of far-right social media sites and pages, we investigate the temporal characteristics of the social media activities of this protest movement, as well as the connectedness of the interactions of its participants. We find several activity metrics such as the number of posts issued, discussion volume about crime and housing costs, negative polarity in comments, and user engagement to peak in late 2015, coinciding with Chancellor Angela Merkel's much criticized decision of September 2015 to temporarily admit the entry of Syrian refugees to Germany. Furthermore, our evidence suggests a low degree of direct connectedness of participants in this movement (i.a., indicated by a lack of geographical collaboration patterns), yet we encounter a strong affiliation of the pages' user base with far-right political parties.

Submitted 24 July, 2017; originally announced July 2017.

Comments: 12 pages, Proc. Int. Conf. on Soc. Inform., 2017

arXiv:1609.00585

Doubly stochastic large scale kernel learning with the empirical kernel map

Authors: Nikolaas Steenbergen, Sebastian Schelter, Felix Bießmann

Abstract: With the rise of big data sets, the popularity of kernel methods declined and neural networks took over again. The main problem with kernel methods is that the kernel matrix grows quadratically with the number of data points. Most attempts to scale up kernel methods solve this problem by discarding data points or basis functions of some approximation of the kernel map. Here we present a simple yet effective alternative for scaling up kernel methods that takes into account the entire data set via doubly stochastic optimization of the empirical kernel map. The algorithm is straightforward to implement, in particular in parallel execution settings; it leverages the full power and versatility of classical kernel functions without the need to explicitly formulate a kernel map approximation. We provide empirical evidence that the algorithm works on large data sets.

Submitted 14 September, 2016; v1 submitted 2 September, 2016; originally announced September 2016.

arXiv:1607.07403 [pdf, other]

On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl

Authors: Sebastian Schelter, Jérôme Kunegis

Abstract: We perform a large-scale analysis of third-party trackers on the World Wide Web from more than 3.5 billion web pages of the CommonCrawl 2012 corpus. We extract a dataset containing more than 140 million third-party embeddings in over 41 million domains. To the best of our knowledge, this constitutes the largest web tracking dataset collected so far, and exceeds related studies by more than an order of magnitude in the number of domains and web pages analyzed. We perform a large-scale study of online tracking, on three levels: (1) On a global level, we give a precise figure for the extent of tracking, give insights into the structure of the 'online tracking sphere' and analyse which trackers are used by how many websites. (2) On a country-specific level, we analyse which trackers are used by websites in different countries, and identify the countries in which websites choose significantly different trackers than in the rest of the world. (3) We answer the question whether the content of websites influences the choice of trackers they use, leveraging more than 90 thousand categorized domains. In particular, we analyse whether highly privacy-critical websites make different choices of trackers than other websites. Based on the performed analyses, we confirm that trackers are widespread (as expected), and that a small number of trackers dominates the web (Google, Facebook and Twitter). In particular, the three tracking domains with the highest PageRank are all owned by Google. The only exceptions to this pattern are a few countries such as China and Russia. Our results suggest that this dominance is strongly associated with country-specific political factors such as freedom of the press. We also confirm that websites with highly privacy-critical content are less likely to contain trackers (60% vs 90% for other websites), even though the majority of them still do contain trackers.

Submitted 29 July, 2016; v1 submitted 25 July, 2016; originally announced July 2016.

arXiv:1411.0602 [pdf, other]

Factorbird - a Parameter Server Approach to Distributed Matrix Factorization

Authors: Sebastian Schelter, Venu Satuluri, Reza Zadeh

Abstract: We present Factorbird, a prototype of a parameter server approach for factorizing large matrices with Stochastic Gradient Descent-based algorithms. We designed Factorbird to meet the following desiderata: (a) scalability to tall and wide matrices with dozens of billions of non-zeros, (b) extensibility to different kinds of models and loss functions as long as they can be optimized using Stochastic Gradient Descent (SGD), and (c) adaptability to both batch and streaming scenarios. Factorbird uses a parameter server in order to scale to models that exceed the memory of an individual machine, and employs lock-free Hogwild!-style learning with a special partitioning scheme to drastically reduce conflicting updates. We also discuss other aspects of the design of our system such as how to efficiently grid search for hyperparameters at scale. We present experiments of Factorbird on a matrix built from a subset of Twitter's interaction graph, consisting of more than 38 billion non-zeros and about 200 million rows and columns, which is to the best of our knowledge the largest matrix on which factorization results have been reported in the literature.

Submitted 3 November, 2014; originally announced November 2014.

Comments: 10 pages. Submitted to the NIPS 2014 Workshop on Distributed Matrix Computations

Showing 1–16 of 16 results for author: Schelter, S