Showing 1–16 of 16 results for author: Schelter, S

  1. arXiv:2404.19591

    cs.DB cs.LG cs.SE

    Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

    Authors: Stefan Grafberger, Paul Groth, Sebastian Schelter

    Abstract: Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improve…

    Submitted 30 April, 2024; originally announced April 2024.

    ACM Class: H.2; H.2.8; H.4; D.2.6; I.2

  2. arXiv:2310.12809

    cs.LG

    Hierarchical Forecasting at Scale

    Authors: Olivier Sprangers, Wander Wadman, Sebastian Schelter, Maarten de Rijke

    Abstract: Existing hierarchical forecasting techniques scale poorly when the number of time series increases. We propose to learn a coherent forecast for millions of time series with a single bottom-level forecast model by using a sparse loss function that directly optimizes the hierarchical product and/or temporal structure. The benefit of our sparse hierarchical loss function is that it provides practitio…

    Submitted 26 February, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

  3. arXiv:2307.03027

    cs.LG cs.CL cs.IR

    Improving Retrieval-Augmented Large Language Models via Data Importance Learning

    Authors: Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao, Sebastian Schelter, Ce Zhang

    Abstract: Retrieval augmentation enables large language models to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of ret…

    Submitted 6 July, 2023; originally announced July 2023.

  4. On the Impact of Outlier Bias on User Clicks

    Authors: Fatemeh Sarvi, Ali Vardasbi, Mohammad Aliannejadi, Sebastian Schelter, Maarten de Rijke

    Abstract: User interaction data is an important source of supervision in counterfactual learning to rank (CLTR). Such data suffers from presentation bias. Much work in unbiased learning to rank (ULTR) focuses on position bias, i.e., items at higher ranks are more likely to be examined and clicked. Inter-item dependencies also influence examination probabilities, with outlier items in a ranking as an importa…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted at SIGIR'23, Full Paper Track

  5. arXiv:2204.11131

    cs.LG cs.AI cs.DB

    Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines

    Authors: Bojan Karlaš, David Dao, Matteo Interlandi, Bo Li, Sebastian Schelter, Wentao Wu, Ce Zhang

    Abstract: Developing modern machine learning (ML) applications is data-centric, of which one fundamental challenge is to understand the influence of data quality to ML training -- "Which training examples are 'guilty' in making the trained ML model predictions inaccurate or unfair?" Modeling data influence for ML training has attracted intensive interest over the last decade, and one popular framework is to…

    Submitted 26 April, 2022; v1 submitted 23 April, 2022; originally announced April 2022.

  6. arXiv:2201.13313

    cs.IR cs.AI

    Efficiently Maintaining Next Basket Recommendations under Additions and Deletions of Baskets and Items

    Authors: Benjamin Longxiang Wang, Sebastian Schelter

    Abstract: Recommender systems play an important role in helping people find information and make decisions in today's increasingly digitalized societies. However, the wide adoption of such machine learning applications also causes concerns in terms of data privacy. These concerns are addressed by the recent "General Data Protection Regulation" (GDPR) in Europe, which requires companies to delete personal us…

    Submitted 27 January, 2022; originally announced January 2022.

    Comments: ORSUM Workshop at Recommender Systems Conference 2021, Amsterdam, the Netherlands

    Report number: ORSUM/2021/05

  7. Understanding and Mitigating the Effect of Outliers in Fair Ranking

    Authors: Fatemeh Sarvi, Maria Heuss, Mohammad Aliannejadi, Sebastian Schelter, Maarten de Rijke

    Abstract: Traditional ranking systems are expected to sort items in the order of their relevance and thereby maximize their utility. In fair ranking, utility is complemented with fairness as an optimization goal. Recent work on fair ranking focuses on developing algorithms to optimize for fairness, given position-based exposure. In contrast, we identify the potential of outliers in a ranking to influence ex…

    Submitted 3 January, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: 8 pages, accepted at WSDM'22, full paper track

  8. arXiv:2112.02905

    cs.LG

    Parameter Efficient Deep Probabilistic Forecasting

    Authors: Olivier Sprangers, Sebastian Schelter, Maarten de Rijke

    Abstract: Probabilistic time series forecasting is crucial in many application domains such as retail, ecommerce, finance, or biology. With the increasing availability of large volumes of data, a number of neural architectures have been proposed for this problem. In particular, Transformer-based methods achieve state-of-the-art performance on real-world benchmarks. However, these methods require a large num…

    Submitted 14 December, 2021; v1 submitted 6 December, 2021; originally announced December 2021.

    Comments: Accepted as journal paper to the International Journal of Forecasting

  9. Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression

    Authors: Olivier Sprangers, Sebastian Schelter, Maarten de Rijke

    Abstract: Gradient Boosting Machines (GBM) are hugely popular for solving tabular data problems. However, practitioners are not only interested in point predictions, but also in probabilistic predictions in order to quantify the uncertainty of the predictions. Creating such probabilistic predictions is difficult with existing GBM-based solutions: they either require training multiple models or they become t…

    Submitted 6 June, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

    ACM Class: I.2

  10. arXiv:2012.08777

    cs.IR

    Analyzing and Predicting Purchase Intent in E-commerce: Anonymous vs. Identified Customers

    Authors: Mariya Hendriksen, Ernst Kuiper, Pim Nauts, Sebastian Schelter, Maarten de Rijke

    Abstract: The popularity of e-commerce platforms continues to grow. Being able to understand and predict customer behavior is essential for customizing the user experience through personalized result presentations, recommendations, and special offers. Previous work has considered a broad range of prediction models as well as features inferred from clickstream data to record session characteristics, and fea…

    Submitted 16 December, 2020; originally announced December 2020.

    Comments: 10 pages, accepted at SIGIR eCommerce 2020

  11. arXiv:2007.10296

    cs.IR

    A Comparison of Supervised Learning to Match Methods for Product Search

    Authors: Fatemeh Sarvi, Nikos Voskarides, Lois Mooiman, Sebastian Schelter, Maarten de Rijke

    Abstract: The vocabulary gap is a core challenge in information retrieval (IR). In e-commerce applications like product search, the vocabulary gap is reported to be a bigger challenge than in more traditional application areas in IR, such as news search or web search. As recent learning to match methods have made important advances in bridging the vocabulary gap for these traditional IR areas, we investigat…

    Submitted 20 July, 2020; originally announced July 2020.

    Comments: 10 pages, 5 figures, Accepted at SIGIR Workshop on eCommerce 2020

  12. arXiv:1911.12587

    cs.LG cs.CY cs.DB stat.ML

    FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions

    Authors: Sebastian Schelter, Yuxuan He, Jatin Khilnani, Julia Stoyanovich

    Abstract: The importance of incorporating ethics and legal compliance into machine-assisted decision-making is broadly recognized. Further, several lines of recent work have argued that critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream…

    Submitted 28 November, 2019; originally announced November 2019.

  13. arXiv:1707.07594

    cs.SI

    'Dark Germany': Hidden Patterns of Participation in Online Far-Right Protests Against Refugee Housing

    Authors: Sebastian Schelter, Jérôme Kunegis

    Abstract: The political discourse in Western European countries such as Germany has recently seen a resurgence of the topic of refugees, fueled by an influx of refugees from various Middle Eastern and African countries. Even though the topic of refugees evidently plays a large role in online and offline politics of the affected countries, the fact that protests against refugees stem from the right-wing pol…

    Submitted 24 July, 2017; originally announced July 2017.

    Comments: 12 pages, Proc. Int. Conf. on Soc. Inform., 2017

  14. arXiv:1609.00585

    cs.LG

    Doubly stochastic large scale kernel learning with the empirical kernel map

    Authors: Nikolaas Steenbergen, Sebastian Schelter, Felix Bießmann

    Abstract: With the rise of big data sets, the popularity of kernel methods declined and neural networks took over again. The main problem with kernel methods is that the kernel matrix grows quadratically with the number of data points. Most attempts to scale up kernel methods solve this problem by discarding data points or basis functions of some approximation of the kernel map. Here we present a simple yet…

    Submitted 14 September, 2016; v1 submitted 2 September, 2016; originally announced September 2016.

  15. arXiv:1607.07403

    cs.SI

    On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl

    Authors: Sebastian Schelter, Jérôme Kunegis

    Abstract: We perform a large-scale analysis of third-party trackers on the World Wide Web from more than 3.5 billion web pages of the CommonCrawl 2012 corpus. We extract a dataset containing more than 140 million third-party embeddings in over 41 million domains. To the best of our knowledge, this constitutes the largest web tracking dataset collected so far, and exceeds related studies by more than an orde…

    Submitted 29 July, 2016; v1 submitted 25 July, 2016; originally announced July 2016.

  16. arXiv:1411.0602

    cs.LG

    Factorbird - a Parameter Server Approach to Distributed Matrix Factorization

    Authors: Sebastian Schelter, Venu Satuluri, Reza Zadeh

    Abstract: We present Factorbird, a prototype of a parameter server approach for factorizing large matrices with Stochastic Gradient Descent-based algorithms. We designed Factorbird to meet the following desiderata: (a) scalability to tall and wide matrices with dozens of billions of non-zeros, (b) extensibility to different kinds of models and loss functions as long as they can be optimized using Stochastic…

    Submitted 3 November, 2014; originally announced November 2014.

    Comments: 10 pages. Submitted to the NIPS 2014 Workshop on Distributed Matrix Computations