Showing 1–25 of 25 results for author: Samulowitz, H

  1. arXiv:2407.01619  [pdf, other]

    cs.LG cs.AI cs.DB

    TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes

    Authors: Aamod Khatiwada, Harsha Kokel, Ibrahim Abdelaziz, Subhajit Chaudhury, Julian Dolby, Oktie Hassanzadeh, Zhenhan Huang, Tejaswini Pedapati, Horst Samulowitz, Kavitha Srinivas

    Abstract: Enterprises have a growing need to identify relevant tables in data lakes; e.g. tables that are unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such data discovery tasks. In this paper, we present TabSketchFM, a neural tabular model for data discovery over data lakes. First, we propose a novel pre-training sketch-based approach to enhance the effectiveness o…

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2307.04217

  2. arXiv:2406.13720  [pdf, other]

    cs.CL cs.LG

    On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot Problems

    Authors: Md Ibrahim Ibne Alam, Parikshit Ram, Soham Dan, Horst Samulowitz, Koushik Kar

    Abstract: Large Language Models (LLMs) have been observed to perform well on a wide range of downstream tasks when fine-tuned on domain-specific data. However, such data may not be readily available in many applications, motivating zero-shot or few-shot approaches using domain-adjacent models. While several fine-tuned models for various tasks are available, finding an appropriate domain-adjacent model for a…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Main paper is 8 pages, followed by limitations, references and appendix

  3. arXiv:2402.04874  [pdf, other]

    cs.AI cs.LG

    Choosing a Classical Planner with Graph Neural Networks

    Authors: Jana Vatter, Ruben Mayer, Hans-Arno Jacobsen, Horst Samulowitz, Michael Katz

    Abstract: Online planner selection is the task of choosing a solver out of a predefined set for a given planning problem. As planning is computationally hard, the performance of solvers varies greatly on planning problems. Thus, the ability to predict their performance on a given problem is of great importance. While a variety of learning methods have been employed, for classical cost-optimal planning the p…

    Submitted 25 January, 2024; originally announced February 2024.

  4. arXiv:2401.12406  [pdf, other]

    cs.CL cs.AI cs.LG

    Enhancing In-context Learning via Linear Probe Calibration

    Authors: Momin Abbas, Yi Zhou, Parikshit Ram, Nathalie Baracaldo, Horst Samulowitz, Theodoros Salonidis, Tianyi Chen

    Abstract: In-context learning (ICL) is a new paradigm for natural language processing that utilizes Generative Pre-trained Transformer (GPT)-like models. This approach uses prompts that include in-context demonstrations to generate the corresponding output for a new query input. However, applying ICL in real cases does not scale with the number of samples, and lacks robustness to different prompt templates…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted at AISTATS2024

  5. arXiv:2309.11506  [pdf, other]

    cs.IR cs.AI cs.CL

    Matching Table Metadata with Business Glossaries Using Large Language Models

    Authors: Elita Lobo, Oktie Hassanzadeh, Nhan Pham, Nandana Mihindukulasooriya, Dharmashankar Subramanian, Horst Samulowitz

    Abstract: Enterprises often own large collections of structured data in the form of large databases or an enterprise data lake. Such data collections come with limited metadata and strict access policies that could limit access to the data contents and, therefore, limit the application of classic retrieval and analysis solutions. As a result, there is a need for solutions that can effectively utilize the av…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: This paper is a work in progress with findings based on limited evidence. Please exercise discretion when interpreting the findings

  6. arXiv:2307.04217  [pdf, other]

    cs.DB cs.AI

    LakeBench: Benchmarks for Data Discovery over Data Lakes

    Authors: Kavitha Srinivas, Julian Dolby, Ibrahim Abdelaziz, Oktie Hassanzadeh, Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Subhajit Chaudhury, Horst Samulowitz

    Abstract: Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private data…

    Submitted 9 July, 2023; originally announced July 2023.

  7. arXiv:2303.01378  [pdf, other]

    cs.AI cs.DB cs.LG

    A Vision for Semantically Enriched Data Science

    Authors: Udayan Khurana, Kavitha Srinivas, Sainyam Galhotra, Horst Samulowitz

    Abstract: The recent efforts in automation of machine learning or data science have achieved success in various tasks such as hyper-parameter optimization or model selection. However, key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation. Data Scientists have long leveraged common sense reasoning and domain knowledge to understand and enrich data for b…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2205.08018

  8. arXiv:2301.05131  [pdf, other]

    cs.LG

    Toward Theoretical Guidance for Two Common Questions in Practical Cross-Validation based Hyperparameter Selection

    Authors: Parikshit Ram, Alexander G. Gray, Horst C. Samulowitz, Gregory Bramble

    Abstract: We show, to our knowledge, the first theoretical treatments of two common questions in cross-validation based hyperparameter selection: (1) After selecting the best hyperparameter using a held-out set, we train the final model using all of the training data -- since this may or may not improve future generalization error, should one do this? (2) During optimization such as via SGD (stochasti…

    Submitted 12 January, 2023; originally announced January 2023.

    Comments: Extended version of the paper appearing at the SIAM International Conference on Data Mining 2023 (SDM23)

  9. arXiv:2205.08018  [pdf, other]

    cs.AI

    A Survey on Semantics in Automated Data Science

    Authors: Udayan Khurana, Kavitha Srinivas, Horst Samulowitz

    Abstract: Data Scientists leverage common sense reasoning and domain knowledge to understand and enrich data for building predictive models. In recent years, we have witnessed a surge in tools and techniques for automated machine learning. While data scientists can employ various such tools to help with model building, many other aspects such as feature engineering that require semantic understa…

    Submitted 16 May, 2022; originally announced May 2022.

  10. arXiv:2202.08338  [pdf, other]

    cs.LG cs.DC

    Single-shot Hyper-parameter Optimization for Federated Learning: A General Algorithm & Analysis

    Authors: Yi Zhou, Parikshit Ram, Theodoros Salonidis, Nathalie Baracaldo, Horst Samulowitz, Heiko Ludwig

    Abstract: We address the relatively unexplored problem of hyper-parameter optimization (HPO) for federated learning (FL-HPO). We introduce Federated Loss SuRface Aggregation (FLoRA), a general FL-HPO solution framework that can address use cases of tabular data and any Machine Learning (ML) model including gradient boosting training algorithms and therefore further expands the scope of FL-HPO. FLoRA enables…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2112.08524

  11. arXiv:2112.08524  [pdf, ps, other]

    cs.LG cs.DC

    FLoRA: Single-shot Hyper-parameter Optimization for Federated Learning

    Authors: Yi Zhou, Parikshit Ram, Theodoros Salonidis, Nathalie Baracaldo, Horst Samulowitz, Heiko Ludwig

    Abstract: We address the relatively unexplored problem of hyper-parameter optimization (HPO) for federated learning (FL-HPO). We introduce Federated Loss suRface Aggregation (FLoRA), the first FL-HPO solution framework that can address use cases of tabular data and gradient boosting training algorithms in addition to stochastic gradient descent/neural networks commonly addressed in the FL literature. The fr…

    Submitted 15 December, 2021; originally announced December 2021.

  12. arXiv:2102.12347  [pdf, other]

    cs.LG cs.AI

    AutoAI-TS: AutoAI for Time Series Forecasting

    Authors: Syed Yousaf Shah, Dhaval Patel, Long Vu, Xuan-Hong Dang, Bei Chen, Peter Kirchner, Horst Samulowitz, David Wood, Gregory Bramble, Wesley M. Gifford, Giridhar Ganapavarapu, Roman Vaculin, Petros Zerfos

    Abstract: A large number of time series forecasting models including traditional statistical models, machine learning models and more recently deep learning have been proposed in the literature. However, choosing the right model along with good parameter values that performs well on a given data is still challenging. Automatically providing a good set of models to users for a given dataset saves both time a…

    Submitted 8 March, 2021; v1 submitted 24 February, 2021; originally announced February 2021.

    Comments: Accepted for publication at ACM SIGMOD 2021 Industry Track

  13. arXiv:2101.03970  [pdf, other]

    cs.LG cs.HC

    How Much Automation Does a Data Scientist Want?

    Authors: Dakuo Wang, Q. Vera Liao, Yunfeng Zhang, Udayan Khurana, Horst Samulowitz, Soya Park, Michael Muller, Lisa Amini

    Abstract: Data science and machine learning (DS/ML) are at the heart of the recent advancements of many Artificial Intelligence (AI) applications. There is an active research thread in AI, AutoAI, that aims to develop systems for automating end-to-end the DS/ML Lifecycle. However, do DS and ML workers really want to automate their DS/ML workflow? To answer this question, we first synthesize a human-centere…

    Submitted 6 January, 2021; originally announced January 2021.

  14. arXiv:2006.09635  [pdf, other]

    cs.LG math.OC stat.ML

    Solving Constrained CASH Problems with ADMM

    Authors: Parikshit Ram, Sijia Liu, Deepak Vijaykeerthi, Dakuo Wang, Djallel Bouneffouf, Greg Bramble, Horst Samulowitz, Alexander G. Gray

    Abstract: The CASH problem has been widely studied in the context of automated configurations of machine learning (ML) pipelines and various solvers and toolkits are available. However, CASH solvers do not directly handle black-box constraints such as fairness, robustness or other domain-specific custom constraints. We present our recent approach [Liu, et al., 2020] that leverages the ADMM optimization fram…

    Submitted 10 July, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: 7th ICML Workshop on Automated Machine Learning (2020)

  15. arXiv:1910.14436  [pdf, other]

    cs.AI cs.LG

    How can AI Automate End-to-End Data Science?

    Authors: Charu Aggarwal, Djallel Bouneffouf, Horst Samulowitz, Beat Buesser, Thanh Hoang, Udayan Khurana, Sijia Liu, Tejaswini Pedapati, Parikshit Ram, Ambrish Rawat, Martin Wistuba, Alexander Gray

    Abstract: Data science is labor-intensive and human experts are scarce but heavily involved in every aspect of it. This makes data science time consuming and restricted to experts with the resulting quality heavily dependent on their experience and skills. To make data science more accessible and scalable, we need its democratization. Automated Data Science (AutoDS) is aimed towards that goal and is emergin…

    Submitted 22 October, 2019; originally announced October 2019.

  16. arXiv:1909.02309  [pdf, other]

    cs.HC cs.AI cs.LG

    Human-AI Collaboration in Data Science: Exploring Data Scientists' Perceptions of Automated AI

    Authors: Dakuo Wang, Justin D. Weisz, Michael Muller, Parikshit Ram, Werner Geyer, Casey Dugan, Yla Tausczik, Horst Samulowitz, Alexander Gray

    Abstract: The rapid advancement of artificial intelligence (AI) is changing our lives in many ways. One application domain is data science. New techniques in automating the creation of AI, known as AutoAI or AutoML, aim to automate the work practices of data scientists. AutoAI systems are capable of autonomously ingesting and pre-processing data, engineering new features, and creating and scoring models bas…

    Submitted 5 September, 2019; originally announced September 2019.

  17. arXiv:1906.03979  [pdf, other]

    cs.LG stat.ML

    Optimal Exploitation of Clustering and History Information in Multi-Armed Bandit

    Authors: Djallel Bouneffouf, Srinivasan Parthasarathy, Horst Samulowitz, Martin Wistuba

    Abstract: We consider the stochastic multi-armed bandit problem and the contextual bandit problem with historical observations and pre-clustered arms. The historical observations can contain any number of instances for each arm, and the pre-clustering information is a fixed clustering of arms provided as part of the input. We develop a variety of algorithms which incorporate this offline information effecti…

    Submitted 31 May, 2019; originally announced June 2019.

    Comments: IJCAI 2019, International Joint Conferences on Artificial Intelligence

  18. arXiv:1905.00424  [pdf, other]

    cs.LG stat.ML

    An ADMM Based Framework for AutoML Pipeline Configuration

    Authors: Sijia Liu, Parikshit Ram, Deepak Vijaykeerthy, Djallel Bouneffouf, Gregory Bramble, Horst Samulowitz, Dakuo Wang, Andrew Conn, Alexander Gray

    Abstract: We study the AutoML problem of automatically configuring machine learning pipelines by jointly selecting algorithms and their appropriate hyper-parameters for all steps in supervised learning pipelines. This black-box (gradient-free) optimization with mixed integer & continuous variables is a challenging problem. We propose a novel AutoML scheme by leveraging the alternating direction method of mu…

    Submitted 6 December, 2019; v1 submitted 1 May, 2019; originally announced May 2019.

    Journal ref: published at AAAI 2020

  19. arXiv:1903.00743  [pdf, other]

    cs.LG cs.AI stat.ML

    Automating Predictive Modeling Process using Reinforcement Learning

    Authors: Udayan Khurana, Horst Samulowitz

    Abstract: Building a good predictive model requires an array of activities such as data imputation, feature transformations, estimator selection, hyper-parameter search and ensemble construction. Given the large, complex and heterogenous space of options, off-the-shelf optimization methods are infeasible for realistic response times. In practice, much of the predictive modeling process is conducted by exper…

    Submitted 2 March, 2019; originally announced March 2019.

  20. arXiv:1901.06261  [pdf, other]

    cs.LG cs.SE stat.ML

    NeuNetS: An Automated Synthesis Engine for Neural Network Design

    Authors: Atin Sood, Benjamin Elder, Benjamin Herta, Chao Xue, Costas Bekas, A. Cristiano I. Malossi, Debashish Saha, Florian Scheidegger, Ganesh Venkataraman, Gegi Thomas, Giovanni Mariani, Hendrik Strobelt, Horst Samulowitz, Martin Wistuba, Matteo Manica, Mihir Choudhury, Rong Yan, Roxana Istrate, Ruchir Puri, Tejaswini Pedapati

    Abstract: Application of neural networks to a vast variety of practical applications is transforming the way AI is applied in practice. Pre-trained neural network models available through APIs or capability to custom train pre-built neural network architectures with customer data has made the consumption of AI by developers much simpler and resulted in broad adoption of these complex AI models. While prebui…

    Submitted 16 January, 2019; originally announced January 2019.

    Comments: 14 pages, 12 figures. arXiv admin note: text overlap with arXiv:1806.00250

  21. arXiv:1711.06195  [pdf, other]

    stat.ML cs.LG

    Neurology-as-a-Service for the Developing World

    Authors: Tejas Dharamsi, Payel Das, Tejaswini Pedapati, Gregory Bramble, Vinod Muthusamy, Horst Samulowitz, Kush R. Varshney, Yuvaraj Rajamanickam, John Thomas, Justin Dauwels

    Abstract: Electroencephalography (EEG) is an extensively-used and well-studied technique in the field of medical diagnostics and treatment for brain disorders, including epilepsy, migraines, and tumors. The analysis and interpretation of EEGs require physicians to have specialized training, which is not common even among most doctors in the developed world, let alone the developing world where physician sho…

    Submitted 21 November, 2017; v1 submitted 16 November, 2017; originally announced November 2017.

    Comments: Presented at NIPS 2017 Workshop on Machine Learning for the Developing World

  22. arXiv:1709.07150  [pdf, other]

    cs.AI cs.LG stat.ML

    Feature Engineering for Predictive Modeling using Reinforcement Learning

    Authors: Udayan Khurana, Horst Samulowitz, Deepak Turaga

    Abstract: Feature engineering is a crucial step in the process of predictive modeling. It involves the transformation of given feature space, typically using mathematical functions, with the objective of reducing the modeling error for a given target. However, there is no well-defined basis for performing effective feature engineering. It involves domain knowledge, intuition, and most of all, a lengthy proc…

    Submitted 21 September, 2017; originally announced September 2017.

  23. arXiv:1705.08520  [pdf]

    cs.AI cs.LG cs.NE

    An effective algorithm for hyperparameter optimization of neural networks

    Authors: Gonzalo Diaz, Achille Fokoue, Giacomo Nannicini, Horst Samulowitz

    Abstract: A major challenge in designing neural network (NN) systems is to determine the best structure and parameters for the network given the data for the machine learning problem at hand. Examples of parameters are the number of layers and nodes, the learning rates, and the dropout rates. Typically, these parameters are chosen based on heuristic rules and manually fine-tuned, which may be very time-cons…

    Submitted 23 May, 2017; originally announced May 2017.

  24. arXiv:1601.00024  [pdf, other]

    cs.LG stat.ML

    Selecting Near-Optimal Learners via Incremental Data Allocation

    Authors: Ashish Sabharwal, Horst Samulowitz, Gerald Tesauro

    Abstract: We study a novel machine learning (ML) problem setting of sequentially allocating small subsets of training data amongst a large set of classifiers. The goal is to select a classifier that will give near-optimal accuracy when trained on all data, while also minimizing the cost of misallocated samples. This is motivated by large modern datasets and ML toolkits with many combinations of learning alg…

    Submitted 31 December, 2015; originally announced January 2016.

    Comments: AAAI-2016: The Thirtieth AAAI Conference on Artificial Intelligence

  25. arXiv:1203.1095  [pdf, other]

    cs.AI

    Search Combinators

    Authors: Tom Schrijvers, Guido Tack, Pieter Wuille, Horst Samulowitz, Peter J. Stuckey

    Abstract: The ability to model search in a constraint solver can be an essential asset for solving combinatorial problems. However, existing infrastructure for defining search heuristics is often inadequate. Either modeling capabilities are extremely limited or users are faced with a general-purpose programming language whose features are not tailored towards writing search heuristics. As a result, major im…

    Submitted 5 March, 2012; originally announced March 2012.