Showing 1–16 of 16 results for author: Dolby, J

  1. arXiv:2407.01619

    cs.LG cs.AI cs.DB

    TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes

    Authors: Aamod Khatiwada, Harsha Kokel, Ibrahim Abdelaziz, Subhajit Chaudhury, Julian Dolby, Oktie Hassanzadeh, Zhenhan Huang, Tejaswini Pedapati, Horst Samulowitz, Kavitha Srinivas

    Abstract: Enterprises have a growing need to identify relevant tables in data lakes; e.g. tables that are unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such data discovery tasks. In this paper, we present TabSketchFM, a neural tabular model for data discovery over data lakes. First, we propose a novel pre-training sketch-based approach to enhance the effectiveness o…

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2307.04217

  2. arXiv:2406.10320

    cs.SE cs.AI

    Out of style: Misadventures with LLMs and code style transfer

    Authors: Karl Munson, Chih-Kai Ting, Serenity Wade, Anish Savla, Julian Dolby, Kiran Kate, Kavitha Srinivas

    Abstract: Like text, programs have styles, and certain programming styles are more desirable than others for program readability, maintainability, and performance. Code style transfer, however, is difficult to automate except for trivial style guidelines such as limits on line length. Inspired by the success of using language models for text style transfer, we investigate if code language models can perform…

    Submitted 14 June, 2024; originally announced June 2024.

  3. arXiv:2307.04217

    cs.DB cs.AI

    LakeBench: Benchmarks for Data Discovery over Data Lakes

    Authors: Kavitha Srinivas, Julian Dolby, Ibrahim Abdelaziz, Oktie Hassanzadeh, Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Subhajit Chaudhury, Horst Samulowitz

    Abstract: Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private data…

    Submitted 9 July, 2023; originally announced July 2023.

  4. arXiv:2301.05108

    cs.PL cs.AI

    Serenity: Library Based Python Code Analysis for Code Completion and Automated Machine Learning

    Authors: Wenting Zhao, Ibrahim Abdelaziz, Julian Dolby, Kavitha Srinivas, Mossad Helali, Essam Mansour

    Abstract: Dynamically typed languages such as Python have become very popular. Among other strengths, Python's dynamic nature and its straightforward linking to native code have made it the de-facto language for many research areas such as Artificial Intelligence. This flexibility, however, makes static analysis very hard. While creating a sound, or a soundy, analysis for Python remains an open problem, we…

    Submitted 4 January, 2023; originally announced January 2023.

  5. Automatically Debugging AutoML Pipelines using Maro: ML Automated Remediation Oracle (Extended Version)

    Authors: Julian Dolby, Jason Tsay, Martin Hirzel

    Abstract: Machine learning in practice often involves complex pipelines for data cleansing, feature engineering, preprocessing, and prediction. These pipelines are composed of operators, which have to be correctly connected and whose hyperparameters must be correctly configured. Unfortunately, it is quite common for certain combinations of datasets, operators, or hyperparameters to cause failures. Diagnosin…

    Submitted 5 May, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

    Comments: Extended version of MAPS 2022 paper

    Journal ref: Symposium on Machine Programming (MAPS), pages 60-69, June 2022

  6. arXiv:2201.12242

    cs.PL

    Large Scale Generation of Labeled Type Data for Python

    Authors: Ibrahim Abdelaziz, Julian Dolby, Kavitha Srinivas

    Abstract: Recently, dynamically typed languages, such as Python, have gained unprecedented popularity. Although these languages alleviate the need for mandatory type annotations, types still play a critical role in program understanding and preventing runtime errors. An attractive option is to infer types automatically to get static guarantees without writing types. Existing inference techniques rely mostly…

    Submitted 6 February, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

  7. arXiv:2111.00083

    cs.LG

    A Scalable AutoML Approach Based on Graph Neural Networks

    Authors: Mossad Helali, Essam Mansour, Ibrahim Abdelaziz, Julian Dolby, Kavitha Srinivas

    Abstract: AutoML systems build machine learning models automatically by performing a search over valid data transformations and learners, along with hyper-parameter optimization for each learner. Many AutoML systems use meta-learning to guide search for optimal pipelines. In this work, we present a novel meta-learning system called KGpip which, (1) builds a database of datasets and corresponding pipelines b…

    Submitted 14 July, 2022; v1 submitted 29 October, 2021; originally announced November 2021.

    Comments: 14 pages, 9 figures. Accepted in VLDB22

  8. arXiv:2109.07452

    cs.CL cs.AI

    Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding

    Authors: Ibrahim Abdelaziz, Julian Dolby, Jamie McCusker, Kavitha Srinivas

    Abstract: Code understanding is an increasingly important application of Artificial Intelligence. A fundamental aspect of understanding code is understanding text about code, e.g., documentation and forum discussions. Pre-trained language models (e.g., BERT) are a popular approach for various NLP tasks, and there are now a variety of benchmarks, such as GLUE, to help improve the development of such models f…

    Submitted 15 September, 2021; originally announced September 2021.

  9. arXiv:2105.12655

    cs.SE cs.AI

    CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

    Authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

    Abstract: Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs,…

    Submitted 29 August, 2021; v1 submitted 24 May, 2021; originally announced May 2021.

    Comments: 22 pages including references

  10. arXiv:2002.09440

    cs.DB cs.AI

    A Toolkit for Generating Code Knowledge Graphs

    Authors: Ibrahim Abdelaziz, Julian Dolby, Jamie McCusker, Kavitha Srinivas

    Abstract: Knowledge graphs have been proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this paper, we present GraphGen4Code, a toolkit to build code knowledge graphs that can similarly power various applications such as program search, code understanding, bug detection, and code automation. GraphGen4Code uses generic techniques to capture code…

    Submitted 27 September, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

  11. arXiv:1809.01604

    cs.LG cs.AI stat.ML

    Merging datasets through deep learning

    Authors: Kavitha Srinivas, Abraham Gale, Julian Dolby

    Abstract: Merging datasets is a key operation for data analytics. A frequent requirement for merging is joining across columns that have different surface forms for the same entity (e.g., the name of a person might be represented as "Douglas Adams" or "Adams, Douglas"). Similarly, ontology alignment can require recognizing distinct surface forms of the same entity, especially when ontologies are independent…

    Submitted 5 September, 2018; originally announced September 2018.

  12. Ariadne: Analysis for Machine Learning Programs

    Authors: Julian Dolby, Avraham Shinnar, Allison Allain, Jenna Reinen

    Abstract: Machine learning has transformed domains like vision and translation, and is now increasingly used in science, where the correctness of such code is vital. Python is popular for machine learning, in part because of its wealth of machine learning libraries, and is felt to make development faster; however, this dynamic language has less support for error detection at code creation time than tools li…

    Submitted 10 May, 2018; originally announced May 2018.

  13. arXiv:1801.08928

    cs.SE

    Automatically Extracting Web API Specifications from HTML Documentation

    Authors: Jinqiu Yang, Erik Wittern, Annie T. T. Ying, Julian Dolby, Lin Tan

    Abstract: Web API specifications are machine-readable descriptions of APIs. These specifications, in combination with related tooling, simplify and support the consumption of APIs. However, despite the increased distribution of web APIs, specifications are rare and their creation and maintenance heavily relies on manual efforts by third parties. In this paper, we propose an automatic approach and an associa…

    Submitted 26 January, 2018; originally announced January 2018.

  14. arXiv:1705.06629

    cs.SE

    Who you gonna call? Analyzing Web Requests in Android Applications

    Authors: Marianna Rapoport, Philippe Suter, Erik Wittern, Ondřej Lhoták, Julian Dolby

    Abstract: Relying on ubiquitous Internet connectivity, applications on mobile devices frequently perform web requests during their execution. They fetch data for users to interact with, invoke remote functionalities, or send user-generated content or meta-data. These requests collectively reveal common practices of mobile application development, like what external services are used and how, and they point…

    Submitted 18 May, 2017; v1 submitted 18 May, 2017; originally announced May 2017.

  15. arXiv:1705.06586

    cs.SE

    Opportunities in Software Engineering Research for Web API Consumption

    Authors: Erik Wittern, Annie Ying, Yunhui Zheng, Jim A. Laredo, Julian Dolby, Christopher C. Young, Aleksander A. Slominski

    Abstract: Nowadays, invoking third party code increasingly involves calling web services via their web APIs, as opposed to the more traditional scenario of downloading a library and invoking the library's API. However, there are also new challenges for developers calling these web APIs. In this paper, we highlight a broad set of these challenges and argue for resulting opportunities for software engineering…

    Submitted 18 May, 2017; originally announced May 2017.

    Comments: Erik Wittern and Annie Ying are both first authors

  16. arXiv:1702.03906

    cs.SE

    Statically Checking Web API Requests in JavaScript

    Authors: Erik Wittern, Annie T. T. Ying, Yunhui Zheng, Julian Dolby, Jim A. Laredo

    Abstract: Many JavaScript applications perform HTTP requests to web APIs, relying on the request URL, HTTP method, and request data to be constructed correctly by string operations. Traditional compile-time error checking, such as calling a non-existent method in Java, are not available for checking whether such requests comply with the requirements of a web API. In this paper, we propose an approach to sta…

    Submitted 15 February, 2017; v1 submitted 13 February, 2017; originally announced February 2017.

    Comments: International Conference on Software Engineering, 2017