Showing 1–50 of 69 results for author: Sutton, C

  1. arXiv:2404.14662  [pdf, other]

    cs.LG cs.CL cs.PL cs.SE

    NExT: Teaching Large Language Models to Reason about Code Execution

    Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin

    Abstract: A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (aka. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of h…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 35 pages

  2. arXiv:2403.06955  [pdf, other]

    cond-mat.mtrl-sci cs.LG

    Accurate Crystal Structure Prediction of New 2D Hybrid Organic Inorganic Perovskites

    Authors: Nima Karimitari, William J. Baldwin, Evan W. Muller, Zachary J. L. Bare, W. Joshua Kennedy, Gábor Csányi, Christopher Sutton

    Abstract: Low dimensional hybrid organic-inorganic perovskites (HOIPs) represent a promising class of electronically active materials for both light absorption and emission. The design space of HOIPs is extremely large, since a diverse space of organic cations can be combined with different inorganic frameworks. This immense design space allows for tunable electronic and mechanical properties, but also nece…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: 14 pages and 9 figures in the main text. Supplementary included in pdf

  3. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  4. arXiv:2312.02179  [pdf, other]

    cs.LG cs.AI cs.CL

    Training Chain-of-Thought via Latent-Variable Inference

    Authors: Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous

    Abstract: Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a ``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training se…

    Submitted 28 November, 2023; originally announced December 2023.

    Comments: 23 pages, 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  5. arXiv:2311.17311  [pdf, other]

    cs.CL cs.AI

    Universal Self-Consistency for Large Language Model Generation

    Authors: Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, Denny Zhou

    Abstract: Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Con…

    Submitted 28 November, 2023; originally announced November 2023.

  6. arXiv:2307.13883  [pdf, other]

    cs.LG cs.PL

    ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

    Authors: Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton

    Abstract: When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more co…

    Submitted 6 May, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: ICLR 2024

  7. arXiv:2306.12272  [pdf, other]

    cond-mat.mtrl-sci cs.CE cs.LG math.CO

    From structure mining to unsupervised exploration of atomic octahedral networks

    Authors: R. Patrick Xian, Ryan J. Morelock, Ido Hadar, Charles B. Musgrave, Christopher Sutton

    Abstract: Networks of atom-centered coordination octahedra commonly occur in inorganic and hybrid solid-state materials. Characterizing their spatial arrangements and characteristics is crucial for relating structures to properties for many materials families. The traditional method using case-by-case inspection becomes prohibitive for discovering trends and similarities in large datasets. Here, we operatio…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: 56 pages

  8. arXiv:2306.06545  [pdf, other]

    cs.LG stat.ML

    A Probabilistic Framework for Modular Continual Learning

    Authors: Lazar Valkov, Akash Srivastava, Swarat Chaudhuri, Charles Sutton

    Abstract: Modular approaches that use a different composition of modules for each problem are a promising direction in continual learning (CL). However, searching through the large, discrete space of module compositions is challenging, especially because evaluating a composition's performance requires a round of neural network training. We address this challenge through a modular CL framework, PICLE, that u…

    Submitted 2 May, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

  9. arXiv:2306.02049  [pdf, other]

    cs.LG cs.PL

    LambdaBeam: Neural Program Search with Higher-Order Functions and Lambdas

    Authors: Kensen Shi, Hanjun Dai, Wen-Ding Li, Kevin Ellis, Charles Sutton

    Abstract: Search is an important technique in program synthesis that allows for adaptive strategies such as focusing on particular search directions based on execution results. Several prior works have demonstrated that neural models are effective at guiding program synthesis searches. However, a common drawback of those approaches is the inability to handle iterative loops, higher-order functions, or lambd…

    Submitted 28 October, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

  10. arXiv:2212.09248  [pdf, other]

    cs.CL cs.SE

    Natural Language to Code Generation in Interactive Data Science Notebooks

    Authors: Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton

    Abstract: Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using…

    Submitted 19 December, 2022; originally announced December 2022.

    Comments: 46 pages. 32 figures

  11. arXiv:2208.07461  [pdf, other]

    cs.LG cs.PL cs.SE

    A Library for Representing Python Programs as Graphs for Machine Learning

    Authors: David Bieber, Kensen Shi, Petros Maniatis, Charles Sutton, Vincent Hellendoorn, Daniel Johnson, Daniel Tarlow

    Abstract: Graph representations of programs are commonly a central element of machine learning for code research. We introduce an open source Python library python_graphs that applies static analysis to construct graph representations of Python programs suitable for training machine learning models. Our library admits the construction of control-flow graphs, data-flow graphs, and composite ``program graphs'…

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: 21 pages, 14 figures

  12. arXiv:2207.10342  [pdf, ps, other]

    cs.CL cs.AI

    Language Model Cascades

    Authors: David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, Charles Sutton

    Abstract: Prompted models have demonstrated impressive few-shot learning abilities. Repeated interactions at test-time with a single model, or the composition of multiple models together, further expands capabilities. These compositions are probabilistic models, and may be expressed in the language of graphical models with random variables whose values are complex data types such as strings. Cases with cont…

    Submitted 28 July, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

    Comments: Presented as spotlight at the Beyond Bayes workshop at ICML 2022 (https://beyond-bayes.github.io)

  13. arXiv:2207.08050  [pdf, other]

    cs.LG stat.ML

    Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

    Authors: Simao Eduardo, Kai Xu, Alfredo Nazabal, Charles Sutton

    Abstract: Data cleaning often comprises outlier detection and data repair. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, e.g. specific image pixels being set to default values or watermarks. Consequently, models with enough capacity easily overfit to these errors, making detection and repair difficult. Seeing as a systematic outlier is a combination of…

    Submitted 16 July, 2022; originally announced July 2022.

    Comments: Submitted for review in ICLR 2022

  14. arXiv:2204.03758  [pdf, other]

    cs.LG cs.PL stat.ML

    Compositional Generalization and Decomposition in Neural Program Synthesis

    Authors: Kensen Shi, Joey Hong, Manzil Zaheer, Pengcheng Yin, Charles Sutton

    Abstract: When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, what we can measure is whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Published at the Deep Learning for Code (DL4C) Workshop at ICLR 2022

  15. arXiv:2204.02311  [pdf, other]

    cs.CL

    PaLM: Scaling Language Modeling with Pathways

    Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, et al. (42 additional authors not shown)

    Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Tran…

    Submitted 5 October, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

  16. arXiv:2203.10452  [pdf, other]

    cs.LG cs.PL stat.ML

    CrossBeam: Learning to Search in Bottom-Up Program Synthesis

    Authors: Kensen Shi, Hanjun Dai, Kevin Ellis, Charles Sutton

    Abstract: Many approaches to program synthesis perform a search within an enormous space of programs to find one that satisfies a given specification. Prior works have used neural models to guide combinatorial search algorithms, but such approaches still explore a huge portion of the search space and quickly become intractable as the size of the desired program increases. To tame the search space blowup, we…

    Submitted 20 March, 2022; originally announced March 2022.

    Comments: Published at ICLR 2022

  17. arXiv:2112.00114  [pdf, other]

    cs.LG cs.NE

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    Authors: Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena

    Abstract: Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even…

    Submitted 30 November, 2021; originally announced December 2021.

  18. arXiv:2108.07732  [pdf, other]

    cs.PL cs.LG

    Program Synthesis with Large Language Models

    Authors: Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton

    Abstract: This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize…

    Submitted 15 August, 2021; originally announced August 2021.

    Comments: Jacob and Augustus contributed equally

  19. arXiv:2106.15339  [pdf, other]

    cs.SE cs.LG cs.PL

    SpreadsheetCoder: Formula Prediction from Semi-structured Context

    Authors: Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, Denny Zhou

    Abstract: Spreadsheet formula prediction has been an important program synthesis problem with many real-world applications. Previous works typically utilize input-output examples as the specification for spreadsheet formula synthesis, where each input-output pair simulates a separate row in the spreadsheet. However, this formulation does not fully capture the rich context in real-world spreadsheets. First,…

    Submitted 26 June, 2021; originally announced June 2021.

    Comments: Published in ICML 2021

  20. arXiv:2012.00377  [pdf, other]

    cs.LG cs.AI

    Latent Programmer: Discrete Latent Codes for Program Synthesis

    Authors: Joey Hong, David Dohan, Rishabh Singh, Charles Sutton, Manzil Zaheer

    Abstract: In many sequence learning tasks, such as program synthesis and document summarization, a key problem is searching over a large space of possible output sequences. We propose to learn representations of the outputs that are specifically meant for search: rich enough to specify the desired output but compact enough to make search more efficient. Discrete latent codes are appealing for this purpose,…

    Submitted 5 August, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

    Comments: ICML 2021; 15 pages, 9 figures

  21. arXiv:2011.05363  [pdf, other]

    cs.LG

    Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration

    Authors: Hanjun Dai, Rishabh Singh, Bo Dai, Charles Sutton, Dale Schuurmans

    Abstract: Discrete structures play an important role in applications like program language modeling and software engineering. Current approaches to predicting complex structures typically consider autoregressive models for their tractability, with some sacrifice in flexibility. Energy-based models (EBMs) on the other hand offer a more flexible and thus more powerful approach to modeling such distributions,…

    Submitted 10 November, 2020; originally announced November 2020.

    Comments: NeurIPS 2020

  22. arXiv:2010.12621  [pdf, other]

    cs.LG

    Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks

    Authors: David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow

    Abstract: Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural n…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: Accepted at NeurIPS 2020

  23. arXiv:2010.11887  [pdf, other]

    cs.PL cs.LG stat.ML

    Conditional independence by typing

    Authors: Maria I. Gorinova, Andrew D. Gordon, Charles Sutton, Matthijs Vákár

    Abstract: A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models in order to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that cap…

    Submitted 18 February, 2022; v1 submitted 22 October, 2020; originally announced October 2020.

    Journal ref: ACM Transactions on Programming Languages and Systems, Volume 44, Issue 1, March 2022, Article No 4, pp 1-54

  24. arXiv:2007.14381  [pdf, other]

    cs.PL cs.LG stat.ML

    BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration

    Authors: Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton, Hanjun Dai

    Abstract: Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analyzing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up sear…

    Submitted 30 September, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

  25. arXiv:2006.10924  [pdf, other]

    stat.ML cs.LG

    Neural Program Synthesis with a Differentiable Fixer

    Authors: Matej Balog, Rishabh Singh, Petros Maniatis, Charles Sutton

    Abstract: We present a new program synthesis approach that combines an encoder-decoder based synthesis architecture with a differentiable program fixer. Our approach is inspired from the fact that human developers seldom get their program correct on the first attempt, and perform iterative testing-based program fixing to get to the desired program functionality. Similarly, our approach first learns a distri…

    Submitted 18 June, 2020; originally announced June 2020.

  26. arXiv:2004.13214  [pdf, ps, other]

    cs.SE cs.LG

    SCELMo: Source Code Embeddings from Language Models

    Authors: Rafael-Michael Karampatsis, Charles Sutton

    Abstract: Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: 12 pages

  27. arXiv:2004.00348  [pdf, other]

    cs.PL cs.LG

    OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints

    Authors: Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gordon, Charles Sutton

    Abstract: We present a new approach to the type inference problem for dynamic languages. Our goal is to combine \emph{logical} constraints, that is, deterministic information from a type system, with \emph{natural} constraints, that is, uncertain statistical information about types learnt from sources like identifier names. To this end, we introduce a framework for probabilistic type inference that combines…

    Submitted 26 March, 2021; v1 submitted 1 April, 2020; originally announced April 2020.

    Comments: 29 pages, 5 figures, 2 tables

  28. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

    Authors: Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

    Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large…

    Submitted 17 March, 2020; originally announced March 2020.

    Comments: 13 pages; to appear in Proceedings of ICSE 2020

  29. arXiv:2003.04227  [pdf, other]

    cs.LG cs.AI

    Towards Modular Algorithm Induction

    Authors: Daniel A. Abolafia, Rishabh Singh, Manzil Zaheer, Charles Sutton

    Abstract: We present a modular neural network architecture Main that learns algorithms given a set of input-output examples. Main consists of a neural controller that interacts with a variable-length input tape and learns to compose modules together with their corresponding argument choices. Unlike previous approaches, Main uses a general domain-agnostic mechanism for selection of modules and their argument…

    Submitted 27 February, 2020; originally announced March 2020.

    Comments: 10 pages, 4 figures, 2 tables

  30. arXiv:2002.09067  [pdf, other]

    cs.LG cs.DS stat.ML

    Incremental Sampling Without Replacement for Sequence Models

    Authors: Kensen Shi, David Bieber, Charles Sutton

    Abstract: Sampling is a fundamental technique, and sampling without replacement is often desirable when duplicate samples are not beneficial. Within machine learning, sampling is useful for generating diverse outputs from a trained model. We present an elegant procedure for sampling without replacement from a broad class of randomized programs, including generative neural models that construct outputs seque…

    Submitted 19 July, 2021; v1 submitted 20 February, 2020; originally announced February 2020.

  31. arXiv:2002.09030  [pdf, other]

    cs.PL cs.LG

    Learning to Represent Programs with Property Signatures

    Authors: Augustus Odena, Charles Sutton

    Abstract: We introduce the notion of property signatures, a representation for programs and program specifications meant for consumption by machine learning algorithms. Given a function with input type $τ_{in}$ and output type $τ_{out}$, a property is a function of type: $(τ_{in}, τ_{out}) \rightarrow \texttt{Bool}$ that (informally) describes some simple property of the function under consideration. For in…

    Submitted 12 February, 2020; originally announced February 2020.

    Comments: ICLR 2020

  32. arXiv:1911.01205  [pdf, other]

    cs.LG cs.AI cs.SE stat.ML

    Learning to Fix Build Errors with Graph2Diff Neural Networks

    Authors: Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton, Edward Aftandilian

    Abstract: Professional software developers spend a significant amount of time fixing builds, but this has received little attention as a problem in automatic program repair. We present a new deep learning architecture, called Graph2Diff, for automatically localizing and fixing build errors. We represent source code, build configuration files, and compiler diagnostic messages as a graph, and then use a Graph…

    Submitted 4 November, 2019; originally announced November 2019.

    Comments: Submitted for review on Aug 23, 2019

  33. arXiv:1907.06671  [pdf, other]

    cs.LG stat.ML

    Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data

    Authors: Simão Eduardo, Alfredo Nazábal, Christopher K. I. Williams, Charles Sutton

    Abstract: We focus on the problem of unsupervised cell outlier detection and repair in mixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells are corrupted in a specific row is an important problem in practice, and the very first step towards repairing them. We introduce the Robust Variational Autoencoder (RVAE)…

    Submitted 3 March, 2020; v1 submitted 15 July, 2019; originally announced July 2019.

    Comments: Accepted for publication at AISTATS 2020

  34. arXiv:1906.00781  [pdf, other]

    cs.DB cs.IR cs.LG

    Learning Semantic Annotations for Tabular Data

    Authors: Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, Charles Sutton

    Abstract: The usefulness of tabular data such as web tables critically depends on understanding their semantics. This study focuses on column type prediction for tables without any meta data. Unlike traditional lexical matching-based methods, we propose a deep prediction model that can fully exploit a table's contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN), a…

    Submitted 30 May, 2019; originally announced June 2019.

    Comments: 7 pages

    Journal ref: IJCAI 2019

  35. How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset

    Authors: Rafael-Michael Karampatsis, Charles Sutton

    Abstract: Program repair is an important but difficult software engineering problem. One way to achieve acceptable performance is to focus on classes of simple bugs, such as bugs with single statement fixes, or that match a small set of bug templates. However, it is very difficult to estimate the recall of repair techniques for simple bugs, as there are no datasets about how often the associated bugs occur…

    Submitted 10 April, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: 5 pages; to appear in Proceedings of MSR 2020

  36. arXiv:1903.05734  [pdf, ps, other]

    cs.SE cs.LG

    Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

    Authors: Rafael-Michael Karampatsis, Charles Sutton

    Abstract: Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the…

    Submitted 13 March, 2019; originally announced March 2019.

  37. Wrangling Messy CSV Files by Detecting Row and Type Patterns

    Authors: Gerrit J. J. van den Burg, Alfredo Nazabal, Charles Sutton

    Abstract: It is well known that data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently,…

    Submitted 27 November, 2018; originally announced November 2018.

    ACM Class: E.5; H.2.8

    Journal ref: Data Mining and Knowledge Discovery (July, 2019)

  38. arXiv:1811.01304  [pdf, other]

    cs.CL cs.AI

    ColNet: Embedding the Semantics of Web Tables for Column Type Prediction

    Authors: Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ian Horrocks, Charles Sutton

    Abstract: Automatically annotating column types with knowledge base (KB) concepts is a critical task to gain a basic understanding of web tables. Current methods rely on either table metadata like column name or entity correspondences of cells in the KB, and may fail to deal with growing web tables with incomplete meta information. In this paper we propose a neural network based column type annotation frame…

    Submitted 14 November, 2018; v1 submitted 3 November, 2018; originally announced November 2018.

    Comments: AAAI 2019

  39. arXiv:1811.00890  [pdf, other]

    cs.PL stat.CO stat.ML

    Probabilistic Programming with Densities in SlicStan: Efficient, Flexible and Deterministic

    Authors: Maria I. Gorinova, Andrew D. Gordon, Charles Sutton

    Abstract: Stan is a probabilistic programming language that has been increasingly used for real-world scalable projects. However, to make practical inference possible, the language sacrifices some of its usability by adopting a block syntax, which lacks compositionality and flexible user-defined functions. Moreover, the semantics of the language has been mainly given in terms of intuition about implementati…

    Submitted 2 November, 2018; originally announced November 2018.

    Journal ref: Proc. ACM Program. Lang. 3, POPL, Article 35 (January 2019)

  40. arXiv:1806.04616  [pdf, ps, other]

    cs.SE cs.CL

    Deep Learning to Detect Redundant Method Comments

    Authors: Annie Louis, Santanu Kumar Dash, Earl T. Barr, Charles Sutton

    Abstract: Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from cod…

    Submitted 12 June, 2018; originally announced June 2018.

    Comments: 12 pages

  41. arXiv:1806.00101  [pdf, other]

    stat.ML cs.LG

    Generative Ratio Matching Networks

    Authors: Akash Srivastava, Kai Xu, Michael U. Gutmann, Charles Sutton

    Abstract: Deep generative models can learn to generate realistic-looking images, but many of the most effective methods are adversarial and involve a saddlepoint optimization, which requires a careful balancing of training between a generator network and a critic network. Maximum mean discrepancy networks (MMD-nets) avoid this issue by using kernel as a fixed adversary, but unfortunately, they have not on t…

    Submitted 14 February, 2020; v1 submitted 31 May, 2018; originally announced June 2018.

    Comments: ICLR 2020; Code: https://github.com/GRAM-nets

  42. arXiv:1804.07944  [pdf, other]

    cs.CL cs.LG stat.ML

    Variational Inference In Pachinko Allocation Machines

    Authors: Akash Srivastava, Charles Sutton

    Abstract: The Pachinko Allocation Machine (PAM) is a deep topic model that allows representing rich correlation structures among topics by a directed acyclic graph over topics. Because of the flexibility of the model, however, approximate inference is very difficult. Perhaps for this reason, only a small number of potential PAM architectures have been explored in the literature. In this paper we present an…

    Submitted 21 April, 2018; originally announced April 2018.

  43. arXiv:1804.00218  [pdf, other]

    cs.LG cs.PL stat.ML

    HOUDINI: Lifelong Learning as Program Synthesis

    Authors: Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, Swarat Chaudhuri

    Abstract: We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these…

    Submitted 28 October, 2018; v1 submitted 31 March, 2018; originally announced April 2018.

  44. arXiv:1803.04042  [pdf, other]

    cs.LG stat.ML

    Interpreting Deep Classifier by Visual Distillation of Dark Knowledge

    Authors: Kai Xu, Dae Hoon Park, Chang Yi, Charles Sutton

    Abstract: Interpreting black box classifiers, such as deep networks, allows an analyst to validate a classifier before it is deployed in a high-stakes setting. A natural idea is to visualize the deep network's representations, so as to "see what the network sees". In this paper, we demonstrate that standard dimension reduction methods in this setting can yield uninformative or even misleading visualizations…

    Submitted 11 March, 2018; originally announced March 2018.

  45. arXiv:1802.03997  [pdf, ps, other]

    cs.SI

    GEMSEC: Graph Embedding with Self Clustering

    Authors: Benedek Rozemberczki, Ryan Davies, Rik Sarkar, Charles Sutton

    Abstract: Modern graph embedding procedures can efficiently process graphs with millions of nodes. In this paper, we propose GEMSEC -- a graph embedding algorithm which learns a clustering of the nodes simultaneously with computing their embedding. GEMSEC is a general extension of earlier work in the domain of sequence-based graph embedding. GEMSEC places nodes in an abstract feature space where the vertex…

    Submitted 25 July, 2019; v1 submitted 12 February, 2018; originally announced February 2018.

    Journal ref: ASONAM 2019

  46. arXiv:1710.05225  [pdf, other]

    cs.DL

    Popularity of arXiv.org within Computer Science

    Authors: Charles Sutton, Linan Gong

    Abstract: It may seem surprising that, out of all areas of science, computer scientists have been slow to post electronic versions of papers on sites like arXiv.org. Instead, computer scientists have tended to place papers on our individual home pages, but this loses the benefits of aggregation, namely notification and browsing. But this is changing. More and more computer scientists are now using the arX…

    Submitted 14 October, 2017; originally announced October 2017.

  47. arXiv:1709.06182  [pdf, ps, other]

    cs.SE cs.LG cs.PL

    A Survey of Machine Learning for Big Code and Naturalness

    Authors: Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton

    Abstract: Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design…

    Submitted 4 May, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

    Comments: Website accompanying this survey paper can be found at https://ml4code.github.io

  48. arXiv:1612.09106  [pdf, other]

    stat.AP cs.LG

    Sequence-to-point learning with neural networks for nonintrusive load monitoring

    Authors: Chaoyun Zhang, Mingjun Zhong, Zongzuo Wang, Nigel Goddard, Charles Sutton

    Abstract: Energy disaggregation (a.k.a nonintrusive load monitoring, NILM), a single-channel blind source separation problem, aims to decompose the mains which records the whole house electricity consumption into appliance-wise readings. This problem is difficult because it is inherently unidentifiable. Recent approaches have shown that the identifiability problem could be reduced by introducing domain know…

    Submitted 18 September, 2017; v1 submitted 29 December, 2016; originally announced December 2016.

    Comments: 8 pages, 3 figures

    Journal ref: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018

  49. arXiv:1611.02516  [pdf, other]

    cs.SE

    Tailored Mutants Fit Bugs Better

    Authors: Miltiadis Allamanis, Earl T. Barr, René Just, Charles Sutton

    Abstract: Mutation analysis measures test suite adequacy, the degree to which a test suite detects seeded faults: one test suite is better than another if it detects more mutants. Mutation analysis effectiveness rests on the assumption that mutants are coupled with real faults i.e. mutant detection is strongly correlated with real fault detection. The work that validated this also showed that a large portio…

    Submitted 8 November, 2016; originally announced November 2016.

  50. arXiv:1611.01423  [pdf, other]

    cs.LG cs.AI

    Learning Continuous Semantic Representations of Symbolic Expressions

    Authors: Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, Charles Sutton

    Abstract: Combining abstract, symbolic reasoning with continuous neural reasoning is a grand challenge of representation learning. As a step in this direction, we propose a new architecture, called neural equivalence networks, for the problem of learning continuous semantic representations of algebraic and logical expressions. These networks are trained to represent semantic equivalence, even of expressions…

    Submitted 10 June, 2017; v1 submitted 4 November, 2016; originally announced November 2016.

    Comments: Accepted to ICML 2017