-
In-Context Learning with Long-Context Models: An In-Depth Exploration
Authors:
Amanda Bertsch,
Maor Ivgi,
Uri Alon,
Jonathan Berant,
Matthew R. Gormley,
Graham Neubig
Abstract:
As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations.…
▽ More
As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations. We contrast this with example retrieval and finetuning: example retrieval shows excellent performance at low context lengths but has diminished gains with more demonstrations; finetuning is more data hungry than ICL but can sometimes exceed long-context ICL performance with additional data. We use this ICL setting as a testbed to study several properties of both in-context learning and long-context models. We show that long-context ICL is less sensitive to random input shuffling than short-context ICL, that grou** of same-label examples can negatively impact performance, and that the performance boosts we see do not arise from cumulative gain from encoding many examples together. We conclude that although long-context ICL can be surprisingly effective, most of this gain comes from attending back to similar examples rather than task learning.
△ Less
Submitted 30 April, 2024;
originally announced May 2024.
-
Transformers Can Achieve Length Generalization But Not Robustly
Authors:
Yongchao Zhou,
Uri Alon,
Xinyun Chen,
Xuezhi Wang,
Rishabh Agarwal,
Denny Zhou
Abstract:
Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the succe…
▽ More
Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.
△ Less
Submitted 14 February, 2024;
originally announced February 2024.
-
In-Context Principle Learning from Mistakes
Authors:
Tianjun Zhang,
Aman Madaan,
Luyu Gao,
Steven Zheng,
Swaroop Mishra,
Yiming Yang,
Niket Tandon,
Uri Alon
Abstract:
In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm, by learning more from the few given input-output examples. We introduce Learning Principles (LEAP): Firs…
▽ More
In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm, by learning more from the few given input-output examples. We introduce Learning Principles (LEAP): First, we intentionally induce the model to make mistakes on these few examples; then we reflect on these mistakes, and learn explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes; finally, we prompt the model to answer unseen test questions using the original few-shot examples and these learned general principles. We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4 turbo and Claude-2.1. For example, LEAP improves over the standard few-shot prompting using GPT-4 by 7.5% in DROP, and by 3.3% in HotpotQA. Importantly, LEAP does not require any more input or examples than the standard few-shot prompting settings.
△ Less
Submitted 9 February, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Universal Self-Consistency for Large Language Model Generation
Authors:
Xinyun Chen,
Renat Aksitov,
Uri Alon,
Jie Ren,
Kefan Xiao,
Pengcheng Yin,
Sushant Prakash,
Charles Sutton,
Xuezhi Wang,
Denny Zhou
Abstract:
Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Con…
▽ More
Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
CAT-LM: Training Language Models on Aligned Code And Tests
Authors:
Nikitha Rao,
Kush Jain,
Uri Alon,
Claire Le Goues,
Vincent J. Hellendoorn
Abstract:
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. Language models trained on code can generate code that is highly similar to that written by humans, but…
▽ More
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. Language models trained on code can generate code that is highly similar to that written by humans, but current models are trained to generate each file separately, as is standard practice in natural language processing, and thus fail to consider the code-under-test context when producing a test file. In this work, we propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 Billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the map** between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. We analyze its usefulness for realistic applications, showing that sampling with filtering (e.g., by compilability, coverage) allows it to efficiently produce tests that achieve coverage similar to ones written by developers while resembling their writing style. By utilizing the code context, CAT-LM generates more valid tests than even much larger language models trained with more data (CodeGen 16B and StarCoder) and substantially outperforms a recent test-specific model (TeCo) at test completion. Overall, our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way to more powerful automated test generation.
△ Less
Submitted 2 October, 2023;
originally announced October 2023.
-
Contextual Predictive Mutation Testing
Authors:
Kush Jain,
Uri Alon,
Alex Groce,
Claire Le Goues
Abstract:
Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect…
▽ More
Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT, also outperforms the state-of-the-art in both same project and cross project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation, and aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance. MutationBERT not only enhances the state-of-the-art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and finding hard to detect mutants.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
WebArena: A Realistic Web Environment for Building Autonomous Agents
Authors:
Shuyan Zhou,
Frank F. Xu,
Hao Zhu,
Xuhui Zhou,
Robert Lo,
Abishek Sridhar,
Xianyi Cheng,
Tianyue Ou,
Yonatan Bisk,
Daniel Fried,
Uri Alon,
Graham Neubig
Abstract:
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, w…
▽ More
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.
△ Less
Submitted 16 April, 2024; v1 submitted 25 July, 2023;
originally announced July 2023.
-
GPT-Calls: Enhancing Call Segmentation and Tagging by Generating Synthetic Conversations via Large Language Models
Authors:
Itzik Malkiel,
Uri Alon,
Yakir Yehuda,
Shahar Keren,
Oren Barkan,
Royi Ronen,
Noam Koenigstein
Abstract:
Transcriptions of phone calls are of significant value across diverse fields, such as sales, customer service, healthcare, and law enforcement. Nevertheless, the analysis of these recorded conversations can be an arduous and time-intensive process, especially when dealing with extended or multifaceted dialogues. In this work, we propose a novel method, GPT-distilled Calls Segmentation and Tagging…
▽ More
Transcriptions of phone calls are of significant value across diverse fields, such as sales, customer service, healthcare, and law enforcement. Nevertheless, the analysis of these recorded conversations can be an arduous and time-intensive process, especially when dealing with extended or multifaceted dialogues. In this work, we propose a novel method, GPT-distilled Calls Segmentation and Tagging (GPT-Calls), for efficient and accurate call segmentation and topic extraction. GPT-Calls is composed of offline and online phases. The offline phase is applied once to a given list of topics and involves generating a distribution of synthetic sentences for each topic using a GPT model and extracting anchor vectors. The online phase is applied to every call separately and scores the similarity between the transcripted conversation and the topic anchors found in the offline phase. Then, time domain analysis is applied to the similarity scores to group utterances into segments and tag them with topics. The proposed paradigm provides an accurate and efficient method for call segmentation and topic extraction that does not require labeled data, thus making it a versatile approach applicable to various domains. Our algorithm operates in production under Dynamics 365 Sales Conversation Intelligence, and our research is based on real sales conversations gathered from various Dynamics 365 Sales tenants.
△ Less
Submitted 9 June, 2023;
originally announced June 2023.
-
On the Expressivity Role of LayerNorm in Transformers' Attention
Authors:
Shaked Brody,
Uri Alon,
Eran Yahav
Abstract:
Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geome…
▽ More
Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors to a $d-1$ space that is orthogonal to the $\left[1,1,...,1\right]$ vector, and (b) scaling of all vectors to the same norm of $\sqrt{d}$. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, offloading the need to learn this operation by the attention; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from being "un-select-able". We show empirically that Transformers do indeed benefit from these properties of LayeNorm in general language modeling and even in computing simple functions such as "majority". Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role .
△ Less
Submitted 11 May, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Unlimiformer: Long-Range Transformers with Unlimited Length Input
Authors:
Amanda Bertsch,
Uri Alon,
Graham Neubig,
Matthew R. Gormley
Abstract:
Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances ar…
▽ More
Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores. This kNN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer .
△ Less
Submitted 30 October, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Self-Refine: Iterative Refinement with Self-Feedback
Authors:
Aman Madaan,
Niket Tandon,
Prakhar Gupta,
Skyler Hallinan,
Luyu Gao,
Sarah Wiegreffe,
Uri Alon,
Nouha Dziri,
Shrimai Prabhumoye,
Yiming Yang,
Shashank Gupta,
Bodhisattwa Prasad Majumder,
Katherine Hermann,
Sean Welleck,
Amir Yazdanbakhsh,
Peter Clark
Abstract:
Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLMs; then, the same LLMs provides feedback for its output and uses it…
▽ More
Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLMs; then, the same LLMs provides feedback for its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
△ Less
Submitted 25 May, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Learning Performance-Improving Code Edits
Authors:
Alexander Shypula,
Aman Madaan,
Yimeng Zeng,
Uri Alon,
Jacob Gardner,
Milad Hashemi,
Graham Neubig,
Parthasarathy Ranganathan,
Osbert Bastani,
Amir Yazdanbakhsh
Abstract:
With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To t…
▽ More
With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements." To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves a mean speedup of 6.86 with eight generations, higher than average optimizations from individual programmers (3.66). Using our model's fastest generations, we set a new upper limit on the fastest speedup possible for our dataset at 9.64 compared to using the fastest human submissions available (9.56).
△ Less
Submitted 26 April, 2024; v1 submitted 15 February, 2023;
originally announced February 2023.
-
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Authors:
Shuyan Zhou,
Uri Alon,
Sumit Agarwal,
Graham Neubig
Abstract:
Since the rise of neural natural-language-to-code models (NL->Code) that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of encoding only the generated…
▽ More
Since the rise of neural natural-language-to-code models (NL->Code) that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the natural language input preceding the generated code, thus modeling the consistency between the generated code and its given natural language context as well. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. We release five language-specific pretrained models to use with our publicly available code. Our language-specific models have been downloaded more than 1,000,000 times from the Huggingface Hub. Our code and data are available at https://github.com/neulab/code-bert-score
△ Less
Submitted 31 October, 2023; v1 submitted 10 February, 2023;
originally announced February 2023.
-
Why do Nearest Neighbor Language Models Work?
Authors:
Frank F. Xu,
Uri Alon,
Graham Neubig
Abstract:
Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. However recently, retrieval-augmented LMs have shown to improve over standard neural LMs, by access…
▽ More
Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. However recently, retrieval-augmented LMs have shown to improve over standard neural LMs, by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically why k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform a careful analysis of the various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate these insights into the model architecture or the training procedure of the standard parametric LM, improving its results without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.
△ Less
Submitted 17 January, 2023; v1 submitted 7 January, 2023;
originally announced January 2023.
-
PAL: Program-aided Language Models
Authors:
Luyu Gao,
Aman Madaan,
Shuyan Zhou,
Uri Alon,
Pengfei Liu,
Yiming Yang,
Jamie Callan,
Graham Neubig
Abstract:
Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought'', which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solv…
▽ More
Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought'', which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/ .
△ Less
Submitted 27 January, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
Language Models of Code are Few-Shot Commonsense Learners
Authors:
Aman Madaan,
Shuyan Zhou,
Uri Alon,
Yiming Yang,
Graham Neubig
Abstract:
We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event -- or a reasoning-graph. To employ large language models (LMs) for this task, existing approaches ``serialize'' the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corp…
▽ More
We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event -- or a reasoning-graph. To employ large language models (LMs) for this task, existing approaches ``serialize'' the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corpora that LMs were pre-trained on, hindering LMs from generating them correctly. In this paper, we show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all. We demonstrate our approach across three diverse structured commonsense reasoning tasks. In all these natural language tasks, we show that using our approach, a code generation LM (CODEX) outperforms natural-LMs that are fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting.
△ Less
Submitted 6 December, 2022; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Oversquashing in GNNs through the lens of information contraction and graph expansion
Authors:
Pradeep Kr. Banerjee,
Kedar Karhadkar,
Yu Guang Wang,
Uri Alon,
Guido Montúfar
Abstract:
The quality of signal propagation in message-passing graph neural networks (GNNs) strongly influences their expressivity as has been observed in recent works. In particular, for prediction tasks relying on long-range interactions, recursive aggregation of node features can lead to an undesired phenomenon called "oversquashing". We present a framework for analyzing oversquashing based on informatio…
▽ More
The quality of signal propagation in message-passing graph neural networks (GNNs) strongly influences their expressivity as has been observed in recent works. In particular, for prediction tasks relying on long-range interactions, recursive aggregation of node features can lead to an undesired phenomenon called "oversquashing". We present a framework for analyzing oversquashing based on information contraction. Our analysis is guided by a model of reliable computation due to von Neumann that lends a new insight into oversquashing as signal quenching in noisy computation graphs. Building on this, we propose a graph rewiring algorithm aimed at alleviating oversquashing. Our algorithm employs a random local edge flip primitive motivated by an expander graph construction. We compare the spectral expansion properties of our algorithm with that of an existing curvature-based non-local rewiring strategy. Synthetic experiments show that while our algorithm in general has a slower rate of expansion, it is overall computationally cheaper, preserves the node degrees exactly and never disconnects the graph.
△ Less
Submitted 6 August, 2022;
originally announced August 2022.
-
DocPrompting: Generating Code by Retrieving the Docs
Authors:
Shuyan Zhou,
Uri Alon,
Frank F. Xu,
Zhiruo Wang,
Zhengbao Jiang,
Graham Neubig
Abstract:
Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in the training data. In contrast, when human pr…
▽ More
Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in the training data. In contrast, when human programmers use functions and libraries for the first time, they frequently refer to textual resources such as code manuals and documentation, to explore and understand the available functionality. Inspired by this observation, we introduce DocPrompting: a natural-language-to-code generation approach that explicitly leverages documentation by (1) retrieving the relevant documentation pieces given an NL intent, and (2) generating code based on the NL intent and the retrieved documentation. DocPrompting is general: it can be applied to any programming language and is agnostic to the underlying neural model. We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo1.3B by up to absolute 6.9% exact match.
△ Less
Submitted 18 February, 2023; v1 submitted 13 July, 2022;
originally announced July 2022.
-
A Systematic Evaluation of Large Language Models of Code
Authors:
Frank F. Xu,
Uri Alon,
Graham Neubig,
Vincent J. Hellendoorn
Abstract:
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation…
▽ More
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.
△ Less
Submitted 4 May, 2022; v1 submitted 26 February, 2022;
originally announced February 2022.
-
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
Authors:
Uri Alon,
Frank F. Xu,
Junxian He,
Sudipta Sengupta,
Dan Roth,
Graham Neubig
Abstract:
Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present Reto…
▽ More
Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over $k$NN-LM (Khandelwal et al., 2020) without hurting perplexity. Our code and trained models are available at https://github.com/neulab/retomaton .
△ Less
Submitted 9 June, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
How Attentive are Graph Attention Networks?
Authors:
Shaked Brody,
Uri Alon,
Eran Yahav
Abstract:
Graph Attention Networks (GATs) are one of the most popular GNN architectures and are considered as the state-of-the-art architecture for representation learning with graphs. In GAT, every node attends to its neighbors given its own representation as the query. However, in this paper we show that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on…
▽ More
Graph Attention Networks (GATs) are one of the most popular GNN architectures and are considered as the state-of-the-art architecture for representation learning with graphs. In GAT, every node attends to its neighbors given its own representation as the query. However, in this paper we show that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express: in a controlled problem, we show that static attention hinders GAT from even fitting the training data. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2: a dynamic graph attention variant that is strictly more expressive than GAT. We perform an extensive evaluation and show that GATv2 outperforms GAT across 11 OGB and other benchmarks while we match their parametric costs. Our code is available at https://github.com/tech-srl/how_attentive_are_gats . GATv2 is available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library.
△ Less
Submitted 31 January, 2022; v1 submitted 30 May, 2021;
originally announced May 2021.
-
Single-Node Attacks for Fooling Graph Neural Networks
Authors:
Ben Finkelshtein,
Chaim Baskin,
Evgenii Zheltonozhskii,
Uri Alon
Abstract:
Graph neural networks (GNNs) have shown broad applicability in a variety of domains. These domains, e.g., social networks and product recommendations, are fertile ground for malicious users and behavior. In this paper, we show that GNNs are vulnerable to the extremely limited (and thus quite realistic) scenarios of a single-node adversarial attack, where the perturbed node cannot be chosen by the…
▽ More
Graph neural networks (GNNs) have shown broad applicability in a variety of domains. These domains, e.g., social networks and product recommendations, are fertile ground for malicious users and behavior. In this paper, we show that GNNs are vulnerable to the extremely limited (and thus quite realistic) scenarios of a single-node adversarial attack, where the perturbed node cannot be chosen by the attacker. That is, an attacker can force the GNN to classify any target node to a chosen label, by only slightly perturbing the features or the neighbor list of another single arbitrary node in the graph, even when not being able to select that specific attacker node. When the adversary is allowed to select the attacker node, these attacks are even more effective. We demonstrate empirically that our attack is effective across various common GNN types (e.g., GCN, GraphSAGE, GAT, GIN) and robustly optimized GNNs (e.g., Robust GCN, SM GCN, GAL, LAT-GCN), outperforming previous attacks across different real-world datasets both in a targeted and non-targeted attacks. Our code is available at https://github.com/benfinkelshtein/SINGLE .
△ Less
Submitted 29 September, 2022; v1 submitted 6 November, 2020;
originally announced November 2020.
-
On the Bottleneck of Graph Neural Networks and its Practical Implications
Authors:
Uri Alon,
Eran Yahav
Abstract:
Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs was their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing o…
▽ More
Since the proposal of the graph neural network (GNN) by Gori et al. (2005) and Scarselli et al. (2008), one of the major problems in training GNNs was their struggle to propagate information between distant nodes in the graph. We propose a new explanation for this problem: GNNs are susceptible to a bottleneck when aggregating messages across a long path. This bottleneck causes the over-squashing of exponentially growing information into fixed-size vectors. As a result, GNNs fail to propagate messages originating from distant nodes and perform poorly when the prediction task depends on long-range interaction. In this paper, we highlight the inherent problem of over-squashing in GNNs: we demonstrate that the bottleneck hinders popular GNNs from fitting long-range signals in the training data; we further show that GNNs that absorb incoming edges equally, such as GCN and GIN, are more susceptible to over-squashing than GAT and GGNN; finally, we show that prior work, which extensively tuned GNN models of long-range problems, suffers from over-squashing, and that breaking the bottleneck improves their state-of-the-art results without any tuning or additional weights. Our code is available at https://github.com/tech-srl/bottleneck/ .
△ Less
Submitted 9 March, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
A Structural Model for Contextual Code Changes
Authors:
Shaked Brody,
Uri Alon,
Eran Yahav
Abstract:
We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows u…
▽ More
We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program's Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task.
We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves a 28% relative gain over state-of-the-art sequential models and 2x higher accuracy than syntactic models that learn to generate the edited code, as opposed to modeling the edits directly.
Our code, dataset, and trained models are publicly available at https://github.com/tech-srl/c3po/ .
△ Less
Submitted 12 October, 2020; v1 submitted 27 May, 2020;
originally announced May 2020.
-
Adversarial Examples for Models of Code
Authors:
Noam Yefet,
Uri Alon,
Eran Yahav
Abstract:
Neural models of code have shown impressive results when performing tasks such as predicting method names and identifying certain kinds of bugs. We show that these models are vulnerable to adversarial examples, and introduce a novel approach for attacking trained models of code using adversarial examples. The main idea of our approach is to force a given trained model to make an incorrect predicti…
▽ More
Neural models of code have shown impressive results when performing tasks such as predicting method names and identifying certain kinds of bugs. We show that these models are vulnerable to adversarial examples, and introduce a novel approach for attacking trained models of code using adversarial examples. The main idea of our approach is to force a given trained model to make an incorrect prediction, as specified by the adversary, by introducing small perturbations that do not change the program's semantics, thereby creating an adversarial example. To find such perturbations, we present a new technique for Discrete Adversarial Manipulation of Programs (DAMP). DAMP works by deriving the desired prediction with respect to the model's inputs, while holding the model weights constant, and following the gradients to slightly modify the input code. We show that our DAMP attack is effective across three neural architectures: code2vec, GGNN, and GNN-FiLM, in both Java and C#. Our evaluations demonstrate that DAMP has up to 89% success rate in changing a prediction to the adversary's choice (a targeted attack) and a success rate of up to 94% in changing a given prediction to any incorrect prediction (a non-targeted attack). To defend a model against such attacks, we empirically examine a variety of possible defenses and discuss their trade-offs. We show that some of these defenses can dramatically drop the success rate of the attacker, with a minor penalty of 2% relative degradation in accuracy when they are not performing under attack. Our code, data, and trained models are available at https://github.com/tech-srl/adversarial-examples .
△ Less
Submitted 12 October, 2020; v1 submitted 15 October, 2019;
originally announced October 2019.
-
Structural Language Models of Code
Authors:
Uri Alon,
Roy Sadaka,
Omer Levy,
Eran Yahav
Abstract:
We address the problem of any-code completion - generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program's abstrac…
▽ More
We address the problem of any-code completion - generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous techniques that have severely restricted the kinds of expressions that can be generated in this task, our approach can generate arbitrary code in any programming language. Our model significantly outperforms both seq2seq and a variety of structured approaches in generating Java and C# code. Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/ . An online demo is available at http://AnyCodeGen.org .
△ Less
Submitted 29 July, 2020; v1 submitted 30 September, 2019;
originally announced October 2019.
-
Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs
Authors:
Yaniv David,
Uri Alon,
Eran Yahav
Abstract:
We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations.
We present a novel approach for predicting procedure names in stripped executables. Our approach co…
▽ More
We address the problem of reverse engineering of stripped executables, which contain no debug information. This is a challenging problem because of the low amount of syntactic information available in stripped executables, and the diverse assembly code patterns arising from compiler optimizations.
We present a novel approach for predicting procedure names in stripped executables. Our approach combines static analysis with neural models. The main idea is to use static analysis to obtain augmented representations of call sites; encode the structure of these call sites using the control-flow graph (CFG) and finally, generate a target name while attending to these call sites. We use our representation to drive graph-based, LSTM-based and Transformer-based architectures.
Our evaluation shows that our models produce predictions that are difficult and time consuming for humans, while improving on existing methods by 28% and by 100% over state-of-the-art neural textual models that do not use any static analysis. Code and data for this evaluation are available at https://github.com/tech-srl/Nero .
△ Less
Submitted 16 October, 2020; v1 submitted 25 February, 2019;
originally announced February 2019.
-
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
Authors:
Jonathan Shen,
Patrick Nguyen,
Yonghui Wu,
Zhifeng Chen,
Mia X. Chen,
Ye Jia,
Anjuli Kannan,
Tara Sainath,
Yuan Cao,
Chung-Cheng Chiu,
Yanzhang He,
Jan Chorowski,
Smit Hinsu,
Stella Laurenzo,
James Qin,
Orhan Firat,
Wolfgang Macherey,
Suyog Gupta,
Ankur Bapna,
Shuyuan Zhang,
Ruoming Pang,
Ron J. Weiss,
Rohit Prabhavalkar,
Qiao Liang,
Benoit Jacob
, et al. (66 additional authors not shown)
Abstract:
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…
▽ More
Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
△ Less
Submitted 21 February, 2019;
originally announced February 2019.
-
Contextual Speech Recognition with Difficult Negative Training Examples
Authors:
Uri Alon,
Golan Pundak,
Tara N. Sainath
Abstract:
Improving the representation of contextual information is key to unlocking the potential of end-to-end (E2E) automatic speech recognition (ASR). In this work, we present a novel and simple approach for training an ASR context mechanism with difficult negative examples. The main idea is to focus on proper nouns (e.g., unique entities such as names of people and places) in the reference transcript,…
▽ More
Improving the representation of contextual information is key to unlocking the potential of end-to-end (E2E) automatic speech recognition (ASR). In this work, we present a novel and simple approach for training an ASR context mechanism with difficult negative examples. The main idea is to focus on proper nouns (e.g., unique entities such as names of people and places) in the reference transcript, and use phonetically similar phrases as negative examples, encouraging the neural model to learn more discriminative representations. We apply our approach to an end-to-end contextual ASR model that jointly learns to transcribe and select the correct context items, and show that our proposed method gives up to $53.1\%$ relative improvement in word error rate (WER) across several benchmarks.
△ Less
Submitted 29 October, 2018;
originally announced October 2018.
-
code2seq: Generating Sequences from Structured Representations of Code
Authors:
Uri Alon,
Shaked Brody,
Omer Levy,
Eran Yahav
Abstract:
The ability to generate natural language sequences from source code snippets has a variety of applications such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present ${\rm {\scriptsize CODE2SEQ}}$:…
▽ More
The ability to generate natural language sequences from source code snippets has a variety of applications such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present ${\rm {\scriptsize CODE2SEQ}}$: an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of compositional paths in its abstract syntax tree (AST) and uses attention to select the relevant paths while decoding. We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to $16$M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models. An interactive online demo of our model is available at http://code2seq.org. Our code, data and trained models are available at http://github.com/tech-srl/code2seq.
△ Less
Submitted 21 February, 2019; v1 submitted 3 August, 2018;
originally announced August 2018.
-
Approximating Functions on Boxes
Authors:
Avichai Tendler,
Uri Alon
Abstract:
The vector space of all polynomial functions of degree $k$ on a box of dimension $n$ is of dimension ${n \choose k}$. A consequence of this fact is that a function can be approximated on vertices of the box using other vertices to higher degrees than expected. This approximation is useful for various biological applications such as predicting the effect of a treatment with drug combinations and co…
▽ More
The vector space of all polynomial functions of degree $k$ on a box of dimension $n$ is of dimension ${n \choose k}$. A consequence of this fact is that a function can be approximated on vertices of the box using other vertices to higher degrees than expected. This approximation is useful for various biological applications such as predicting the effect of a treatment with drug combinations and computing values of fitness landscape.
△ Less
Submitted 9 May, 2018; v1 submitted 25 March, 2018;
originally announced April 2018.
-
A General Path-Based Representation for Predicting Program Properties
Authors:
Uri Alon,
Meital Zilberstein,
Omer Levy,
Eran Yahav
Abstract:
Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming and increase programmer productivity. A major challenge when learning from programs is $\textit{how to represent programs in a way that facilitates effective learning}$.
We present a $\textit{general path-based representation}$ for learning from programs. Our repr…
▽ More
Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming and increase programmer productivity. A major challenge when learning from programs is $\textit{how to represent programs in a way that facilitates effective learning}$.
We present a $\textit{general path-based representation}$ for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens.
We show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages.
We evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs of four languages: JavaScript, Java, Python and C\#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages.
△ Less
Submitted 22 April, 2018; v1 submitted 26 March, 2018;
originally announced March 2018.
-
code2vec: Learning Distributed Representations of Code
Authors:
Uri Alon,
Meital Zilberstein,
Omer Levy,
Eran Yahav
Abstract:
We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length $\textit{code vector}$, which can be used to predict semantic properties of the snippet. This is performed by decomposing code to a collection of paths in its abstract syntax tree, and learning the atomic representa…
▽ More
We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length $\textit{code vector}$, which can be used to predict semantic properties of the snippet. This is performed by decomposing code to a collection of paths in its abstract syntax tree, and learning the atomic representation of each path $\textit{simultaneously}$ with learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 14M methods. We show that code vectors trained on this dataset can predict method names from files that were completely unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. Comparing previous techniques over the same data set, our approach obtains a relative improvement of over 75%, being the first to successfully predict method names based on a large, cross-project, corpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at http://code2vec.org. The code, data, and trained models are available at https://github.com/tech-srl/code2vec.
△ Less
Submitted 30 October, 2018; v1 submitted 26 March, 2018;
originally announced March 2018.
-
Creative Foraging: A Quantitative Paradigm for Studying Creative Exploration
Authors:
Yuval Hart,
Avraham E Mayo,
Ruth Mayo,
Liron Rozenkrantz,
Avichai Tendler,
Uri Alon,
Lior Noy
Abstract:
Creative exploration is central to science, art and cognitive development. However, research on creative exploration is limited by a lack of high-resolution automated paradigms. To address this, we present such an automated paradigm, the creative foraging game, in which people search for novel and valuable solutions in a large and well-defined space made of all possible shapes made of ten connecte…
▽ More
Creative exploration is central to science, art and cognitive development. However, research on creative exploration is limited by a lack of high-resolution automated paradigms. To address this, we present such an automated paradigm, the creative foraging game, in which people search for novel and valuable solutions in a large and well-defined space made of all possible shapes made of ten connected squares. Players discovered shape categories such as digits, letters, and airplanes. They exploited each category, then dropped it to explore once again, and so on. Aligned with a prediction of optimal foraging theory (OFT) prediction, during exploration phases, people moved along meandering paths that are about three times longer than the minimal paths between shapes, when exploiting a category of related shapes, they moved along the minimal paths. The moment of discovery of a new category was usually done at a nonprototypical and ambiguous shape, which can serve as an experimental proxy for creative leaps. People showed individual differences in their search patterns, along a continuum between two strategies: a mercurial quick-to-discover/quick-to-drop strategy and a thorough slow-to-discover/slow-to-drop strategy. Contrary to optimal foraging theory, players leave exploitation to explore again far before categories are depleted. This paradigm opens the way for automated high-resolution study of creative exploration.
△ Less
Submitted 5 June, 2017;
originally announced June 2017.
-
Evolution of bow-tie architectures in biology
Authors:
Tamar Friedlander,
Avraham E. Mayo,
Tsvi Tlusty,
Uri Alon
Abstract:
Bow-tie or hourglass structure is a common architectural feature found in biological and technological networks. A bow-tie in a multi-layered structure occurs when intermediate layers have much fewer components than the input and output layers. Examples include metabolism where a handful of building blocks mediate between multiple input nutrients and multiple output biomass components, and signali…
▽ More
Bow-tie or hourglass structure is a common architectural feature found in biological and technological networks. A bow-tie in a multi-layered structure occurs when intermediate layers have much fewer components than the input and output layers. Examples include metabolism where a handful of building blocks mediate between multiple input nutrients and multiple output biomass components, and signaling networks where information from numerous receptor types passes through a small set of signaling pathways to regulate multiple output genes. Little is known, however, about how bow-tie architectures evolve. Here, we address the evolution of bow-tie architectures using simulations of multi-layered systems evolving to fulfill a given input-output goal. We find that bow-ties spontaneously evolve when two conditions are met: (i) the evolutionary goal is rank deficient, where the rank corresponds to the minimal number of input features on which the outputs depend, and (ii) The effects of mutations on interaction intensities between components are described by product rule - namely the mutated element is multiplied by a random number. Product-rule mutations are more biologically realistic than the commonly used sum-rule mutations that add a random number to the mutated element. These conditions robustly lead to bow-tie structures. The minimal width of the intermediate network layers (the waist or knot of the bow-tie) equals the rank of the evolutionary goal. These findings can help explain the presence of bow-ties in diverse biological systems, and can also be relevant for machine learning applications that employ multi-layered networks.
△ Less
Submitted 30 April, 2014;
originally announced April 2014.
-
Mutation Rules and the Evolution of Sparseness and Modularity in Biological Systems
Authors:
Tamar Friedlander,
Avraham E. Mayo,
Tsvi Tlusty,
Uri Alon
Abstract:
Biological systems exhibit two structural features on many levels of organization: sparseness, in which only a small fraction of possible interactions between components actually occur; and modularity - the near decomposability of the system into modules with distinct functionality. Recent work suggests that modularity can evolve in a variety of circumstances, including goals that vary in time suc…
▽ More
Biological systems exhibit two structural features on many levels of organization: sparseness, in which only a small fraction of possible interactions between components actually occur; and modularity - the near decomposability of the system into modules with distinct functionality. Recent work suggests that modularity can evolve in a variety of circumstances, including goals that vary in time such that they share the same subgoals (modularly varying goals), or when connections are costly. Here, we studied the origin of modularity and sparseness focusing on the nature of the mutation process, rather than on connection cost or variations in the goal. We use simulations of evolution with different mutation rules. We found that commonly used sum-rule mutations, in which interactions are mutated by adding random numbers, do not lead to modularity or sparseness except for in special situations. In contrast, product-rule mutations in which interactions are mutated by multiplying by random numbers - a better model for the effects of biological mutations - led to sparseness naturally. When the goals of evolution are modular, in the sense that specific groups of inputs affect specific groups of outputs, product-rule mutations also lead to modular structure; sum-rule mutations do not. Product-rule mutations generate sparseness and modularity because they tend to reduce interactions, and to keep small interaction terms small.
△ Less
Submitted 13 June, 2013; v1 submitted 18 February, 2013;
originally announced February 2013.
-
Entrainment of noise-induced and limit cycle oscillators under weak noise
Authors:
Namiko Mitarai,
Uri Alon,
Mogens H. Jensen
Abstract:
Theoretical models that describe oscillations in biological systems are often either a limit cycle oscillator, where the deterministic nonlinear dynamics gives sustained periodic oscillations, or a noise-induced oscillator, where a fixed point is linearly stable with complex eigenvalues and addition of noise gives oscillations around the fixed point with fluctuating amplitude. We investigate how e…
▽ More
Theoretical models that describe oscillations in biological systems are often either a limit cycle oscillator, where the deterministic nonlinear dynamics gives sustained periodic oscillations, or a noise-induced oscillator, where a fixed point is linearly stable with complex eigenvalues and addition of noise gives oscillations around the fixed point with fluctuating amplitude. We investigate how each class of model behaves under the external periodic forcing, taking the well-studied van der Pol equation as an example. We find that, when the forcing is additive, the noise-induced oscillator can show only one-to-one entrainment to the external frequency, in contrast to the limit cycle oscillator which is known to entrain to any ratio. When the external forcing is multiplicative, on the other hand, the noise-induced oscillator can show entrainment to a few ratios other than one-to-one, while the limit cycle oscillator shows entrain to any ratio. The noise blurs the entrainment in general, but clear entrainment regions for limit cycles can be identified as long as the noise is not too strong.
△ Less
Submitted 16 May, 2013; v1 submitted 11 January, 2013;
originally announced January 2013.
-
Symmetry invariance for adapting biological systems
Authors:
Oren Shoval,
Uri Alon,
Eduardo Sontag
Abstract:
We study in this paper certain properties of the responses of dynamical systems to external inputs. The motivation arises from molecular systems biology. and, in particular, the recent discovery of an important transient property, related to Weber's law in psychophysics: "fold-change detection" in adapting systems, the property that scale uncertainty does not affect responses. FCD appears to play…
▽ More
We study in this paper certain properties of the responses of dynamical systems to external inputs. The motivation arises from molecular systems biology. and, in particular, the recent discovery of an important transient property, related to Weber's law in psychophysics: "fold-change detection" in adapting systems, the property that scale uncertainty does not affect responses. FCD appears to play an important role in key signaling transduction mechanisms in eukaryotes, including the ERK and Wnt pathways, as well as in E.coli and possibly other prokaryotic chemotaxis pathways. In this paper, we provide further theoretical results regarding this property. Far more generally, we develop a necessary and sufficient characterization of adapting systems whose transient behaviors are invariant under the action of a set (often, a group) of symmetries in their sensory field. A particular instance is FCD, which amounts to invariance under the action of the multiplicative group of positive real numbers. Our main result is framed in terms of a notion which extends equivariant actions of compact Lie groups. Its proof relies upon control theoretic tools, and in particular the uniqueness theorem for minimal realizations.
△ Less
Submitted 13 December, 2010;
originally announced December 2010.
-
Rules for biological regulation based on error minimization
Authors:
Guy Shinar,
Erez Dekel,
Tsvi Tlusty,
Uri Alon
Abstract:
The control of gene expression involves complex mechanisms that show large variation in design. For example, genes can be turned on either by the binding of an activator (positive control) or the unbinding of a repressor (negative control). What determines the choice of mode of control for each gene? This study proposes rules for gene regulation based on the assumption that free regulatory sites a…
▽ More
The control of gene expression involves complex mechanisms that show large variation in design. For example, genes can be turned on either by the binding of an activator (positive control) or the unbinding of a repressor (negative control). What determines the choice of mode of control for each gene? This study proposes rules for gene regulation based on the assumption that free regulatory sites are exposed to nonspecific binding errors, whereas sites bound to their cognate regulators are protected from errors. Hence, the selected mechanisms keep the sites bound to their designated regulators for most of the time, thus minimizing fitness-reducing errors. This offers an explanation of the empirically demonstrated Savageau demand rule: Genes that are needed often in the natural environment tend to be regulated by activators, and rarely needed genes tend to be regulated by repressors; in both cases, sites are bound for most of the time, and errors are minimized. The fitness advantage of error minimization appears to be readily selectable. The present approach can also generate rules for multi-regulator systems. The error-minimization framework raises several experimentally testable hypotheses. It may also apply to other biological regulation systems, such as those involving protein-protein interactions.
△ Less
Submitted 26 July, 2010;
originally announced July 2010.
-
Coding limits on the number of transcription factors
Authors:
Shalev Itzkovitz,
Tsvi Tlusty,
Uri Alon
Abstract:
Transcription factor proteins bind specific DNA sequences to control the expression of genes. They contain DNA binding domains which belong to several super-families, each with a specific mechanism of DNA binding. The total number of transcription factors encoded in a genome increases with the number of genes in the genome. Here, we examined the number of transcription factors from each super-fa…
▽ More
Transcription factor proteins bind specific DNA sequences to control the expression of genes. They contain DNA binding domains which belong to several super-families, each with a specific mechanism of DNA binding. The total number of transcription factors encoded in a genome increases with the number of genes in the genome. Here, we examined the number of transcription factors from each super-family in diverse organisms.
We find that the number of transcription factors from most super-families appears to be bounded. For example, the number of winged helix factors does not generally exceed 300, even in very large genomes. The magnitude of the maximal number of transcription factors from each super-family seems to correlate with the number of DNA bases effectively recognized by the binding mechanism of that super-family. Coding theory predicts that such upper bounds on the number of transcription factors should exist, in order to minimize cross-binding errors between transcription factors. This theory further predicts that factors with similar binding sequences should tend to have similar biological effect, so that errors based on mis-recognition are minimal. We present evidence that transcription factors with similar binding sequences tend to regulate genes with similar biological functions, supporting this prediction.
The present study suggests limits on the transcription factor repertoire of cells, and suggests coding constraints that might apply more generally to the map** between binding sites and biological function.
△ Less
Submitted 26 July, 2010;
originally announced July 2010.
-
Subgraphs and network motifs in geometric networks
Authors:
Shalev Itzkovitz,
Uri Alon
Abstract:
Many real-world networks describe systems in which interactions decay with the distance between nodes. Examples include systems constrained in real space such as transportation and communication networks, as well as systems constrained in abstract spaces such as multivariate biological or economic datasets and models of social networks. These networks often display network motifs: subgraphs that…
▽ More
Many real-world networks describe systems in which interactions decay with the distance between nodes. Examples include systems constrained in real space such as transportation and communication networks, as well as systems constrained in abstract spaces such as multivariate biological or economic datasets and models of social networks. These networks often display network motifs: subgraphs that recur in the network much more often than in randomized networks. To understand the origin of the network motifs in these networks, it is important to study the subgraphs and network motifs that arise solely from geometric constraints. To address this, we analyze geometric network models, in which nodes are arranged on a lattice and edges are formed with a probability that decays with the distance between nodes. We present analytical solutions for the numbers of all 3 and 4-node subgraphs, in both directed and non-directed geometric networks. We also analyze geometric networks with arbitrary degree sequences, and models with a field that biases for directed edges in one direction. Scaling rules for scaling of subgraph numbers with system size, lattice dimension and interaction range are given. Several invariant measures are found, such as the ratio of feedback and feed-forward loops, which do not depend on system size, dimension or connectivity function. We find that network motifs in many real-world networks, including social networks and neuronal networks, are not captured solely by these geometric models. This is in line with recent evidence that biological network motifs were selected as basic circuit elements with defined information-processing functions.
△ Less
Submitted 7 September, 2004;
originally announced September 2004.
-
Coarse-Graining and Self-Dissimilarity of Complex Networks
Authors:
Shalev Itzkovitz,
Reuven Levitt,
Nadav Kashtan,
Ron Milo,
Michael Itzkovitz,
Uri Alon
Abstract:
Can complex engineered and biological networks be coarse-grained into smaller and more understandable versions in which each node represents an entire pattern in the original network? To address this, we define coarse-graining units (CGU) as connectivity patterns which can serve as the nodes of a coarse-grained network, and present algorithms to detect them. We use this approach to systematicall…
▽ More
Can complex engineered and biological networks be coarse-grained into smaller and more understandable versions in which each node represents an entire pattern in the original network? To address this, we define coarse-graining units (CGU) as connectivity patterns which can serve as the nodes of a coarse-grained network, and present algorithms to detect them. We use this approach to systematically reverse-engineer electronic circuits, forming understandable high-level maps from incomprehensible transistor wiring: first, a coarse-grained version in which each node is a gate made of several transistors is established. Then, the coarse-grained network is itself coarse-grained, resulting in a high-level blueprint in which each node is a circuit-module made of multiple gates. We apply our approach also to a mammalian protein-signaling network, to find a simplified coarse-grained network with three main signaling channels that correspond to cross-interacting MAP-kinase cascades. We find that both biological and electronic networks are 'self-dissimilar', with different network motifs found at each level. The present approach can be used to simplify a wide variety of directed and nondirected, natural and designed networks.
△ Less
Submitted 18 October, 2004; v1 submitted 14 May, 2004;
originally announced May 2004.
-
Topological Generalizations of network motifs
Authors:
N. Kashtan,
S. Itzkovitz,
R. Milo,
U. Alon
Abstract:
Biological and technological networks contain patterns, termed network motifs, which occur far more often than in randomized networks. Network motifs were suggested to be elementary building blocks that carry out key functions in the network. It is of interest to understand how network motifs combine to form larger structures. To address this, we present a systematic approach to define 'motif ge…
▽ More
Biological and technological networks contain patterns, termed network motifs, which occur far more often than in randomized networks. Network motifs were suggested to be elementary building blocks that carry out key functions in the network. It is of interest to understand how network motifs combine to form larger structures. To address this, we present a systematic approach to define 'motif generalizations': families of motifs of different sizes that share a common architectural theme. To define motif generalizations, we first define 'roles' in a subgraph according to structural equivalence. For example, the feedforward loop triad, a motif in transcription, neuronal and some electronic networks, has three roles, an input node, an output node and an internal node. The roles are used to define possible generalizations of the motif. The feedforward loop can have three simple generalizations, based on replicating each of the three roles and their connections. We present algorithms for efficiently detecting motif generalizations. We find that the transcription networks of bacteria and yeast display only one of the three generalizations, the multi-output feedforward generalization. In contrast, the neuronal network of \emph{C. elegans} mainly displays the multi-input generalization. Forward-logic electronic circuits display a multi-input, multi-output hybrid. Thus, networks which share a common motif can have very different generalizations of that motif. Using mathematical modelling, we describe the information processing functions of the different motif generalizations in transcription, neuronal and electronic networks.
△ Less
Submitted 16 May, 2004; v1 submitted 15 December, 2003;
originally announced December 2003.
-
On the uniform generation of random graphs with prescribed degree sequences
Authors:
R. Milo,
N. Kashtan,
S. Itzkovitz,
M. E. J. Newman,
U. Alon
Abstract:
Random graphs with prescribed degree sequences have been widely used as a model of complex networks. Comparing an observed network to an ensemble of such graphs allows one to detect deviations from randomness in network properties. Here we briefly review two existing methods for the generation of random graphs with arbitrary degree sequences, which we call the ``switching'' and ``matching'' meth…
▽ More
Random graphs with prescribed degree sequences have been widely used as a model of complex networks. Comparing an observed network to an ensemble of such graphs allows one to detect deviations from randomness in network properties. Here we briefly review two existing methods for the generation of random graphs with arbitrary degree sequences, which we call the ``switching'' and ``matching'' methods, and present a new method based on the ``go with the winners'' Monte Carlo method. The matching method may suffer from nonuniform sampling, while the switching method has no general theoretical bound on its mixing time. The ``go with the winners'' method has neither of these drawbacks, but is slow. It can however be used to evaluate the reliability of the other two methods and, by doing this, we demonstrate that the deviations of the switching and matching algorithms under realistic conditions are small compared to the ``go with the winners'' algorithm. Because of its combination of speed and accuracy we recommend the use of the switching method for most calculations.
△ Less
Submitted 30 May, 2004; v1 submitted 1 December, 2003;
originally announced December 2003.
-
Subgraphs in random networks
Authors:
S. Itzkovitz,
R. Milo,
N. Kashtan,
G. Ziv,
U. Alon
Abstract:
Understanding the subgraph distribution in random networks is important for modelling complex systems. In classic Erdos networks, which exhibit a Poissonian degree distribution, the number of appearances of a subgraph G with n nodes and g edges scales with network size as \mean{G} ~ N^{n-g}. However, many natural networks have a non-Poissonian degree distribution. Here we present approximate equ…
▽ More
Understanding the subgraph distribution in random networks is important for modelling complex systems. In classic Erdos networks, which exhibit a Poissonian degree distribution, the number of appearances of a subgraph G with n nodes and g edges scales with network size as \mean{G} ~ N^{n-g}. However, many natural networks have a non-Poissonian degree distribution. Here we present approximate equations for the average number of subgraphs in an ensemble of random sparse directed networks, characterized by an arbitrary degree sequence. We find new scaling rules for the commonly occurring case of directed scale-free networks, in which the outgoing degree distribution scales as P(k) ~ k^{-γ}. Considering the power exponent of the degree distribution, γ, as a control parameter, we show that random networks exhibit transitions between three regimes. In each regime the subgraph number of appearances follows a different scaling law, \mean{G} ~ N^α, where α=n-g+s-1 for γ<2, α=n-g+s+1-γfor 2<γ<γ_c, and α=n-g for γ>γ_c, s is the maximal outdegree in the subgraph, and γ_c=s+1. We find that certain subgraphs appear much more frequently than in Erdos networks. These results are in very good agreement with numerical simulations. This has implications for detecting network motifs, subgraphs that occur in natural networks significantly more than in their randomized counterparts.
△ Less
Submitted 26 August, 2003; v1 submitted 19 February, 2003;
originally announced February 2003.
-
Smooth Phases, Roughening Transitions and Novel Exponents in One-dimensional Growth Models
Authors:
Uri Alon,
Martin Evans,
Haye Hinrichsen,
David Mukamel
Abstract:
A class of solid-on-solid growth models with short range interactions and sequential updates is studied. The models exhibit both smooth and rough phases in dimension d=1. Some of the features of the roughening transition which takes place in these models are related to contact processes or directed percolation type problems. The models are analyzed using a mean field approximation, scaling argum…
▽ More
A class of solid-on-solid growth models with short range interactions and sequential updates is studied. The models exhibit both smooth and rough phases in dimension d=1. Some of the features of the roughening transition which takes place in these models are related to contact processes or directed percolation type problems. The models are analyzed using a mean field approximation, scaling arguments and numerical simulations. In the smooth phase the symmetry of the underlying dynamics is spontaneously broken. A family of order parameters which are not conserved by the dynamics is defined as well as conjugate fields which couple to these order parameters. The corresponding critical behavior is studied and novel exponents identified and measured. We also show how continuous symmetries can be broken in one dimension. A field theory appropriate for studying the roughening transition is introduced and discussed.
△ Less
Submitted 15 October, 1997;
originally announced October 1997.
-
Gel-Electrophoresis and Diffusion of Ring-Shaped DNA
Authors:
Uri Alon,
David Mukamel
Abstract:
A model for the motion of ring-shaped DNA in a gel is introduced and studied by numerical simulations and a mean-field approximation. The ring motion is mediated by finger-shaped loops (hernias) that move in an amoeba-like fashion around the gel obstructions. This constitutes an extension of previous reptation tube treatments. It is shown that tension is essential for describing the dynamics in…
▽ More
A model for the motion of ring-shaped DNA in a gel is introduced and studied by numerical simulations and a mean-field approximation. The ring motion is mediated by finger-shaped loops (hernias) that move in an amoeba-like fashion around the gel obstructions. This constitutes an extension of previous reptation tube treatments. It is shown that tension is essential for describing the dynamics in the presence of hernias. It is included in the model as long range interactions over stretched DNA regions. The mobility of ring-shaped DNA is found to saturate much as in the well-studied case of linear DNA. Experiments in polymer gels, however, show that the mobility drops exponentially with the DNA ring size. This is commonly attributed to dangling-ends in the gel that can impale the ring. The predictions of the present model are expected to apply to artificial 2D obstacle arrays (W.D. Volkmuth, R.H. Austin, Nature 358,600 (1992)) which have no dangling-ends. In the zero-field case an exact solution of the model steady-state is obtained, and quantities such as the average ring size are calculated. An approximate treatment of the ring dynamics is given, and the diffusion coefficient is derived. The model is also discussed in the context of spontaneous symmetry breaking in one dimension.
△ Less
Submitted 17 February, 1997;
originally announced February 1997.
-
Roughening Transition in a One-Dimensional Growth Process
Authors:
Uri Alon,
Martin Evans,
Haye Hinrichsen,
David Mukamel
Abstract:
A class of nonequilibrium models with short-range interactions and sequential updates is presented. The models describe one dimensional growth processes which display a roughening transition between a smooth and a rough phase. This transition is accompanied by spontaneous symmetry breaking, which is described by an order parameter whose dynamics is non-conserving. Some aspects of models in this…
▽ More
A class of nonequilibrium models with short-range interactions and sequential updates is presented. The models describe one dimensional growth processes which display a roughening transition between a smooth and a rough phase. This transition is accompanied by spontaneous symmetry breaking, which is described by an order parameter whose dynamics is non-conserving. Some aspects of models in this class are related to directed percolation in 1+1 dimensions, although unlike directed percolation the models have no absorbing states. Scaling relations are derived and compared with Monte Carlo simulations.
△ Less
Submitted 11 December, 1995;
originally announced December 1995.