Search | arXiv e-print repository

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang

Abstract: We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second… ▽ More We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple CoT and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with chain of thought (CoT), achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open and closed source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement. △ Less

Submitted 5 January, 2024; originally announced January 2024.

Comments: 71 pages, 29 figures

arXiv:2305.14200 [pdf, other]

Accessing Higher Dimensions for Unsupervised Word Translation

Authors: Sida I. Wang

Abstract: The striking ability of unsupervised word translation has been demonstrated with the help of word vectors / pretraining; however, they require large amounts of data and usually fails if the data come from different domains. We propose coocmap, a method that can use either high-dimensional co-occurrence counts or their lower-dimensional approximations. Freed from the limits of low dimensions, we sh… ▽ More The striking ability of unsupervised word translation has been demonstrated with the help of word vectors / pretraining; however, they require large amounts of data and usually fails if the data come from different domains. We propose coocmap, a method that can use either high-dimensional co-occurrence counts or their lower-dimensional approximations. Freed from the limits of low dimensions, we show that relying on low-dimensional vectors and their incidental properties miss out on better denoising methods and useful world knowledge in high dimensions, thus stunting the potential of the data. Our results show that unsupervised translation can be achieved more easily and robustly than previously thought -- less than 80MB and minutes of CPU time is required to achieve over 50\% accuracy for English to Finnish, Hungarian, and Chinese translations when trained on similar data; even under domain mismatch, we show coocmap still works fully unsupervised on English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. These results challenge prevailing assumptions on the necessity and superiority of low-dimensional vectors, and suggest that similarly processed co-occurrences can outperform dense vectors on other tasks too. △ Less

Submitted 23 May, 2023; originally announced May 2023.

ACM Class: I.2.7

arXiv:2305.08195 [pdf, other]

Learning to Simulate Natural Language Feedback for Interactive Semantic Parsing

Authors: Hao Yan, Saurabh Srivastava, Yintao Tai, Sida I. Wang, Wen-tau Yih, Ziyu Yao

Abstract: Interactive semantic parsing based on natural language (NL) feedback, where users provide feedback to correct the parser mistakes, has emerged as a more practical scenario than the traditional one-shot semantic parsing. However, prior work has heavily relied on human-annotated feedback data to train the interactive semantic parser, which is prohibitively expensive and not scalable. In this work, w… ▽ More Interactive semantic parsing based on natural language (NL) feedback, where users provide feedback to correct the parser mistakes, has emerged as a more practical scenario than the traditional one-shot semantic parsing. However, prior work has heavily relied on human-annotated feedback data to train the interactive semantic parser, which is prohibitively expensive and not scalable. In this work, we propose a new task of simulating NL feedback for interactive semantic parsing. We accompany the task with a novel feedback evaluator. The evaluator is specifically designed to assess the quality of the simulated feedback, based on which we decide the best feedback simulator from our proposed variants. On a text-to-SQL dataset, we show that our feedback simulator can generate high-quality NL feedback to boost the error correction ability of a specific parser. In low-data settings, our feedback simulator can help achieve comparable error correction performance as trained using the costly, full set of human annotations. △ Less

Submitted 4 June, 2023; v1 submitted 14 May, 2023; originally announced May 2023.

Comments: Accepted to ACL 2023. 18 pages, 6 figures

arXiv:2302.08468 [pdf, other]

LEVER: Learning to Verify Language-to-Code Generation with Execution

Authors: Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin

Abstract: The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics… ▽ More The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot well capture the semantic features of the execution results, such as data type and value range, which often indicates the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability, and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them. △ Less

Submitted 1 September, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: ICML'23; code available at https://github.com/niansong1996/lever

arXiv:2211.16490 [pdf, other]

Coder Reviewer Reranking for Code Generation

Authors: Tianyi Zhang, Tao Yu, Tatsunori B. Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, Sida I. Wang

Abstract: Sampling diverse programs from a code language model and reranking with model likelihood is a popular method for code generation but it is prone to preferring degenerate solutions. Inspired by collaborative programming, we propose Coder-Reviewer reranking. We augment Coder language models from past work, which generate programs given language instructions, with Reviewer models, which evaluate the… ▽ More Sampling diverse programs from a code language model and reranking with model likelihood is a popular method for code generation but it is prone to preferring degenerate solutions. Inspired by collaborative programming, we propose Coder-Reviewer reranking. We augment Coder language models from past work, which generate programs given language instructions, with Reviewer models, which evaluate the likelihood of the instruction given the generated programs. We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. When combined with executability filtering, Coder-Reviewer reranking can often outperform the minimum Bayes risk method. Coder-Reviewer reranking is easy to implement by prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameters. △ Less

Submitted 29 November, 2022; originally announced November 2022.

arXiv:2204.11454 [pdf, other]

Natural Language to Code Translation with Execution

Authors: Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, Sida I. Wang

Abstract: Generative models of code, pretrained on large corpora of programs, have shown great success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). While these models do not explicitly incorporate program semantics (i.e., execution results) during training, they are able to generate correct solutions for many problems. However, choosing a sin… ▽ More Generative models of code, pretrained on large corpora of programs, have shown great success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). While these models do not explicitly incorporate program semantics (i.e., execution results) during training, they are able to generate correct solutions for many problems. However, choosing a single correct program from a generated set for each problem remains challenging. In this work, we introduce execution result--based minimum Bayes risk decoding (MBR-EXEC) for program selection and show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks. We select output programs from a generated candidate set by marginalizing over program implementations that share the same semantics. Because exact equivalence is intractable, we execute each program on a small number of test inputs to approximate semantic equivalence. Across datasets, execution or simulated execution significantly outperforms the methods that do not involve program semantics. We find that MBR-EXEC consistently improves over all execution-unaware selection methods, suggesting it as an effective approach for natural language to code translation. We open-source our code at github.com/facebookresearch/mbr-exec and data at dl.fbaipublicfiles.com/mbr-exec/mbr-exec-release.zip △ Less

Submitted 1 November, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

Comments: EMNLP 2022

arXiv:2201.05966 [pdf, other]

UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models

Authors: Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, Tao Yu

Abstract: Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests, such as semantic parsing over databases and question answering over knowledge bases. Since the inputs and outputs of SKG tasks are heterogeneous, they have been studied separately by different communities, which limits systematic and compatible research on SKG. In this paper, we overcome this limitation… ▽ More Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests, such as semantic parsing over databases and question answering over knowledge bases. Since the inputs and outputs of SKG tasks are heterogeneous, they have been studied separately by different communities, which limits systematic and compatible research on SKG. In this paper, we overcome this limitation by proposing the UnifiedSKG framework, which unifies 21 SKG tasks into a text-to-text format, aiming to promote systematic SKG research, instead of being exclusive to a single task, domain, or dataset. We use UnifiedSKG to benchmark T5 with different sizes and show that T5, with simple modifications when necessary, achieves state-of-the-art performance on almost all of the 21 tasks. We further demonstrate that multi-task prefix-tuning improves the performance on most tasks, largely improving the overall performance. UnifiedSKG also facilitates the investigation of zero-shot and few-shot learning, and we show that T0, GPT-3, and Codex struggle in zero-shot and few-shot learning for SKG. We also use UnifiedSKG to conduct a series of controlled experiments on structured knowledge encoding variants across SKG tasks. UnifiedSKG is easily extensible to more tasks, and it is open-sourced at https://github.com/hkunlp/unifiedskg. △ Less

Submitted 18 October, 2022; v1 submitted 15 January, 2022; originally announced January 2022.

Comments: EMNLP 2022

arXiv:2111.06787 [pdf, other]

BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation

Authors: Eleftheria Briakou, Sida I. Wang, Luke Zettlemoyer, Marjan Ghazvininejad

Abstract: Mined bitexts can contain imperfect translations that yield unreliable training signals for Neural Machine Translation (NMT). While filtering such pairs out is known to improve final model quality, we argue that it is suboptimal in low-resource conditions where even mined data can be limited. In our work, we propose instead, to refine the mined bitexts via automatic editing: given a sentence in a… ▽ More Mined bitexts can contain imperfect translations that yield unreliable training signals for Neural Machine Translation (NMT). While filtering such pairs out is known to improve final model quality, we argue that it is suboptimal in low-resource conditions where even mined data can be limited. In our work, we propose instead, to refine the mined bitexts via automatic editing: given a sentence in a language xf, and a possibly imperfect translation of it xe, our model generates a revised version xf' or xe' that yields a more equivalent translation pair (i.e., <xf, xe'> or <xf', xe>). We use a simple editing strategy by (1) mining potentially imperfect translations for each sentence in a given bitext, (2) learning a model to reconstruct the original translations and translate, in a multi-task fashion. Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to ~ 8 BLEU points, in most cases improving upon a competitive back-translation baseline. △ Less

Submitted 30 May, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

arXiv:2110.10661 [pdf, other]

SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark

Authors: Victor Zhong, Austin W. Hanjie, Sida I. Wang, Karthik Narasimhan, Luke Zettlemoyer

Abstract: Existing work in language grounding typically study single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require genera… ▽ More Existing work in language grounding typically study single environments. How do we build unified models that apply across multiple environments? We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG), which unifies a collection of diverse grounded language learning environments under a common interface. SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). Together, these environments provide diverse grounding challenges in richness of observation space, action space, language specification, and plan complexity. In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. Moreover, we find that many recent modelling advances do not result in significant gains on environments other than the one they were designed for. This highlights the need for a multi-environment benchmark. Finally, the best models significantly underperform humans on SILG, which suggests ample room for future work. We hope SILG enables the community to quickly identify new methodologies for language grounding that generalize to a diverse set of environments and their associated challenges. △ Less

Submitted 24 January, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

Comments: NeurIPS 2021. 14 pages, 8 figures

arXiv:2108.02799 [pdf, other]

doi 10.1145/3472538.3472579

Using Machine Learning to Predict Game Outcomes Based on Player-Champion Experience in League of Legends

Authors: Tiffany D. Do, Seong Ioi Wang, Dylan S. Yu, Matthew G. McMillian, Ryan P. McMahan

Abstract: League of Legends (LoL) is the most widely played multiplayer online battle arena (MOBA) game in the world. An important aspect of LoL is competitive ranked play, which utilizes a skill-based matchmaking system to form fair teams. However, players' skill levels vary widely depending on which champion, or hero, that they choose to play as. In this paper, we propose a method for predicting game outc… ▽ More League of Legends (LoL) is the most widely played multiplayer online battle arena (MOBA) game in the world. An important aspect of LoL is competitive ranked play, which utilizes a skill-based matchmaking system to form fair teams. However, players' skill levels vary widely depending on which champion, or hero, that they choose to play as. In this paper, we propose a method for predicting game outcomes in ranked LoL games based on players' experience with their selected champion. Using a deep neural network, we found that game outcomes can be predicted with 75.1% accuracy after all players have selected champions, which occurs before gameplay begins. Our results have important implications for playing LoL and matchmaking. Firstly, individual champion skill plays a significant role in the outcome of a match, regardless of team composition. Secondly, even after the skill-based matchmaking, there is still a wide variance in team skill before gameplay begins. Finally, players should only play champions that they have mastered, if they want to win games. △ Less

Submitted 5 August, 2021; originally announced August 2021.

Comments: 5 pages, In The 16th International Conference on the Foundations of Digital Games (FDG) 2021

arXiv:2101.00148 [pdf, other]

Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment

Authors: Haoyue Shi, Luke Zettlemoyer, Sida I. Wang

Abstract: Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent al… ▽ More Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning of word meaning in context. Further analysis of our output and the standard reference lexicons suggests they are of comparable quality, and new benchmarks may be needed to measure further progress on this task. △ Less

Submitted 12 June, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

Comments: ACL-IJCNLP 2021 camera-ready version, with full supplementary material

arXiv:2009.07396 [pdf, other]

Grounded Adaptation for Zero-shot Executable Semantic Parsing

Authors: Victor Zhong, Mike Lewis, Sida I. Wang, Luke Zettlemoyer

Abstract: We propose Grounded Adaptation for Zero-shot Executable Semantic Parsing (GAZP) to adapt an existing semantic parser to new environments (e.g. new database schemas). GAZP combines a forward semantic parser with a backward utterance generator to synthesize data (e.g. utterances and SQL queries) in the new environment, then selects cycle-consistent examples to adapt the parser. Unlike data-augmentat… ▽ More We propose Grounded Adaptation for Zero-shot Executable Semantic Parsing (GAZP) to adapt an existing semantic parser to new environments (e.g. new database schemas). GAZP combines a forward semantic parser with a backward utterance generator to synthesize data (e.g. utterances and SQL queries) in the new environment, then selects cycle-consistent examples to adapt the parser. Unlike data-augmentation, which typically synthesizes unverified examples in the training environment, GAZP synthesizes examples in the new environment whose input-output consistency are verified. On the Spider, Sparc, and CoSQL zero-shot semantic parsing tasks, GAZP improves logical form and execution accuracy of the baseline parser. Our analyses show that GAZP outperforms data-augmentation in the training environment, performance increases with the amount of GAZP-synthesized data, and cycle-consistency is central to successful adaptation. △ Less

Submitted 1 February, 2021; v1 submitted 15 September, 2020; originally announced September 2020.

Comments: EMNLP 2020 long paper. 14 pages, 5 figures

arXiv:2006.10191 [pdf, other]

Using Collaborative Filtering to Recommend Champions in League of Legends

Authors: Tiffany D. Do, Dylan S. Yu, Salman Anwer, Seong Ioi Wang

Abstract: League of Legends (LoL), one of the most widely played computer games in the world, has over 140 playable characters known as champions that have highly varying play styles. However, there is not much work on providing champion recommendations to a player in LoL. In this paper, we propose that a recommendation system based on a collaborative filtering approach using singular value decomposition pr… ▽ More League of Legends (LoL), one of the most widely played computer games in the world, has over 140 playable characters known as champions that have highly varying play styles. However, there is not much work on providing champion recommendations to a player in LoL. In this paper, we propose that a recommendation system based on a collaborative filtering approach using singular value decomposition provides champion recommendations that players enjoy. We discuss the implementation behind our recommendation system and also evaluate the practicality of our system using a preliminary user study. Our results indicate that players significantly preferred recommendations from our system over random recommendations. △ Less

Submitted 17 June, 2020; originally announced June 2020.

Comments: 4 pages, in proceedings of 2020 IEEE Conference on Games, COG 2020

arXiv:1709.02755 [pdf, other]

Simple Recurrent Units for Highly Parallelizable Recurrence

Authors: Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, Yoav Artzi

Abstract: Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate tr… ▽ More Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5--9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model on translation by incorporating SRU into the architecture. △ Less

Submitted 7 September, 2018; v1 submitted 8 September, 2017; originally announced September 2017.

Comments: EMNLP

arXiv:1704.06956 [pdf, other]

Naturalizing a Programming Language via Interactive Learning

Authors: Sida I. Wang, Samuel Ginn, Percy Liang, Christoper D. Manning

Abstract: Our goal is to create a convenient natural language interface for performing well-specified but complex actions such as analyzing data, manipulating text, and querying databases. However, existing natural language interfaces for such tasks are quite primitive compared to the power one wields with a programming language. To bridge this gap, we start with a core programming language and allow users… ▽ More Our goal is to create a convenient natural language interface for performing well-specified but complex actions such as analyzing data, manipulating text, and querying databases. However, existing natural language interfaces for such tasks are quite primitive compared to the power one wields with a programming language. To bridge this gap, we start with a core programming language and allow users to "naturalize" the core language incrementally by defining alternative, more natural syntax and increasingly complex concepts in terms of compositions of simpler ones. In a voxel world, we show that a community of users can simultaneously teach a common system a diverse language and use it to build hundreds of complex voxel structures. Over the course of three days, these users went from using only the core language to using the naturalized language in 85.9\% of the last 10K utterances. △ Less

Submitted 23 April, 2017; originally announced April 2017.

Comments: 10 pages, ACL2017

ACM Class: I.2.7; I.2.6; I.2.1

arXiv:1703.02573 [pdf, other]

Data Noising as Smoothing in Neural Network Language Models

Authors: Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, Andrew Y. Ng

Abstract: Data noising is an effective technique for regularizing neural network models. While noising is widely adopted in application domains such as vision and speech, commonly used noising primitives have not been developed for discrete sequence-level settings such as language modeling. In this paper, we derive a connection between input noising in neural network language models and smoothing in $n$-gra… ▽ More Data noising is an effective technique for regularizing neural network models. While noising is widely adopted in application domains such as vision and speech, commonly used noising primitives have not been developed for discrete sequence-level settings such as language modeling. In this paper, we derive a connection between input noising in neural network language models and smoothing in $n$-gram models. Using this connection, we draw upon ideas from smoothing to develop effective noising schemes. We demonstrate performance gains when applying the proposed schemes to language modeling and machine translation. Finally, we provide empirical analysis validating the relationship between noising and smoothing. △ Less

Submitted 7 March, 2017; originally announced March 2017.

Comments: ICLR 2017

arXiv:1606.02447 [pdf, other]

Learning Language Games through Interaction

Authors: Sida I. Wang, Percy Liang, Christopher D. Manning

Abstract: We introduce a new language learning setting relevant to building adaptive natural language interfaces. It is inspired by Wittgenstein's language games: a human wishes to accomplish some task (e.g., achieving a certain configuration of blocks), but can only communicate with a computer, who performs the actual actions (e.g., removing all red blocks). The computer initially knows nothing about langu… ▽ More We introduce a new language learning setting relevant to building adaptive natural language interfaces. It is inspired by Wittgenstein's language games: a human wishes to accomplish some task (e.g., achieving a certain configuration of blocks), but can only communicate with a computer, who performs the actual actions (e.g., removing all red blocks). The computer initially knows nothing about language and therefore must learn it from scratch through interaction, while the human adapts to the computer's capabilities. We created a game in a blocks world and collected interactions from 100 people playing it. First, we analyze the humans' strategies, showing that using compositionality and avoiding synonyms correlates positively with task performance. Second, we compare computer strategies, showing how to quickly learn a semantic parsing model from scratch, and that modeling pragmatics further accelerates learning for successful players. △ Less

Submitted 8 June, 2016; originally announced June 2016.

Comments: 11 pages, ACL 2016

ACM Class: I.2.6; I.2.7

arXiv:1603.08482 [pdf, other]

Estimating Mixture Models via Mixtures of Polynomials

Authors: Sida I. Wang, Arun Tejasvi Chaganty, Percy Liang

Abstract: Mixture modeling is a general technique for making any simple model more expressive through weighted combination. This generality and simplicity in part explains the success of the Expectation Maximization (EM) algorithm, in which updates are easy to derive for a wide class of mixture models. However, the likelihood of a mixture model is non-convex, so EM has no known global convergence guarantees… ▽ More Mixture modeling is a general technique for making any simple model more expressive through weighted combination. This generality and simplicity in part explains the success of the Expectation Maximization (EM) algorithm, in which updates are easy to derive for a wide class of mixture models. However, the likelihood of a mixture model is non-convex, so EM has no known global convergence guarantees. Recently, method of moments approaches offer global guarantees for some mixture models, but they do not extend easily to the range of mixture models that exist. In this work, we present Polymom, an unifying framework based on method of moments in which estimation procedures are easily derivable, just as in EM. Polymom is applicable when the moments of a single mixture component are polynomials of the parameters. Our key observation is that the moments of the mixture model are a mixture of these polynomials, which allows us to cast estimation as a Generalized Moment Problem. We solve its relaxations using semidefinite optimization, and then extract parameters using ideas from computer algebra. This framework allows us to draw insights and apply tools from convex optimization, computer algebra and the theory of moments to study problems in statistical estimation. △ Less

Submitted 28 March, 2016; originally announced March 2016.

Comments: NIPS 2015

arXiv:1408.2270 [pdf, ps, other]

A sub-constant improvement in approximating the positive semidefinite Grothendieck problem

Authors: Roy Frostig, Sida I. Wang

Abstract: Semidefinite relaxations are a powerful tool for approximately solving combinatorial optimization problems such as MAX-CUT and the Grothendieck problem. By exploiting a bounded rank property of extreme points in the semidefinite cone, we make a sub-constant improvement in the approximation ratio of one such problem. Precisely, we describe a polynomial-time algorithm for the positive semidefinite G… ▽ More Semidefinite relaxations are a powerful tool for approximately solving combinatorial optimization problems such as MAX-CUT and the Grothendieck problem. By exploiting a bounded rank property of extreme points in the semidefinite cone, we make a sub-constant improvement in the approximation ratio of one such problem. Precisely, we describe a polynomial-time algorithm for the positive semidefinite Grothendieck problem -- based on rounding from the standard relaxation -- that achieves a ratio of $2/π+ Θ(1/{\sqrt n})$, whereas the previous best is $2/π+ Θ(1/n)$. We further show a corresponding integrality gap of $2/π+\tilde{O}(1/n^{1/3})$. △ Less

Submitted 10 August, 2014; originally announced August 2014.

arXiv:1312.6205 [pdf, other]

Relaxations for inference in restricted Boltzmann machines

Authors: Sida I. Wang, Roy Frostig, Percy Liang, Christopher D. Manning

Abstract: We propose a relaxation-based approximate inference algorithm that samples near-MAP configurations of a binary pairwise Markov random field. We experiment on MAP inference tasks in several restricted Boltzmann machines. We also use our underlying sampler to estimate the log-partition function of restricted Boltzmann machines and compare against other sampling-based methods. We propose a relaxation-based approximate inference algorithm that samples near-MAP configurations of a binary pairwise Markov random field. We experiment on MAP inference tasks in several restricted Boltzmann machines. We also use our underlying sampler to estimate the log-partition function of restricted Boltzmann machines and compare against other sampling-based methods. △ Less

Submitted 2 January, 2014; v1 submitted 20 December, 2013; originally announced December 2013.

Comments: ICLR 2014 workshop track submission

Showing 1–20 of 20 results for author: Wang, S I