Search | arXiv e-print repository

Trace is the New AutoDiff -- Unlocking Efficient Optimization of Computational Workflows

Authors: Ching-An Cheng, Allen Nie, Adith Swaminathan

Abstract: We study a class of optimization problems motivated by automating the design and update of AI systems like coding assistants, robots, and copilots. We propose an end-to-end optimization framework, Trace, which treats the computational workflow of an AI system as a graph akin to neural networks, based on a generalization of back-propagation. Optimization of computational workflows often involves rich feedback (e.g. console output or user's responses), heterogeneous parameters (e.g. prompts, hyper-parameters, codes), and intricate objectives (beyond maximizing a score). Moreover, its computation graph can change dynamically with the inputs and parameters. We frame a new mathematical setup of iterative optimization, Optimization with Trace Oracle (OPTO), to capture and abstract these properties so as to design optimizers that work across many domains. In OPTO, an optimizer receives an execution trace along with feedback on the computed output and updates parameters iteratively. Trace is the tool to implement OPTO in practice. Trace has a Python interface that efficiently converts a computational workflow into an OPTO instance using a PyTorch-like interface. Using Trace, we develop a general-purpose LLM-based optimizer called OptoPrime that can effectively solve OPTO problems. In empirical studies, we find that OptoPrime is capable of first-order numerical optimization, prompt optimization, hyper-parameter tuning, robot controller design, code debugging, etc., and is often competitive with specialized optimizers for each domain.
We believe that Trace, OptoPrime and the OPTO framework will enable the next generation of interactive agents that automatically adapt using various kinds of feedback. Website: https://microsoft.github.io/Trace

Submitted 23 June, 2024; originally announced June 2024.
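
The iterative protocol described in the abstract above (run the workflow, hand the optimizer an execution trace plus feedback on the output, update the parameter, repeat) can be illustrated with a toy sketch. This is not the Trace API and not OptoPrime: every name below (Step, run_workflow, give_feedback, toy_optimizer, optimize) is hypothetical, the workflow is a two-step arithmetic function, and a hand-written rule stands in for the LLM-based optimizer.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded operation in the execution trace (hypothetical structure)."""
    op: str
    inputs: tuple
    output: float


def run_workflow(x):
    """Toy workflow: compute 2*x + 1 and record each step in a trace."""
    trace = []
    doubled = 2 * x
    trace.append(Step("double", (x,), doubled))
    shifted = doubled + 1
    trace.append(Step("add_one", (doubled,), shifted))
    return shifted, trace


def give_feedback(output, target=7.0):
    """Toy feedback oracle: returns text rather than a numeric gradient."""
    if abs(output - target) < 1e-6:
        return "correct"
    return "too high" if output > target else "too low"


def toy_optimizer(x, trace, feedback, step):
    """Stand-in for an LLM-based optimizer.

    A real optimizer would inspect the trace; this toy only reads the feedback.
    """
    if feedback == "too high":
        return x - step
    if feedback == "too low":
        return x + step
    return x


def optimize(x0, iterations=50):
    """Iterate: execute, collect trace and feedback, update the parameter."""
    x, step, last = x0, 4.0, None
    for _ in range(iterations):
        output, trace = run_workflow(x)
        feedback = give_feedback(output)
        if feedback == "correct":
            break
        if last is not None and feedback != last:
            step /= 2  # direction flipped, so shrink the update
        x = toy_optimizer(x, trace, feedback, step)
        last = feedback
    return x
```

Calling optimize(0.0) drives the toy parameter toward the value whose output matches the target. The point of the sketch is only the shape of the loop, in which the optimizer sees a trace and textual feedback instead of a gradient.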

arXiv:2405.17708 [pdf, other]

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Authors: Allen Nie, Yash Chandak, Christina J. Yuan, Anirudhan Badrinath, Yannis Flet-Berliac, Emma Brunskill

Abstract: Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, choosing the best OPE algorithm for each task and domain is still unclear. In this paper, we propose a new algorithm that adaptively blends a set of OPE estimators given a dataset without relying on an explicit selection using a statistical procedure. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that when compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 22 pages

arXiv:2405.16434 [pdf, other]

The Importance of Directional Feedback for LLM-based Optimizers

Authors: Allen Nie, Ching-An Cheng, Andrey Kolobov, Adith Swaminathan

Abstract: We study the potential of using large language models (LLMs) as an interactive optimizer for solving maximization problems in a text space using natural language and numerical feedback. Inspired by the classical optimization literature, we classify the natural language feedback into directional and non-directional, where the former is a generalization of the first-order feedback to the natural language space. We find that LLMs are especially capable of optimization when they are provided with directional feedback. Based on this insight, we design a new LLM-based optimizer that synthesizes directional feedback from the historical optimization trace to achieve reliable improvement over iterations. Empirically, we show our LLM-based optimizer is more stable and efficient in solving optimization problems, from maximizing mathematical functions to optimizing prompts for writing poems, compared with existing techniques.

Submitted 20 June, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

Comments: Accepted and Presented at Foundation Models for Decision Making at NeurIPS 2023 (December 15, 2023). Work completed from June 2023 to September 2023

arXiv:2312.06853 [pdf, other]

LLF-Bench: Benchmark for Interactive Learning from Language Feedback

Authors: Ching-An Cheng, Andrey Kolobov, Dipendra Misra, Allen Nie, Adith Swaminathan

Abstract: We introduce a new benchmark, LLF-Bench (Learning from Language Feedback Benchmark; pronounced as "elf-bench"), to evaluate the ability of AI agents to interactively learn from natural language feedback and instructions. Learning from language feedback (LLF) is essential for people, largely because the rich information this feedback provides can help a learner avoid much of trial and error and thereby speed up the learning process. Large Language Models (LLMs) have recently enabled AI agents to comprehend natural language -- and hence AI agents can potentially benefit from language feedback during learning like humans do. But existing interactive benchmarks do not assess this crucial capability: they either use numeric reward feedback or require no learning at all (only planning or information retrieval). LLF-Bench is designed to fill this omission. LLF-Bench is a diverse collection of sequential decision-making tasks that includes user recommendation, poem writing, navigation, and robot control. The objective of an agent is to interactively solve these tasks based on their natural-language instructions and the feedback received after taking actions. Crucially, to ensure that the agent actually "learns" from the feedback, LLF-Bench implements several randomization techniques (such as paraphrasing and environment randomization) to ensure that the task isn't familiar to the agent and that the agent is robust to various verbalizations.
In addition, LLF-Bench provides a unified OpenAI Gym interface for all its tasks and allows the users to easily configure the information the feedback conveys (among suggestion, explanation, and instantaneous performance) to study how agents respond to different types of feedback. Together, these features make LLF-Bench a unique research platform for developing and testing LLF agents.

Submitted 13 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2310.19677 [pdf, other]

MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks

Authors: Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, Tobias Gerstenberg

Abstract: Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable. We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. On the aggregate level, alignment has improved with more recent LLMs. However, using statistical analyses, we find that LLMs weigh the different factors quite differently from human participants. These results show how curated, challenge datasets combined with insights from cognitive science can help us go beyond comparisons based merely on aggregate metrics: we uncover LLMs' implicit tendencies and show to what extent these align with human intuitions.

Submitted 31 October, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: 34 pages, 7 figures. NeurIPS 2023

arXiv:2306.14069 [pdf, other]

Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets

Authors: Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill

Abstract: Despite the recent advancements in offline reinforcement learning via supervised learning (RvS) and the success of the decision transformer (DT) architecture in various domains, DTs have fallen short in several challenging benchmarks. The root cause of this underperformance lies in their inability to seamlessly connect segments of suboptimal trajectories. To overcome this limitation, we present a novel approach to enhance RvS methods by integrating intermediate targets. We introduce the Waypoint Transformer (WT), using an architecture that builds upon the DT framework and conditioned on automatically-generated waypoints. The results show a significant increase in the final return compared to existing RvS methods, with performance on par or greater than existing state-of-the-art temporal difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.

Submitted 18 November, 2023; v1 submitted 24 June, 2023; originally announced June 2023.

Comments: Accepted to the Conference on Neural Information Processing Systems 2023 (NeurIPS 2023)

arXiv:2304.04933 [pdf, other]

Reinforcement Learning Tutor Better Supported Lower Performers in a Math Task

Authors: Sherry Ruan, Allen Nie, William Steenbergen, Jiayu He, JQ Zhang, Meng Guo, Yao Liu, Kyle Dang Nguyen, Catherine Y Wang, Rui Ying, James A Landay, Emma Brunskill

Abstract: Resource limitations make it hard to provide all students with one of the most effective educational interventions: personalized instruction. Reinforcement learning could be a key tool to reduce the development cost and improve the effectiveness of intelligent tutoring software that aims to provide the right support, at the right time, to a student. Here we illustrate that deep reinforcement learning can be used to provide adaptive pedagogical support to students learning about the concept of volume in a narrative storyline software. Using explainable artificial intelligence tools, we extracted interpretable insights about the pedagogical policy learned and demonstrated that the resulting policy had similar performance in a different student population. Most importantly, in both studies, the reinforcement-learning narrative system had the largest benefit for those students with the lowest initial pretest scores, suggesting the opportunity for AI to adapt and provide support for those most in need.

Submitted 13 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: 23 pages. Under review

arXiv:2301.11426 [pdf, other]

Model-based Offline Reinforcement Learning with Local Misspecification

Authors: Kefan Dong, Yannis Flet-Berliac, Allen Nie, Emma Brunskill

Abstract: We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimism approximations to the value function. Our key insight is to jointly consider selecting over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, it is possible to approximate the value of that particular policy. We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks.

Submitted 26 January, 2023; originally announced January 2023.

Comments: Accepted by AAAI-23

arXiv:2211.10829 [pdf]

Depositing boron on Cu(111): Borophene or boride?

Authors: Xiao-Ji Weng, Jie Bai, Jingyu Hou, Yi Zhu, Li Wang, Penghui Li, Anmin Nie, Bo Xu, Xiang-Feng Zhou, Yongjun Tian

Abstract: Large-area single-crystal surface structures were successfully prepared on Cu(111) substrate with boron deposition, which is critical for prospective applications. However, the proposed borophene structures do not match the scanning tunneling microscopy (STM) results very well, while the proposed copper boride is at odds with the traditional knowledge that ordered copper-rich borides normally do not exist due to small difference in electronegativity and large difference in atomic size. To clarify the controversy and elucidate the formation mechanism of the unexpected copper boride, we conducted systematic STM, X-ray photoelectron spectroscopy and angle-resolved photoemission spectroscopy investigations, confirming the synthesis of two-dimensional copper boride rather than borophene on Cu(111) after boron deposition under ultrahigh vacuum. First-principles calculations with defective surface models further indicate that boron atoms tend to react with Cu atoms near terrace edges or defects, which in turn shapes the intermediate structures of copper boride and leads to the formation of stable Cu-B monolayer via large-scale surface reconstruction eventually.

Submitted 19 November, 2022; originally announced November 2022.

Comments: 15 pages, 4 figures

arXiv:2211.08909 [pdf]

Continuous Electrical Manipulation of Magnetic Anisotropy and Spin Flopping in van der Waals Ferromagnetic Devices

Authors: Ming Tang, Junwei Huang, Feng Qin, Kun Zhai, Toshiya Ideue, Zeya Li, Fanhao Meng, Anmin Nie, Linglu Wu, Xiangyu Bi, Caorong Zhang, Ling Zhou, Peng Chen, Caiyu Qiu, Peizhe Tang, Haijun Zhang, Xiangang Wan, Lin Wang, Zhongyuan Liu, Yongjun Tian, Yoshihiro Iwasa, Hongtao Yuan

Abstract: Controlling the magnetic anisotropy of ferromagnetic materials plays a key role in magnetic switching devices and spintronic applications. Examples of spin-orbit torque devices with different magnetic anisotropy geometries (in-plane or out-of-plane directions) have been demonstrated with novel magnetization switching mechanisms for extended device functionalities. Normally, the intrinsic magnetic anisotropy in ferromagnetic materials is unchanged within a fixed direction, and thus, it is difficult to realize multifunctionality devices. Therefore, continuous modulation of magnetic anisotropy in ferromagnetic materials is highly desired but remains challenging. Here, we demonstrate a gate-tunable magnetic anisotropy transition from out-of-plane to canted and finally to in-plane in layered Fe$_5$GeTe$_2$ by combining the measurements of the angle-dependent anomalous Hall effect and magneto-optical Kerr effect with quantitative Stoner-Wohlfarth analysis. The magnetic easy axis continuously rotates in a spin-flop pathway by gating or temperature modulation. Such observations offer a new avenue for exploring magnetization switching mechanisms and realizing new spintronic functionalities.

Submitted 16 November, 2022; originally announced November 2022.

Comments: 4 figures

arXiv:2211.08802 [pdf, other]

Giving Feedback on Interactive Student Programs with Meta-Exploration

Authors: Evan Zheran Liu, Moritz Stephan, Allen Nie, Chris Piech, Emma Brunskill, Chelsea Finn

Abstract: Developing interactive software, such as websites or games, is a particularly engaging way to learn computer science. However, teaching and giving feedback on such software is time-consuming -- standard approaches require instructors to manually grade student-implemented interactive programs. As a result, online platforms that serve millions, like Code.org, are unable to provide any feedback on assignments for implementing interactive programs, which critically hinders students' ability to learn. One approach toward automatic grading is to learn an agent that interacts with a student's program and explores states indicative of errors via reinforcement learning. However, existing work on this approach only provides binary feedback of whether a program is correct or not, while students require finer-grained feedback on the specific errors in their programs to understand their mistakes. In this work, we show that exploring to discover errors can be cast as a meta-exploration problem. This enables us to construct a principled objective for discovering errors and an algorithm for optimizing this objective, which provides fine-grained feedback. We evaluate our approach on a set of over 700K real anonymized student programs from a Code.org interactive assignment. Our approach provides feedback with 94.3% accuracy, improving over existing approaches by 17.7% and coming within 1.5% of human-level accuracy. Project web page: https://ezliu.github.io/dreamgrader.

Submitted 16 November, 2022; originally announced November 2022.

Comments: Advances in Neural Information Processing Systems (NeurIPS 2022). Selected as Oral

arXiv:2210.08642 [pdf, other]

Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data

Authors: Allen Nie, Yannis Flet-Berliac, Deon R. Jordan, William Steenbergen, Emma Brunskill

Abstract: Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.

Submitted 12 January, 2023; v1 submitted 16 October, 2022; originally announced October 2022.

Comments: 32 pages. Published at NeurIPS 2022. Presented at RLDM 2022
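
The idea of selecting an algorithm-hyperparameter pair by averaging over multiple data splits, as the abstract above describes, can be sketched generically. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function select_by_repeated_splits, its parameters, the toy logged data, and the crude matching-action score are all hypothetical stand-ins for real offline RL training and off-policy evaluation.

```python
import random
from statistics import mean


def select_by_repeated_splits(data, candidates, train_fn, score_fn,
                              n_splits=5, train_frac=0.5, seed=0):
    """Score each candidate on several random train/validation splits.

    Returns the candidate with the highest average validation score,
    together with the per-candidate averages.
    """
    rng = random.Random(seed)
    scores = {name: [] for name in candidates}
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, valid = shuffled[:cut], shuffled[cut:]
        for name, config in candidates.items():
            policy = train_fn(config, train)
            scores[name].append(score_fn(policy, valid))
    averages = {name: mean(vals) for name, vals in scores.items()}
    best = max(averages, key=averages.get)
    return best, averages


# Toy logged data (assumption): (action, reward) pairs from a past policy.
logged = [(0, 0.0)] * 10 + [(1, 1.0)] * 30
candidates = {"always_0": 0, "always_1": 1}


def train_fn(config, train):
    """Toy 'training': the policy is a constant action; train data is unused."""
    return config


def score_fn(policy, valid):
    """Toy evaluation: mean logged reward on rows whose action matches."""
    matched = [reward for action, reward in valid if action == policy]
    return mean(matched) if matched else 0.0


best, averages = select_by_repeated_splits(logged, candidates, train_fn, score_fn)
```

Averaging over several splits reduces the chance that a single unlucky partition of a small dataset decides the winner. After selection, one natural final step (again an assumption here) is to retrain the chosen configuration on the full dataset before deployment.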

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2201.03181 [pdf, other]

Spiked eigenvalues of high-dimensional sample autocovariance matrices: CLT and applications

Authors: Daning Bi, Xiao Han, Adam Nie, Yanrong Yang

Abstract: High-dimensional autocovariance matrices play an important role in dimension reduction for high-dimensional time series. In this article, we establish the central limit theorem (CLT) for spiked eigenvalues of high-dimensional sample autocovariance matrices, which are developed under general conditions. The spiked eigenvalues are allowed to go to infinity in a flexible way without restrictions in divergence order. Moreover, the number of spiked eigenvalues and the time lag of the autocovariance matrix under this study could be either fixed or tending to infinity when the dimension p and the time length T go to infinity together. As a further statistical application, a novel autocovariance test is proposed to detect the equivalence of spiked eigenvalues for two high-dimensional time series. Various simulation studies are illustrated to justify the theoretical findings. Furthermore, a hierarchical clustering approach based on the autocovariance test is constructed and applied to clustering mortality data from multiple countries.

Submitted 13 May, 2024; v1 submitted 10 January, 2022; originally announced January 2022.
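
For readers unfamiliar with the object studied in the abstract above, a lag-dependent sample autocovariance matrix for a p-dimensional series of length T can be computed as in the following sketch. The centering and the 1/(T - lag) normalization are assumptions chosen for illustration; the paper may use a different convention (for example a symmetrized form or 1/T scaling), and the function name sample_autocovariance is hypothetical.

```python
def sample_autocovariance(series, lag):
    """Lag-`lag` sample autocovariance matrix of a multivariate series.

    `series` is a list of T observations, each a list of p floats.
    Returns a p x p nested list whose (i, j) entry averages
    centered[t + lag][i] * centered[t][j] over the T - lag valid times t.
    """
    T = len(series)
    p = len(series[0])
    if not 0 <= lag < T:
        raise ValueError("lag must satisfy 0 <= lag < T")
    # Center each coordinate by its sample mean.
    means = [sum(x[j] for x in series) / T for j in range(p)]
    centered = [[x[j] - means[j] for j in range(p)] for x in series]
    n = T - lag
    return [
        [sum(centered[t + lag][i] * centered[t][j] for t in range(n)) / n
         for j in range(p)]
        for i in range(p)
    ]


# Small example with T = 3 observations in p = 2 dimensions.
toy_series = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
lag0 = sample_autocovariance(toy_series, 0)
lag1 = sample_autocovariance(toy_series, 1)
```

At lag 0 this reduces to an ordinary sample covariance matrix (with a 1/T factor under the convention assumed here). The spiked eigenvalues discussed in the abstract are the few large eigenvalues of such matrices in the regime where p and T grow together.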

arXiv:2110.14615 [pdf, other]

Play to Grade: Testing Coding Games as Classifying Markov Decision Process

Authors: Allen Nie, Emma Brunskill, Chris Piech

Abstract: Contemporary coding education often presents students with the task of developing programs that have user interaction and complex dynamic systems, such as mouse based games. While pedagogically compelling, there are no contemporary autonomous methods for providing feedback. Notably, interactive programs are impossible to grade by traditional unit tests. In this paper we formalize the challenge of providing feedback to interactive programs as a task of classifying Markov Decision Processes (MDPs). Each student's program fully specifies an MDP where the agent needs to operate and decide, under reasonable generalization, if the dynamics and reward model of the input MDP should be categorized as correct or broken. We demonstrate that by designing a cooperative objective between an agent and an autoregressive model, we can use the agent to sample differential trajectories from the input MDP that allows a classifier to determine membership: Play to Grade. Our method enables an automatic feedback system for interactive code assignments. We release a dataset of 711,274 anonymized student submissions to a single assignment with hand-coded bug labels to support future research.

Submitted 14 December, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: NeurIPS 2021, 16 pages, 7 figures

arXiv:2109.06100 [pdf]

Atomic-Scale Visualization and Manipulation of Domain boundaries in 2D Ferroelectric In2Se3

Authors: Fan Zhang, Zhe Wang, Lixuan Liu, Anmin Nie, Yongji Gong, Wenguang Zhu, Chenggang Tao

Abstract: Domain boundaries in ferroelectric materials exhibit rich and diverse physical properties distinct from their parent materials and have been proposed for novel applications in nanoelectronics and quantum information technology. Due to their complexity and diversity, the internal atomic and electronic structure of domain boundaries that governs the electronic properties as well as the kinetics of domain switching remains far from being elucidated. By using scanning tunneling microscopy and spectroscopy (STM/S) combined with density functional theory (DFT) calculations, we directly visualize the atomic structure of domain boundaries in two-dimensional (2D) ferroelectric beta' In2Se3 down to the monolayer limit and reveal a double-barrier energy potential of the 60° tail to tail domain boundaries for the first time. We further controllably manipulate the domain boundaries with atomic precision by STM and show that the movements of domain boundaries can be driven by the electric field from an STM tip and proceed by the collective shifting of atoms at the domain boundaries. The results will deepen our understanding of domain boundaries in 2D ferroelectric materials and stimulate innovative applications of these materials.

Submitted 13 September, 2021; originally announced September 2021.

Comments: 26 pages (not including SI), 4 figures

arXiv:2108.07258 [pdf, other]

On the Opportunities and Risks of Foundation Models

Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh , et al. (89 additional authors not shown)

Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

arXiv:2011.14819 [pdf]

doi 10.1093/nsr/nwab140

Discovery of carbon-based strongest and hardest amorphous material

Authors: Shuangshuang Zhang, Zihe Li, Kun Luo, Julong He, Yufei Gao, Alexander V. Soldatov, Vicente Benavides, Kaiyuan Shi, Anmin Nie, Bin Zhang, Wentao Hu, Mengdong Ma, Yong Liu, Bin Wen, Guoying Gao, Bing Liu, Yang Zhang, Dongli Yu, Xiang-Feng Zhou, Zhisheng Zhao, Bo Xu, Lei Su, Guoqiang Yang, Olga P. Chernogorova, Yongjun Tian

Abstract: Carbon is likely the most fascinating element of the periodic table because of the diversity of its allotropes stemming from its variable (sp, sp2, and sp3) bonding motifs. Exploration of new forms of carbon has been an eternal theme of contemporary scientific research. Here we report on novel amorphous carbon phases containing a high fraction of sp3 bonded atoms recovered after compressing fullerene C60 to previously unexplored high pressure and temperature. The synthesized carbons are the hardest and strongest amorphous materials known to date, capable of scratching diamond crystal and approaching its strength, which is evidenced by complementary mechanical tests. Photoluminescence and absorption spectra of the materials demonstrate they are semiconductors with tunable bandgaps in the range of 1.5-2.2 eV, comparable to that of amorphous silicon. A remarkable combination of the outstanding mechanical and electronic properties makes this class of amorphous carbons an excellent candidate for photovoltaic applications demanding ultrahigh strength and wear resistance.

Submitted 25 June, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

Comments: 40 pages, 17 figures

Report number: nwab140

Journal ref: National Science Review, 2021

arXiv:2005.04926 [pdf]

Orthogonal electric control of the out-of-plane field-effect in two-dimensional ferroelectric alpha-In2Se3

Authors: Yue Li, Chen Chen, Wei Li, Xiaoyu Mao, Heng Liu, Jianyong Xiang, Anmin Nie, Zhongyuan Liu, Wenguang Zhu, Hualing Zeng

Abstract: Tuning the electric properties of crystalline solids is at the heart of material science and electronics. Generating the electric field-effect via an external voltage is a clean, continuous and systematic method. Here, utilizing the unique electric dipole locking in van der Waals (vdW) ferroelectric alpha-In2Se3, we report a new approach to establish the electric gating effect, where the electrostatic doping in the out-of-plane direction is induced and controlled by an in-plane voltage. With the vertical vdW heterostructure of ultrathin alpha-In2Se3 and MoS2, we validate an in-plane voltage gated coplanar field-effect transistor (CP-FET) with distinguished and retentive on/off ratio. Our results demonstrate unprecedented electric control of ferroelectricity, which paves the way for integrating two-dimensional (2D) ferroelectric into novel nanoelectronic devices with broad applications.

Submitted 11 May, 2020; originally announced May 2020.

arXiv:2004.14451 [pdf, other]

Pragmatic Issue-Sensitive Image Captioning

Authors: Allen Nie, Reuben Cohn-Gordon, Christopher Potts

Abstract: Image captioning systems have recently improved dramatically, but they still tend to produce captions that are insensitive to the communicative goals that captions should meet. To address this, we propose Issue-Sensitive Image Captioning (ISIC). In ISIC, a captioning system is given a target image and an issue, which is a set of images partitioned in a way that specifies what information is relevant. The goal of the captioner is to produce a caption that resolves this issue. To model this task, we use an extension of the Rational Speech Acts model of pragmatic language use. Our extension is built on top of state-of-the-art pretrained neural image captioners and explicitly reasons about issues in our sense. We establish experimentally that these models generate captions that are both highly descriptive and issue-sensitive, and we show how ISIC can complement and enrich the related task of Visual Question Answering.

Submitted 5 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: 15 pages, 7 figures. EMNLP 2020 Findings Accepted

arXiv:2002.05913 [pdf]

Direct Observation of Room-Temperature Dislocation Plasticity in Diamond

Authors: Anmin Nie, Yeqiang Bu, Junquan Huang, Yecheng Shao, Yizhi Zhang, Wentao Hu, Jiabin Liu, Yanbin Wang, Bo Xu, Zhongyuan Liu, Hongtao Wang, Wei Yang, Yongjun Tian

Abstract: It is well known that diamond does not deform plastically at room temperature and usually fails in catastrophic brittle fracture. Here we demonstrate room-temperature dislocation plasticity in sub-micrometer sized diamond pillars by in-situ mechanical testing in the transmission electron microscope. We document in unprecedented detail the spatio-temporal features of the dislocations introduced by the confinement-free compression, including dislocation generation and propagation. Atom-resolved observations with tomographic reconstructions show unequivocally that mixed-type dislocations with Burgers vectors of 1/2<110> are activated in the non-close-packed {001} planes of diamond under uniaxial compression of <111> and <110> directions, respectively, while being activated in the {111} planes under the <100> directional loading, indicating orientation-dependent dislocation plasticity. These results provide new insights into the mechanical behavior of diamond and stimulate reconsideration of the basic deformation mechanism in diamond as well as in other brittle covalent crystals at low temperatures.

Submitted 14 February, 2020; originally announced February 2020.

arXiv:2002.01104 [pdf]

Dislocation Slip or Phase Transformation Lead to Room-Temperature Plasticity in Diamond: Comment on Plastic Deformation of Single-Crystal Diamond Nanopillars

Authors: Yeqiang Bu, Peng Wang, Anmin Nie, Hongtao Wang

Abstract: Despite decades of extensive research on mechanical properties of diamond, much remains to be understood in terms of plastic deformation mechanisms due to the poor deformability at room temperature. In a recent work in Advanced Materials, it was claimed that room-temperature plasticity occurred in <001>-oriented single-crystal diamond nanopillars based on observation of unrecovered deformation inside a scanning electron microscope. The plastic deformation was suggested to be mediated by a phase transition from sp3 carbon to an O8-carbon phase by molecular dynamics simulations. By comparison, our in-situ transmission electron microscopy study reveals that the room-temperature plasticity can be carried out by dislocation slip in both <100> and <111>-oriented diamond nanopillars. The brittle-to-ductile transition is highly dependent on the stress state. We note that the surface structure may play a significant role in the deformation mechanisms as the incipient plasticity always occurs from the surface region in nanoscale diamonds.

Submitted 3 February, 2020; originally announced February 2020.

arXiv:1909.10699 [pdf, other]

LitGen: Genetic Literature Recommendation Guided by Human Explanations

Authors: Allen Nie, Arturo L. Pineda, Matt W. Wright, Hannah Wand, Bryan Wulf, Helio A. Costa, Ronak Y. Patel, Carlos D. Bustamante, James Zou

Abstract: As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetics data. A major rate limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. What makes curation particularly time-consuming is that the curator needs to identify papers that study variant pathogenicity using different types of approaches and evidences---e.g. biochemical assays or case control analysis. In collaboration with the Clinical Genomic Resource (ClinGen)---the flagship NIH program for clinical curation---we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by specific evidence types used by curators to assess for pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evidence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain 7.9%-12.6% relative performance improvement over models learned only on the annotated papers. It is a useful framework to improve clinical variant curation.

Submitted 23 September, 2019; originally announced September 2019.

Comments: 12 pages; 5 figures. Accepted by PSB 2020 (Pacific Symposium on Biocomputing) track: Artificial Intelligence for Enhancing Clinical Medicine

arXiv:1906.01243 [pdf, other]

Learning to Explain: Answering Why-Questions via Rephrasing

Authors: Allen Nie, Erin D. Bennett, Noah D. Goodman

Abstract: Providing plausible responses to why questions is a challenging but critical goal for language based human-machine interaction. Explanations are challenging in that they require many different forms of abstract knowledge and reasoning. Previous work has either relied on human-curated structured knowledge bases or detailed domain representation to generate satisfactory explanations. They are also often limited to ranking pre-existing explanation choices. In our work, we contribute to the under-explored area of generating natural language explanations for general phenomena. We automatically collect large datasets of explanation-phenomenon pairs which allow us to train sequence-to-sequence models to generate natural language explanations. We compare different training strategies and evaluate their performance using both automatic scores and human ratings. We demonstrate that our strategy is sufficient to generate highly plausible explanations for general open-domain phenomena compared to other models trained on different datasets.

Submitted 4 June, 2019; originally announced June 2019.

Comments: 8 pages, 5 figures. 1st ConvAI Workshop at ACL 2019

arXiv:1811.11958 [pdf, other]

Large-scale Generative Modeling to Improve Automated Veterinary Disease Coding

Authors: Yuhui Zhang, Allen Nie, James Zou

Abstract: Supervised learning is limited both by the quantity and quality of the labeled data. In the field of medical record tagging, writing styles between hospitals vary drastically. The knowledge learned from one hospital might not transfer well to another. This problem is amplified in the veterinary medicine domain because veterinary clinics rarely apply medical codes to their records. We proposed and trained the first large-scale generative modeling algorithm in automated disease coding. We demonstrate that generative modeling can learn discriminative features when additionally trained with supervised fine-tuning. We systematically ablate and evaluate the effect of generative modeling on the final system's performance. We compare the performance of our model with several baselines in a challenging cross-hospital setting with substantial domain shift. We outperform competitive baselines by a large margin. In addition, we provide interpretation for what is learned by our model.

Submitted 28 November, 2018; originally announced November 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

Report number: ML4H/2018/83

arXiv:1810.05328 [pdf]

Non-volatile ferroelectric memory effect in ultrathin α-In2Se3

Authors: Siyuan Wan, Yue Li, Wei Li, Xiaoyu Mao, Chen Wang, Jiyu Dong, Anmin Nie, Jianyong Xiang, Zhongyuan Liu, Wenguang Zhu, Hualing Zeng

Abstract: Recent experiments on layered α-In2Se3 have confirmed its room-temperature ferroelectricity under ambient condition. This observation renders α-In2Se3 an excellent platform for developing two-dimensional (2D) layered-material based electronics with nonvolatile functionality. In this letter, we demonstrate non-volatile memory effect in a hybrid 2D ferroelectric field effect transistor (FeFET) made of ultrathin α-In2Se3 and graphene. The resistance of graphene channel in the FeFET is tunable and retentive due to the electrostatic doping, which stems from the electric polarization of the ferroelectric α-In2Se3. The electronic logic bit can be represented and stored with different orientations of electric dipoles in the top-gate ferroelectric. The 2D FeFET can be randomly re-written over more than $10^5$ cycles without losing the non-volatility. Our approach demonstrates a prototype of re-writable non-volatile memory with ferroelectricity in van der Waals 2D materials.

Submitted 11 October, 2018; originally announced October 2018.

Comments: 19 pages, 4 figures

arXiv:1806.10722 [pdf, other]

DeepTag: inferring all-cause diagnoses from clinical notes in under-resourced medical domain

Authors: Allen Nie, Ashley Zehnder, Rodney L. Page, Arturo L. Pineda, Manuel A. Rivas, Carlos D. Bustamante, James Zou

Abstract: Large scale veterinary clinical records can become a powerful resource for patient care and research. However, clinicians lack the time and resource to annotate patient records with standard medical diagnostic codes and most veterinary visits are captured in free text notes. The lack of standard coding makes it challenging to use the clinical data to improve patient care. It is also a major impediment to cross-species translational research, which relies on the ability to accurately identify patient cohorts with specific diagnostic criteria in humans and animals. In order to reduce the coding burden for veterinary clinical practice and aid translational research, we have developed a deep learning algorithm, DeepTag, which automatically infers diagnostic codes from veterinary free text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multi-task LSTM with an improved hierarchical objective that captures the semantic structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain in examples when it is uncertain and defers them to human experts, resulting in improved performance. DeepTag accurately infers disease codes from free text even in challenging cross-hospital settings where the text comes from different clinical settings than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal pre-processing. The technical framework in this work can be applied in other medical domains that currently lack medical coding resources.

Submitted 3 September, 2018; v1 submitted 27 June, 2018; originally announced June 2018.

Comments: 17 pages, 6 figures. Updated the text for clarity

arXiv:1804.08824 [pdf, other]

A Continuous Time GARCH(p,q) Process with Delay

Authors: Adam Nie

Abstract: We investigate the properties of a continuous time GARCH process as the solution to a Lévy driven stochastic functional integral equation. This process occurs as a weak limit of a sequence of discrete time GARCH processes as the time between observations converges to zero and the number of lags grows to infinity. The resulting limit generalizes the COGARCH process and can be interpreted as a COGARCH process with higher orders of lags. We give conditions for the existence, uniqueness and regularity of the solution to the integral equation, and derive a more conventional representation of the process in terms of a stochastic delayed differential equation. Path properties of the volatility process, including piecewise differentiability and positivity, are studied, as well as second order properties of the process, such as uniform L1 and L2 bounds, mean stationarity and asymptotic covariance stationarity.

Submitted 23 April, 2018; originally announced April 2018.

Comments: 24 pages, 2 figures

arXiv:1710.04334 [pdf, other]

DisSent: Sentence Representation Learning from Explicit Discourse Relations

Authors: Allen Nie, Erin D. Bennett, Noah D. Goodman

Abstract: Learning effective representations of sentences is one of the core missions of natural language understanding. Existing models either train on a vast amount of text, or require costly, manually curated sentence relation datasets. We show that with dependency parsing and rule-based rubrics, we can curate a high quality sentence relation task by leveraging explicit discourse relations. We show that our curated dataset provides an excellent signal for learning vector representations of sentence meaning, representing relations that can only be determined when the meanings of two sentences are combined. We demonstrate that the automatically curated corpus allows a bidirectional LSTM sentence encoder to yield high quality sentence embeddings and can serve as a supervised fine-tuning dataset for larger models such as BERT. Our fixed sentence embeddings achieve high performance on a variety of transfer tasks, including SentEval, and we achieve state-of-the-art results on Penn Discourse Treebank's implicit relation prediction task.

Submitted 4 June, 2019; v1 submitted 11 October, 2017; originally announced October 2017.

Comments: 13 pages, 4 figures. ACL 2019

arXiv:1703.02573 [pdf, other]

Data Noising as Smoothing in Neural Network Language Models

Authors: Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, Andrew Y. Ng

Abstract: Data noising is an effective technique for regularizing neural network models. While noising is widely adopted in application domains such as vision and speech, commonly used noising primitives have not been developed for discrete sequence-level settings such as language modeling. In this paper, we derive a connection between input noising in neural network language models and smoothing in $n$-gram models. Using this connection, we draw upon ideas from smoothing to develop effective noising schemes. We demonstrate performance gains when applying the proposed schemes to language modeling and machine translation. Finally, we provide empirical analysis validating the relationship between noising and smoothing.

Submitted 7 March, 2017; originally announced March 2017.

Comments: ICLR 2017

Showing 1–30 of 30 results for author: Nie, A