Showing 1–25 of 25 results for author: Xu, F F

  1. arXiv:2406.14497  [pdf, other]

    cs.SE cs.CL

    CodeRAG-Bench: Can Retrieval Augment Code Generation?

    Authors: Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried

    Abstract: While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving…

    Submitted 20 June, 2024; originally announced June 2024.

  2. Abnormal Bifurcation of the Double Binding Energy Differences and Proton-Neutron Pairing: Nuclei Close to $N=Z$ Line from Ni to Rb

    Authors: Y. P. Wang, Y. K. Wang, F. F. Xu, P. W. Zhao, J. Meng

    Abstract: The recently observed abnormal bifurcation of the double binding energy differences $δV_{pn}$ between the odd-odd and even-even nuclei along the $N=Z$ line from Ni to Rb has challenged the nuclear theories. To solve this problem, a shell-model-like approach based on the relativistic density functional theory is established, by treating simultaneously the neutron-neutron, proton-neutron, and proton…

    Submitted 5 June, 2024; v1 submitted 21 January, 2024; originally announced January 2024.

    Comments: 6 pages, 4 figures

    Journal ref: Phys. Rev. Lett. 132, 232501 (2024)

  3. arXiv:2307.13854  [pdf, other]

    cs.AI cs.CL cs.LG

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

    Abstract: With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, w…

    Submitted 16 April, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/

  4. arXiv:2305.14257  [pdf, other]

    cs.CL cs.AI cs.LG

    Hierarchical Prompting Assists Large Language Model on Web Navigation

    Authors: Abishek Sridhar, Robert Lo, Frank F. Xu, Hao Zhu, Shuyan Zhou

    Abstract: Large language models (LLMs) struggle with processing complicated observations in interactive decision-making tasks. To alleviate this issue, we propose a simple hierarchical prompting approach. Diverging from previous prompting approaches that always put the full observation (e.g. a web page) into the prompt, we propose to first construct an action-aware observation which is more condensed and releva…

    Submitted 29 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 Findings; Natural Language Reasoning and Structured Explanations Workshop at ACL 2023

  5. arXiv:2305.06983  [pdf, other]

    cs.CL cs.LG

    Active Retrieval Augmented Generation

    Authors: Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, Graham Neubig

    Abstract: Despite the remarkable ability of large language models (LMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output. Augmenting LMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on th…

    Submitted 21 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  6. arXiv:2303.15790  [pdf, other]

    hep-ex hep-ph physics.ins-det

    STCF Conceptual Design Report: Volume 1 -- Physics & Detector

    Authors: M. Achasov, X. C. Ai, R. Aliberti, L. P. An, Q. An, X. Z. Bai, Y. Bai, O. Bakina, A. Barnyakov, V. Blinov, V. Bobrovnikov, D. Bodrov, A. Bogomyagkov, A. Bondar, I. Boyko, Z. H. Bu, F. M. Cai, H. Cai, J. J. Cao, Q. H. Cao, Z. Cao, Q. Chang, K. T. Chao, D. Y. Chen, H. Chen, et al. (413 additional authors not shown)

    Abstract: The Super $τ$-Charm facility (STCF) is an electron-positron collider proposed by the Chinese particle physics community. It is designed to operate in a center-of-mass energy range from 2 to 7 GeV with a peak luminosity of $0.5\times 10^{35}{\rm cm}^{-2}{\rm s}^{-1}$ or higher. The STCF will produce a data sample about a factor of 100 larger than that by the present $τ$-Charm factory -- the BEPCII,…

    Submitted 5 October, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Journal ref: Front. Phys. 19(1), 14701 (2024)

  7. arXiv:2301.02828  [pdf, other]

    cs.CL cs.LG

    Why do Nearest Neighbor Language Models Work?

    Authors: Frank F. Xu, Uri Alon, Graham Neubig

    Abstract: Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. Recently, however, retrieval-augmented LMs have been shown to improve over standard neural LMs by access…

    Submitted 17 January, 2023; v1 submitted 7 January, 2023; originally announced January 2023.

    Comments: Preprint, 21 pages

  8. arXiv:2207.05987  [pdf, other]

    cs.CL cs.AI cs.SE

    DocPrompting: Generating Code by Retrieving the Docs

    Authors: Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig

    Abstract: Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in the training data. In contrast, when human pr…

    Submitted 18 February, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

    Comments: ICLR 2023 (notable-top-25%); code and data are available at https://github.com/shuyanzhou/docprompting

  9. arXiv:2203.08388  [pdf, other]

    cs.CL

    MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages

    Authors: Zhiruo Wang, Grace Cuenca, Shuyan Zhou, Frank F. Xu, Graham Neubig

    Abstract: While there has been a recent burgeoning of applications at the intersection of natural and programming languages, such as code generation and code summarization, these applications are usually English-centric. This creates a barrier for program developers who are not proficient in English. To mitigate this gap in technology development across languages, we propose a multilingual dataset, MCoNaLa,…

    Submitted 6 February, 2023; v1 submitted 16 March, 2022; originally announced March 2022.

  10. arXiv:2202.13169  [pdf, other]

    cs.PL cs.CL

    A Systematic Evaluation of Large Language Models of Code

    Authors: Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn

    Abstract: Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation…

    Submitted 4 May, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

    Comments: DL4C@ICLR 2022, and MAPS@PLDI 2022

  11. arXiv:2201.12431  [pdf, other]

    cs.CL cs.LG

    Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

    Authors: Uri Alon, Frank F. Xu, Junxian He, Sudipta Sengupta, Dan Roth, Graham Neubig

    Abstract: Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present Reto…

    Submitted 9 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Accepted to ICML'2022. Code and models are available at https://github.com/neulab/retomaton

  12. arXiv:2110.02870  [pdf, other]

    cs.CL cs.SE

    Capturing Structural Locality in Non-parametric Language Models

    Authors: Frank F. Xu, Junxian He, Graham Neubig, Vincent J. Hellendoorn

    Abstract: Structural locality is a ubiquitous feature of real-world datasets, wherein data points are organized into local hierarchies. Some examples include topical clusters in text or project hierarchies in source code repositories. In this paper, we explore utilizing this structural locality within non-parametric language models, which generate sequences that reference retrieved examples from an external…

    Submitted 1 February, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: ICLR 2022

  13. arXiv:2101.12087  [pdf, other]

    cs.LG cs.SE

    Learning Structural Edits via Incremental Tree Transformations

    Authors: Ziyu Yao, Frank F. Xu, Pengcheng Yin, Huan Sun, Graham Neubig

    Abstract: While most neural generative models generate outputs in a single pass, the human creative process is usually one of iterative building and refinement. Recent work has proposed models of editing processes, but these mostly focus on editing sequential data and/or only model a single editing pass. In this paper, we present a generic model for incremental editing of structured data (i.e., "structural…

    Submitted 4 March, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

    Comments: ICLR 2021

  14. arXiv:2101.11149  [pdf, other]

    cs.SE

    In-IDE Code Generation from Natural Language: Promise and Challenges

    Authors: Frank F. Xu, Bogdan Vasilescu, Graham Neubig

    Abstract: A great part of software development involves conceptualizing or communicating the underlying procedures and logic that needs to be expressed in programs. One major difficulty of programming is turning concept into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural…

    Submitted 22 September, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

    Comments: 47 pages, accepted to ACM Transactions on Software Engineering and Methodology

  15. arXiv:2005.00706  [pdf, other]

    cs.CL cs.CV

    A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos

    Authors: Frank F. Xu, Lei Ji, Botian Shi, Junyi Du, Graham Neubig, Yonatan Bisk, Nan Duan

    Abstract: Instructional videos are often used to learn about procedures. Video captioning is one way of automatically collecting such knowledge. However, it provides only an indirect, overall evaluation of multimodal models with no finer-grained quantitative measure of what they have learned. We instead propose a benchmark of structured procedural knowledge extracted from cooking videos. This work…

    Submitted 9 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted by NLP Beyond Text - First International Workshop on Natural Language Processing Beyond Text @ EMNLP 2020

  16. arXiv:2005.00624  [pdf, other]

    cs.CL cs.IR cs.LG

    Minimally Supervised Categorization of Text with Metadata

    Authors: Yu Zhang, Yu Meng, Jiaxin Huang, Frank F. Xu, Xuan Wang, Jiawei Han

    Abstract: Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and t…

    Submitted 13 November, 2021; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: 10 pages; Accepted to SIGIR 2020; Some typos fixed

  17. arXiv:2004.09015  [pdf, other]

    cs.CL

    Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

    Authors: Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham Neubig

    Abstract: Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the on…

    Submitted 19 April, 2020; originally announced April 2020.

    Comments: Accepted by ACL 2020

  18. arXiv:1911.12543  [pdf, other]

    cs.CL cs.LG

    How Can We Know What Language Models Know?

    Authors: Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig

    Abstract: Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as "Obama is a _ by profession". These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as "Obama worked as a _" may result in more accurately predicting the correct profession. Because of this, given an…

    Submitted 3 May, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: TACL 2020

  19. arXiv:1910.07115  [pdf, other]

    cs.LG cs.CL cs.SE stat.ML

    HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

    Authors: Yu Zhang, Frank F. Xu, Sha Li, Yu Meng, Xuan Wang, Qi Li, Jiawei Han

    Abstract: GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic reposi…

    Submitted 13 November, 2021; v1 submitted 15 October, 2019; originally announced October 2019.

    Comments: 10 pages; Accepted to ICDM 2019; Some typos fixed

  20. StateLens: A Reverse Engineering Solution for Making Existing Dynamic Touchscreens Accessible

    Authors: Anhong Guo, Junhan Kong, Michael Rivera, Frank F. Xu, Jeffrey P. Bigham

    Abstract: Blind people frequently encounter inaccessible dynamic touchscreens in their everyday lives that are difficult, frustrating, and often impossible to use independently. Touchscreens are often the only way to control everything from coffee machines and payment terminals, to subway ticket machines and in-flight entertainment systems. Interacting with dynamic touchscreens is difficult non-visually bec…

    Submitted 19 August, 2019; originally announced August 2019.

    Comments: ACM UIST 2019

  21. arXiv:1901.09501  [pdf, other]

    cs.CL cs.AI cs.LG

    Data-to-Text Generation with Style Imitation

    Authors: Shuai Lin, Wentao Wang, Zichao Yang, Xiaodan Liang, Frank F. Xu, Eric Xing, Zhiting Hu

    Abstract: Recent neural approaches to data-to-text generation have mostly focused on improving content fidelity while lacking explicit control over writing styles (e.g., word choices, sentence structures). More traditional systems use templates to determine the realization of text. Yet manual or automatic construction of high-quality templates is difficult, and a template acting as hard constraints could ha…

    Submitted 9 October, 2020; v1 submitted 27 January, 2019; originally announced January 2019.

    Comments: Accepted by EMNLP 2020 Findings. Significant updates over the previous version. Code & data are available at https://github.com/ha-lins/DTG-SI

  22. arXiv:1711.04204  [pdf, other]

    cs.CL cs.AI

    Automatic Extraction of Commonsense LocatedNear Knowledge

    Authors: Frank F. Xu, Bill Yuchen Lin, Kenny Q. Zhu

    Abstract: LocatedNear relation is a kind of commonsense knowledge describing two physical objects that are typically found near each other in real life. In this paper, we study how to automatically extract such relationship through a sentence-level relation classifier and aggregating the scores of entity pairs from a large corpus. Also, we release two benchmark datasets for evaluation and future research.

    Submitted 12 May, 2018; v1 submitted 11 November, 2017; originally announced November 2017.

    Comments: Accepted by ACL 2018. A preliminary version is presented on AKBC@NIPS'17

  23. arXiv:1710.11169  [pdf, other]

    cs.CL cs.AI

    Indirect Supervision for Relation Extraction using Question-Answer Pairs

    Authors: Zeqiu Wu, Xiang Ren, Frank F. Xu, Ji Li, Jiawei Han

    Abstract: Automatic relation extraction (RE) for types of interest is of great importance for interpreting massive text corpora in an efficient manner. Traditional RE models have heavily relied on human-annotated corpus for training, which can be costly in generating labeled data and become obstacles when dealing with more relation types. Thus, more RE extraction systems have shifted to be built upon traini…

    Submitted 23 November, 2017; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: 9 pages + 1 page reference. Accepted to WSDM 2018

  24. arXiv:1709.04109  [pdf, other]

    cs.CL cs.LG

    Empower Sequence Labeling with Task-Aware Neural Language Model

    Authors: Liyuan Liu, Jingbo Shang, Frank F. Xu, Xiang Ren, Huan Gui, Jian Peng, Jiawei Han

    Abstract: Linguistic sequence labeling is a general modeling approach that encompasses a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted features. However, in many cases, it is hard to obtain sufficient annotations to train these models. In this study, we develop a novel n…

    Submitted 23 November, 2017; v1 submitted 12 September, 2017; originally announced September 2017.

    Comments: AAAI 2018

  25. arXiv:1207.1939  [pdf]

    cond-mat.soft cond-mat.mtrl-sci

    Supramolecular Thermo Aero-able Gelators (STAGs) for synthesis of hydrogels

    Authors: Feng Feng Xue, Dan Dan Yuan, Peng Wang, Atharva Sahasrabudhe, Xiao-Yan Tang, Dianyu Chen, Rongxin Yuan, Soumyajit Roy

    Abstract: Supramolecular Thermo Aero-able Gelators (STAGs): tartaric acid, urea, and guanidine with amide and imine moieties as supramolecular synthons are introduced to cross-link and aerate ('aero-able') polyacrylate networks for synthesis of hydrogels. They are bi-functional hence present a greener alternative to the existing cross-linkers and gelators.

    Submitted 17 July, 2013; v1 submitted 9 July, 2012; originally announced July 2012.

    Comments: 9 pages, 9 figures

    Journal ref: New J. Chem. 2012, 36, 2541