Showing 1–18 of 18 results for author: Bogin, B

  1. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, where…

    Submitted 23 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  2. arXiv:2402.00159  [pdf, other]

    cs.CL

    Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, et al. (11 additional authors not shown)

    Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training dat…

    Submitted 6 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024; Dataset: https://hf.co/datasets/allenai/dolma; Code: https://github.com/allenai/dolma

  3. arXiv:2311.09519  [pdf, other]

    cs.CL

    Leveraging Code to Improve In-context Learning for Semantic Parsing

    Authors: Ben Bogin, Shivanshu Gupta, Peter Clark, Ashish Sabharwal

    Abstract: In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we improve the effectiveness of ICL for semantic parsing by (1) using general-purpose p…

    Submitted 27 March, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  4. arXiv:2304.13007  [pdf, other]

    cs.CL cs.AI

    Answering Questions by Meta-Reasoning over Multiple Chains of Thought

    Authors: Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, Jonathan Berant

    Abstract: Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer. Often, multiple chains are sampled and aggregated through a voting mechanism over the final answers, but the intermediate steps themselves are discarded. While such approaches improve performance, they do not consider t…

    Submitted 17 October, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted for publication in The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Author's final version

  5. arXiv:2212.06800  [pdf, other]

    cs.CL

    Diverse Demonstrations Improve In-context Compositional Generalization

    Authors: Itay Levy, Ben Bogin, Jonathan Berant

    Abstract: In-context learning has shown great success in i.i.d semantic parsing splits, where the training and test sets are drawn from the same distribution. In this setup, models are typically prompted with demonstrations that are similar to the input utterance. However, in the setup of compositional generalization, where models are tested on outputs with structures that are absent from the training set,…

    Submitted 24 June, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  6. arXiv:2211.00262  [pdf, other]

    cs.CL cs.CV

    Training Vision-Language Models with Less Bimodal Supervision

    Authors: Elad Segal, Ben Bogin, Jonathan Berant

    Abstract: Standard practice in pretraining multimodal models, such as vision-language models, is to rely on pairs of aligned inputs from both modalities, for example, aligned image-text pairs. However, such pairs can be difficult to obtain in low-resource settings and for some modality pairs (e.g., structured tables and images). In this work, we investigate the extent to which we can reduce the reliance on…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: AKBC 2022

  7. arXiv:2201.05899  [pdf, other]

    cs.CL

    Unobserved Local Structures Make Compositional Generalization Hard

    Authors: Ben Bogin, Shivanshu Gupta, Jonathan Berant

    Abstract: While recent work has convincingly showed that sequence-to-sequence models struggle to generalize to new compositions (termed compositional generalization), little is known on what makes compositional generalization hard on a particular test instance. In this work, we investigate what are the factors that make generalization to certain test instances challenging. We first substantiate that indeed…

    Submitted 22 October, 2022; v1 submitted 15 January, 2022; originally announced January 2022.

    Comments: EMNLP 2022

  8. arXiv:2109.10613  [pdf, other]

    cs.CL

    COVR: A test-bed for Visually Grounded Compositional Generalization with real images

    Authors: Ben Bogin, Shivanshu Gupta, Matt Gardner, Jonathan Berant

    Abstract: While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed for visually-grounded compositional generalization with real images. To create COVR, we use real images annotated with scene graphs, and propose an almost full…

    Submitted 22 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  9. arXiv:2106.05006  [pdf, other]

    cs.CL

    Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

    Authors: Moshe Hazoom, Vibhor Malik, Ben Bogin

    Abstract: Most available semantic parsing datasets, comprising of pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluation of natural language understanding systems. As a result, they do not contain any of the richness and variety of natural-occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE,…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: NLP4Prog 2021

  10. arXiv:2010.06000  [pdf, other]

    cs.CV cs.CL

    MedICaT: A Dataset of Medical Images, Captions, and Textual References

    Authors: Sanjay Subramanian, Lucy Lu Wang, Sachin Mehta, Ben Bogin, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi

    Abstract: Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: EMNLP-Findings 2020

  11. arXiv:2007.00266  [pdf, other]

    cs.CL cs.AI cs.LG

    Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering

    Authors: Ben Bogin, Sanjay Subramanian, Matt Gardner, Jonathan Berant

    Abstract: Answering questions that involve multi-step reasoning requires decomposing them and using the answers of intermediate steps to reach the final answer. However, state-of-the-art models in grounded question answering often do not explicitly perform decomposition, leading to difficulties in generalization to out-of-distribution examples. In this work, we propose a model that computes a representation…

    Submitted 10 November, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2020. Author's final version

  12. arXiv:2005.00724  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Obtaining Faithful Interpretations from Compositional Neural Networks

    Authors: Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, Matt Gardner

    Abstract: Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explan…

    Submitted 8 September, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: ACL 2020; first three authors contributed equally

  13. arXiv:2004.02709  [pdf, other]

    cs.CL

    Evaluating Models' Local Decision Boundaries via Contrast Sets

    Authors: Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, et al. (1 additional author not shown)

    Abstract: Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systemati…

    Submitted 1 October, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

  14. arXiv:1908.11214  [pdf, other]

    cs.CL

    Global Reasoning over Database Structures for Text-to-SQL Parsing

    Authors: Ben Bogin, Matt Gardner, Jonathan Berant

    Abstract: State-of-the-art semantic parsers rely on auto-regressive decoding, emitting one symbol at a time. When tested against complex databases that are unobserved at training time (zero-shot), the parser often struggles to select the correct set of database constants in the new database, due to the local nature of decoding. In this work, we propose a semantic parser that globally reasons about the struc…

    Submitted 29 August, 2019; originally announced August 2019.

    Comments: EMNLP 2019

  15. arXiv:1905.13326  [pdf, other]

    cs.CL

    Grammar-based Neural Text-to-SQL Generation

    Authors: Kevin Lin, Ben Bogin, Mark Neumann, Jonathan Berant, Matt Gardner

    Abstract: The sequence-to-sequence paradigm employed by neural text-to-SQL models typically performs token-level decoding and does not consider generating SQL hierarchically from a grammar. Grammar-based decoding has shown significant improvements for other semantic parsing tasks, but SQL and other general programming languages have complexities not present in logical formalisms that make writing hierarchic…

    Submitted 30 May, 2019; originally announced May 2019.

  16. arXiv:1905.06241  [pdf, other]

    cs.CL

    Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing

    Authors: Ben Bogin, Matt Gardner, Jonathan Berant

    Abstract: Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In Spider, a recently-released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an enco…

    Submitted 3 June, 2019; v1 submitted 15 May, 2019; originally announced May 2019.

    Comments: Accepted as a short paper at ACL 2019

  17. arXiv:1809.00549  [pdf, other]

    cs.CL cs.AI

    Emergence of Communication in an Interactive World with Consistent Speakers

    Authors: Ben Bogin, Mor Geva, Jonathan Berant

    Abstract: Training agents to communicate with one another given task-based supervision only has attracted considerable attention recently, due to the growing interest in developing models for human-agent interaction. Prior work on the topic focused on simple environments, where training using policy gradient was feasible despite the non-stationarity of the agents during training. In this paper, we present a…

    Submitted 24 March, 2019; v1 submitted 3 September, 2018; originally announced September 2018.

    Comments: Emergent Communication Workshop @ NeurIPS 2018

  18. arXiv:1706.01399  [pdf, ps, other]

    cs.CL

    Language Generation with Recurrent Generative Adversarial Networks without Pre-training

    Authors: Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf

    Abstract: Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for language generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent neural networks. Consequently, past work has either resorted to pre-training with maximum-likelihood or used convolutional networks for generation. In this work…

    Submitted 21 December, 2017; v1 submitted 5 June, 2017; originally announced June 2017.

    Comments: Presented at the 1st Workshop on Learning to Generate Natural Language at ICML 2017