
Showing 1–9 of 9 results for author: Soltan, S

Searching in archive cs.
  1. arXiv:2404.09163  [pdf, other]

    cs.CL cs.AI

    GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning

    Authors: Amani Namboori, Shivam Mangale, Andy Rosenbaum, Saleh Soltan

    Abstract: The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latenc…

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: Accepted to The 37th International Conference on Neural Information Processing Systems (NeurIPS 2023), December 10-16, 2023 - SyntheticData4ML workshop, New Orleans, United States https://neurips.cc/Conferences/2023

  2. arXiv:2306.08756  [pdf, other]

    cs.CL cs.AI cs.LG

    Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models

    Authors: Saleh Soltan, Andy Rosenbaum, Tobias Falke, Qin Lu, Anna Rumshisky, Wael Hamza

    Abstract: Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages, however training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on seque…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: ACL Findings 2023 and SustaiNLP Workshop 2023

  3. arXiv:2210.07074  [pdf, other]

    cs.CL cs.AI cs.LG

    CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing

    Authors: Andy Rosenbaum, Saleh Soltan, Wael Hamza, Amir Saffari, Marco Damonte, Isabel Groves

    Abstract: A bottleneck to developing Semantic Parsing (SP) models is the need for a large volume of human-labeled training data. Given the complexity and cost of human annotation for SP, labeled data is often scarce, particularly in multilingual settings. Large Language Models (LLMs) excel at SP given only a few examples, however LLMs are unsuitable for runtime systems which require low latency. In this wor…

    Submitted 14 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted to AACL-IJCNLP 2022: The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, November 20-23, 2022. See https://www.aacl2022.org/

  4. arXiv:2209.09900  [pdf, other]

    cs.CL cs.AI cs.LG

    LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging

    Authors: Andy Rosenbaum, Saleh Soltan, Wael Hamza, Yannick Versley, Markus Boese

    Abstract: We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a…

    Submitted 20 September, 2022; originally announced September 2022.

    Comments: Accepted to The 29th International Conference on Computational Linguistics (COLING 2022), October 12-17, 2022, Gyeongju, Republic of Korea https://coling2022.org/

  5. arXiv:2208.01448  [pdf, other]

    cs.CL cs.LG

    AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

    Authors: Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan

    Abstract: In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves s…

    Submitted 3 August, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

  6. arXiv:2206.07808  [pdf, other]

    cs.CL cs.AI cs.LG

    Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

    Authors: Jack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, Abhishek Bhagia, Claudio Delli Bovi, ** Cao, Rakesh Chada, Amit Chauhan, Luoxin Chen, Anurag Dwarakanath, Satyam Dwivedi, Turan Gojayev, Karthik Gopalakrishnan, Thomas Gueudre, Dilek Hakkani-Tur, Wael Hamza, Jonathan Hueser, Kevin Martin Jose, Haidar Khan, Beiye Liu, Jianhua Lu, Alessandro Manzotti, Pradeep Natarajan, Karolina Owczarzak, et al. (16 additional authors not shown)

    Abstract: We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform co…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: KDD 2022

    ACM Class: I.2.7

    Journal ref: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA

  7. arXiv:2205.12070  [pdf, other]

    cs.LG cs.AI

    Deep Reinforcement Learning for Multi-class Imbalanced Training

    Authors: Jenny Yang, Rasheed El-Bouri, Odhran O'Donoghue, Alexander S. Lachapelle, Andrew A. S. Soltan, David A. Clifton

    Abstract: With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training extremely imbalanced data sets, and extend it for use in m…

    Submitted 24 May, 2022; originally announced May 2022.

  8. arXiv:2010.03714  [pdf, other]

    cs.CL cs.AI cs.LG

    Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based Decoding

    Authors: Qile Zhu, Haidar Khan, Saleh Soltan, Stephen Rawls, Wael Hamza

    Abstract: Semantic parsing is one of the key components of natural language understanding systems. A successful parse transforms an input utterance to an action that is easily understood by the system. Many algorithms have been proposed to solve this problem, from conventional rule-based or statistical slot-filling systems to shift-reduce based neural parsers. For complex parsing tasks, the state-of-the-art m…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: Presented at CoNLL 2020

  9. arXiv:1607.06509  [pdf, other]

    math.CO cs.CC cs.DM cs.DS

    Doubly Balanced Connected Graph Partitioning

    Authors: Saleh Soltan, Mihalis Yannakakis, Gil Zussman

    Abstract: We introduce and study the Doubly Balanced Connected graph Partitioning (DBCP) problem: Let $G=(V,E)$ be a connected graph with a weight (supply/demand) function $p:V\rightarrow \{-1,+1\}$ satisfying $p(V)=\sum_{j\in V} p(j)=0$. The objective is to partition $G$ into $(V_1,V_2)$ such that $G[V_1]$ and $G[V_2]$ are connected, $|p(V_1)|,|p(V_2)|\leq c_p$, and…

    Submitted 21 July, 2016; originally announced July 2016.