Skip to main content

Showing 1–30 of 30 results for author: Shafran, I

.
  1. arXiv:2405.13640  [pdf, other

    cs.CL cs.AI cs.LG

    Knowledge Graph Reasoning with Self-supervised Reinforcement Learning

    Authors: Ying Ma, Owen Burns, Mingqiu Wang, Gang Li, Nan Du, Laurent El Shafey, Liqiang Wang, Izhak Shafran, Hagen Soltau

    Abstract: Reinforcement learning (RL) is an effective method of finding reasoning pathways in incomplete knowledge graphs (KGs). To overcome the challenges of a large action space, a self-supervised pre-training method is proposed to warm up the policy network before the RL training stage. To alleviate the distributional mismatch issue in general self-supervised RL (SSRL), in our supervised learning (SL) st… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 17 pages, 11 figures

  2. arXiv:2403.05530  [pdf, other

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2402.11450  [pdf, other

    cs.RO

    Learning to Learn Faster from Human Feedback with Language Model Predictive Control

    Authors: Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montserrat Gonzalez Arenas, Maria Attarian, Maria Bauza, Matthew Bennice, Alex Bewley, Adil Dostmohamed, Chuyuan Kelly Fu, Nimrod Gileadi, Marissa Giustina, Keerthana Gopalakrishnan, Leonard Hasenclever, Jan Humplik, Jasmine Hsu, Nikhil Joshi, Ben Jyenis, Chase Kew, Sean Kirmani, Tsang-Wei Edward Lee, Kuang-Huei Lee, Assaf Hurwitz Michaely, Joss Moore , et al. (25 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for o… ▽ More

    Submitted 31 May, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

  4. arXiv:2402.01828  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Retrieval Augmented End-to-End Spoken Dialog Models

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain dom… ▽ More

    Submitted 2 February, 2024; originally announced February 2024.

    Journal ref: Proc. ICASSP 2024

  5. arXiv:2312.11805  [pdf, other

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  6. arXiv:2311.00899  [pdf, other

    cs.RO

    RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

    Authors: Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao

    Abstract: We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiment… ▽ More

    Submitted 1 November, 2023; originally announced November 2023.

  7. arXiv:2310.13010  [pdf, other

    eess.AS cs.AI

    Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model

    Authors: Hagen Soltau, Izhak Shafran, Alex Ottenwess, Joseph R. JR Duffy, Rene L. Utianski, Leland R. Barnard, John L. Stricker, Daniela Wiepert, David T. Jones, Hugo Botha

    Abstract: We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings. Our model compresses long sequences into a small set of class-specific latent representations and a factorized projection is use… ▽ More

    Submitted 16 October, 2023; originally announced October 2023.

    Journal ref: Proc. ASRU, 2023

  8. arXiv:2310.00230  [pdf, other

    cs.CL cs.SD eess.AS

    SLM: Bridge the thin gap between speech and text foundation models

    Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

    Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserves their capabilities, and only trains a simple adapter with just 1\% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achiev… ▽ More

    Submitted 29 September, 2023; originally announced October 2023.

  9. arXiv:2306.08131  [pdf, other

    eess.AS cs.SD

    Efficient Adapters for Giant Speech Models

    Authors: Nanxin Chen, Izhak Shafran, Yu Zhang, Chung-Cheng Chiu, Hagen Soltau, James Qin, Yonghui Wu

    Abstract: Large pre-trained speech models are widely used as the de-facto paradigm, especially in scenarios when there is a limited amount of labeled data available. However, finetuning all parameters from the self-supervised learned model can be computationally expensive, and becomes infeasiable as the size of the model and the number of downstream tasks scales. In this paper, we propose a novel approach c… ▽ More

    Submitted 13 June, 2023; originally announced June 2023.

  10. arXiv:2306.07944  [pdf, other

    eess.AS cs.AI cs.CL

    Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misaligned between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into text token embedding space without speech information loss. Additionally, using a CTC-based blank-filtering, w… ▽ More

    Submitted 8 June, 2023; originally announced June 2023.

  11. arXiv:2305.10601  [pdf, other

    cs.CL cs.AI cs.LG

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Authors: Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan

    Abstract: Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for… ▽ More

    Submitted 3 December, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023 camera ready version. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm

  12. arXiv:2302.12441  [pdf, other

    cs.LG cs.CL

    MUX-PLMs: Data Multiplexing for High-throughput Language Models

    Authors: Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan

    Abstract: The widespread adoption of large language models such as ChatGPT and Bard has led to unprecedented demand for these technologies. The burgeoning cost of inference for ever-increasing model sizes coupled with hardware shortages has limited affordable access and poses a pressing need for efficiency approaches geared towards high throughput and performance. Multi-input multi-output (MIMO) algorithms… ▽ More

    Submitted 22 May, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

  13. arXiv:2212.09939  [pdf, other

    cs.CL

    AnyTOD: A Programmable Task-Oriented Dialog System

    Authors: Jeffrey Zhao, Yuan Cao, Raghav Gupta, Harrison Lee, Abhinav Rastogi, Mingqiu Wang, Hagen Soltau, Izhak Shafran, Yonghui Wu

    Abstract: We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology is provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A ne… ▽ More

    Submitted 13 February, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: v2, update with Multiwoz, SGD results

  14. arXiv:2212.08704  [pdf, other

    cs.AI

    Speech Aware Dialog System Technology Challenge (DSTC11)

    Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Abhinav Rastogi, Jeffrey Zhao, Ye Jia, Wei Han, Yuan Cao, Aramys Miranda

    Abstract: Most research on task oriented dialog modeling is based on written text input. However, users interact with practical dialog systems often using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences in written and spoken language. The research on this topic is st… ▽ More

    Submitted 16 December, 2022; originally announced December 2022.

  15. arXiv:2210.06656  [pdf, other

    cs.CL

    Knowledge-grounded Dialog State Tracking

    Authors: Dian Yu, Mingqiu Wang, Yuan Cao, Izhak Shafran, Laurent El Shafey, Hagen Soltau

    Abstract: Knowledge (including structured knowledge such as schema and ontology, and unstructured knowledge such as web corpus) is a critical part of dialog understanding, especially for unseen tasks and domains. Traditionally, such domain-specific knowledge is encoded implicitly into model parameters for the execution of downstream tasks, which makes training inefficient. In addition, such models are not e… ▽ More

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 Findings

  16. arXiv:2210.03629  [pdf, other

    cs.CL cs.AI cs.LG

    ReAct: Synergizing Reasoning and Acting in Language Models

    Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao

    Abstract: While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific acti… ▽ More

    Submitted 9 March, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: v3 is the ICLR camera ready version with some typos fixed. Project site with code: https://react-lm.github.io

  17. arXiv:2205.04515  [pdf, other

    cs.CL

    Unsupervised Slot Schema Induction for Task-oriented Dialog

    Authors: Dian Yu, Mingqiu Wang, Yuan Cao, Izhak Shafran, Laurent El Shafey, Hagen Soltau

    Abstract: Carefully-designed schemas describing how to collect and annotate dialog corpora are a prerequisite towards building task-oriented dialog systems. In practical applications, manually designing schemas can be error-prone, laborious, iterative, and slow, especially when the schema is complicated. To alleviate this expensive and time consuming process, we propose an unsupervised approach for slot sch… ▽ More

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: NAACL 2022

  18. arXiv:2203.03543  [pdf, other

    cs.CL cs.AI

    RNN Transducers for Nested Named Entity Recognition with constraints on alignment for long sequences

    Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey

    Abstract: Popular solutions to Named Entity Recognition (NER) include conditional random fields, sequence-to-sequence models, or utilizing the question-answering framework. However, they are not suitable for nested and overlap** spans with large ontologies and for predicting the position of the entities. To fill this gap, we introduce a new model for NER task -- an RNN transducer (RNN-T). These models are… ▽ More

    Submitted 8 February, 2022; originally announced March 2022.

  19. arXiv:2201.08904  [pdf, other

    cs.CL cs.AI

    Description-Driven Task-Oriented Dialog Modeling

    Authors: Jeffrey Zhao, Raghav Gupta, Yuan Cao, Dian Yu, Mingqiu Wang, Harrison Lee, Abhinav Rastogi, Izhak Shafran, Yonghui Wu

    Abstract: Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not con… ▽ More

    Submitted 21 January, 2022; originally announced January 2022.

  20. arXiv:2110.15222  [pdf, other

    cs.CL cs.SD eess.AS

    Word-level confidence estimation for RNN transducers

    Authors: Mingqiu Wang, Hagen Soltau, Laurent El Shafey, Izhak Shafran

    Abstract: Confidence estimate is an often requested feature in applications such as medical transcription where errors can impact patient care and the confidence estimate could be used to alert medical professionals to verify potential errors in recognition. In this paper, we present a lightweight neural confidence model tailored for Automatic Speech Recognition (ASR) system with Recurrent Neural Network… ▽ More

    Submitted 28 September, 2021; originally announced October 2021.

    Journal ref: Proc. ASRU 2021

  21. arXiv:2105.04645  [pdf, other

    cs.CL

    R2D2: Relational Text Decoding with Transformers

    Authors: Aryan Arbabi, Mingqiu Wang, Laurent El Shafey, Nan Du, Izhak Shafran

    Abstract: We propose a novel framework for modeling the interaction between graphical structures and the natural language text associated with their nodes and edges. Existing approaches typically fall into two categories. On group ignores the relational structure by converting them into linear sequences and then utilize the highly successful Seq2Seq models. The other side ignores the sequential nature of th… ▽ More

    Submitted 10 May, 2021; originally announced May 2021.

  22. arXiv:2104.02219  [pdf, other

    cs.LG

    Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction

    Authors: Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey

    Abstract: In this paper, we describe novel components for extracting clinically relevant information from medical conversations which will be available as Google APIs. We describe a transformer-based Recurrent Neural Network Transducer (RNN-T) model tailored for long-form audio, which can produce rich transcriptions including speaker segmentation, speaker role labeling, punctuation and capitalization. On a… ▽ More

    Submitted 5 April, 2021; originally announced April 2021.

  23. arXiv:2009.01265  [pdf, ps, other

    cs.CR

    Google COVID-19 Search Trends Symptoms Dataset: Anonymization Process Description (version 1.0)

    Authors: Shailesh Bavadekar, Andrew Dai, John Davis, Damien Desfontaines, Ilya Eckstein, Katie Everett, Alex Fabrikant, Gerardo Flores, Evgeniy Gabrilovich, Krishna Gadepalli, Shane Glass, Rayman Huang, Chaitanya Kamath, Dennis Kraft, Akim Kumok, Hinali Marfatia, Yael Mayer, Benjamin Miller, Adam Pearce, Irippuge Milinda Perera, Venky Ramachandran, Karthik Raman, Thomas Roessler, Izhak Shafran, Tomer Shekel , et al. (5 additional authors not shown)

    Abstract: This report describes the aggregation and anonymization process applied to the initial version of COVID-19 Search Trends symptoms dataset (published at https://goo.gle/covid19symptomdataset on September 2, 2020), a publicly available dataset that shows aggregated, anonymized trends in Google searches for symptoms (and some related topics). The anonymization process is designed to protect the daily… ▽ More

    Submitted 2 September, 2020; originally announced September 2020.

  24. arXiv:2003.11531  [pdf, other

    cs.CL

    The Medical Scribe: Corpus Development and Model Performance Analyses

    Authors: Izhak Shafran, Nan Du, Linh Tran, Amanda Perry, Lauren Keyes, Mark Knichel, Ashley Domin, Lei Huang, Yuhui Chen, Gang Li, Mingqiu Wang, Laurent El Shafey, Hagen Soltau, Justin S. Paul

    Abstract: There is a growing interest in creating tools to assist in clinical note generation using the audio of provider-patient encounters. Motivated by this goal and with the help of providers and medical scribes, we developed an annotation scheme to extract relevant clinical concepts. We used this annotation scheme to label a corpus of about 6k clinical encounters. This was used to train a state-of-the-… ▽ More

    Submitted 11 March, 2020; originally announced March 2020.

    Comments: Extended version of the paper accepted at LREC 2020

    Journal ref: Proceedings of Language Resources and Evaluation, 2020

  25. arXiv:1908.11536  [pdf, other

    cs.CL

    Learning to Infer Entities, Properties and their Relations from Clinical Conversations

    Authors: Nan Du, Mingqiu Wang, Linh Tran, Gang Li, Izhak Shafran

    Abstract: Recently we proposed the Span Attribute Tagging (SAT) Model (Du et al., 2019) to infer clinical entities (e.g., symptoms) and their properties (e.g., duration). It tackles the challenge of large label space and limited training data using a hierarchical two-stage approach that identifies the span of interest in a tagging step and assigns labels to the span in a classification step. We extend the… ▽ More

    Submitted 30 August, 2019; originally announced August 2019.

    Journal ref: Proc. Empirical Methods in Natural Language Processing, 2019

  26. arXiv:1907.05337  [pdf, other

    cs.CL cs.SD eess.AS

    Joint Speech Recognition and Speaker Diarization via Sequence Transduction

    Authors: Laurent El Shafey, Hagen Soltau, Izhak Shafran

    Abstract: Speech applications dealing with conversations require not only recognizing the spoken words, but also determining who spoke when. The task of assigning words to speakers is typically addressed by merging the outputs of two separate systems, namely, an automatic speech recognition (ASR) system and a speaker diarization (SD) system. The two systems are trained independently with different objective… ▽ More

    Submitted 8 July, 2019; originally announced July 2019.

    Journal ref: Proc. Interspeech 2019

  27. arXiv:1906.02246  [pdf, other

    cs.LG cs.CL cs.SD eess.AS eess.SP

    Complex Evolution Recurrent Neural Networks (ceRNNs)

    Authors: Izhak Shafran, Tom Bagby, R. J. Skerry-Ryan

    Abstract: Unitary Evolution Recurrent Neural Networks (uRNNs) have three attractive properties: (a) the unitary property, (b) the complex-valued nature, and (c) their efficient linear operators. The literature so far does not address -- how critical is the unitary property of the model? Furthermore, uRNNs have not been evaluated on large tasks. To study these shortcomings, we propose the complex evolution R… ▽ More

    Submitted 5 June, 2019; originally announced June 2019.

    Journal ref: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5854-5858, 2018

  28. arXiv:1906.02239  [pdf, other

    cs.CL cs.LG

    Extracting Symptoms and their Status from Clinical Conversations

    Authors: Nan Du, Kai Chen, Anjuli Kannan, Linh Tran, Yuhui Chen, Izhak Shafran

    Abstract: This paper describes novel models tailored for a new application, that of extracting the symptoms mentioned in clinical conversations along with their status. Lack of any publicly available corpus in this privacy-sensitive domain led us to develop our own corpus, consisting of about 3K conversations annotated by professional medical scribes. We propose two novel deep learning approaches to infer t… ▽ More

    Submitted 5 June, 2019; originally announced June 2019.

    Journal ref: Proceedings of the Annual Meeting of the Association of Computational Linguistics, 2019

  29. arXiv:1903.07037  [pdf, other

    cs.CL

    Audio De-identification: A New Entity Recognition Task

    Authors: Ido Cohn, Itay Laish, Genady Beryozkin, Gang Li, Izhak Shafran, Idan Szpektor, Tzvika Hartman, Avinatan Hassidim, Yossi Matias

    Abstract: Named Entity Recognition (NER) has been mostly studied in the context of written text. Specifically, NER is an important step in de-identification (de-ID) of medical records, many of which are recorded conversations between a patient and a doctor. In such recordings, audio spans with personal information should be redacted, similar to the redaction of sensitive character spans in de-ID for written… ▽ More

    Submitted 5 May, 2019; v1 submitted 17 March, 2019; originally announced March 2019.

    Comments: Accepted to NAACL 2019 Industry Track

  30. arXiv:1611.06009  [pdf, other

    cs.CV

    Fuzzy Statistical Matrices for Cell Classification

    Authors: Guillaume Thibault, Izhak Shafran

    Abstract: In this paper, we generalize image (texture) statistical descriptors and propose algorithms that improve their efficacy. Recently, a new method showed how the popular Co-Occurrence Matrix (COM) can be modified into a fuzzy version (FCOM) which is more effective and robust to noise. Here, we introduce new fuzzy versions of two additional higher order statistical matrices: the Run Length Matrix (RLM… ▽ More

    Submitted 18 November, 2016; originally announced November 2016.

    Comments: 21 pages, 7 figures, 5 tables