Search | arXiv e-print repository

PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Authors: Avirup Sil, Jaydeep Sen, Bhavani Iyer, Martin Franz, Kshitij Fadnis, Mihaela Bornea, Sara Rosenthal, Scott McCarley, Rong Zhang, Vishwajeet Kumar, Yulong Li, Md Arafat Sultan, Riyaz Bhat, Radu Florian, Salim Roukos

Abstract: The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate… ▽ More The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods. PRIMEQA is available at : https://github.com/primeqa. △ Less

Submitted 25 January, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

arXiv:2204.09248 [pdf, ps, other]

Synthetic Target Domain Supervision for Open Retrieval QA

Authors: Revanth Gangi Reddy, Bhavani Iyer, Md Arafat Sultan, Rong Zhang, Avirup Sil, Vittorio Castelli, Radu Florian, Salim Roukos

Abstract: Neural passage retrieval is a new and promising approach in open retrieval question answering. In this work, we stress-test the Dense Passage Retriever (DPR) -- a state-of-the-art (SOTA) open domain neural retrieval model -- on closed and specialized target domains such as COVID-19, and find that it lags behind standard BM25 in this important real-world setting. To make DPR more robust under domai… ▽ More Neural passage retrieval is a new and promising approach in open retrieval question answering. In this work, we stress-test the Dense Passage Retriever (DPR) -- a state-of-the-art (SOTA) open domain neural retrieval model -- on closed and specialized target domains such as COVID-19, and find that it lags behind standard BM25 in this important real-world setting. To make DPR more robust under domain shift, we explore its fine-tuning with synthetic training examples, which we generate from unlabeled target domain text using a text-to-text generator. In our experiments, this noisy but fully automated target domain supervision gives DPR a sizable advantage over BM25 in out-of-domain settings, making it a more viable model in practice. Finally, an ensemble of BM25 and our improved DPR model yields the best results, further pushing the SOTA for open retrieval QA on multiple out-of-domain test sets. △ Less

Submitted 20 April, 2022; originally announced April 2022.

Comments: Published at SIGIR 2021

arXiv:2112.08185 [pdf, other]

Learning Cross-Lingual IR from an English Retriever

Authors: Yulong Li, Martin Franz, Md Arafat Sultan, Bhavani Iyer, Young-Suk Lee, Avirup Sil

Abstract: We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, execu… ▽ More We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing. △ Less

Submitted 31 July, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: Presented at NAACL 2022 main conference Code can be found at: https://github.com/primeqa/primeqa

arXiv:2012.01414 [pdf, other]

End-to-End QA on COVID-19: Domain Adaptation with Synthetic Training

Authors: Revanth Gangi Reddy, Bhavani Iyer, Md Arafat Sultan, Rong Zhang, Avi Sil, Vittorio Castelli, Radu Florian, Salim Roukos

Abstract: End-to-end question answering (QA) requires both information retrieval (IR) over a large document collection and machine reading comprehension (MRC) on the retrieved passages. Recent work has successfully trained neural IR systems using only supervised question answering (QA) examples from open-domain datasets. However, despite impressive performance on Wikipedia, neural IR lags behind traditional… ▽ More End-to-end question answering (QA) requires both information retrieval (IR) over a large document collection and machine reading comprehension (MRC) on the retrieved passages. Recent work has successfully trained neural IR systems using only supervised question answering (QA) examples from open-domain datasets. However, despite impressive performance on Wikipedia, neural IR lags behind traditional term matching approaches such as BM25 in more specific and specialized target domains such as COVID-19. Furthermore, given little or no labeled data, effective adaptation of QA systems can also be challenging in such target domains. In this work, we explore the application of synthetically generated QA examples to improve performance on closed-domain retrieval and MRC. We combine our neural IR and MRC systems and show significant improvements in end-to-end QA on the CORD-19 collection over a state-of-the-art open-domain QA baseline. △ Less

Submitted 2 December, 2020; originally announced December 2020.

Comments: Preprint

arXiv:2007.12780 [pdf, other]

A Canonical Architecture For Predictive Analytics on Longitudinal Patient Records

Authors: Parthasarathy Suryanarayanan, Bhavani Iyer, Prithwish Chakraborty, Bibo Hao, Italo Buleje, Piyush Madan, James Codella, Antonio Foncubierta, Divya Pathak, Sarah Miller, Amol Rajmane, Shannon Harrer, Gigi Yuan-Reed, Daby Sow

Abstract: Many institutions within the healthcare ecosystem are making significant investments in AI technologies to optimize their business operations at lower cost with improved patient outcomes. Despite the hype with AI, the full realization of this potential is seriously hindered by several systemic problems, including data privacy, security, bias, fairness, and explainability. In this paper, we propose… ▽ More Many institutions within the healthcare ecosystem are making significant investments in AI technologies to optimize their business operations at lower cost with improved patient outcomes. Despite the hype with AI, the full realization of this potential is seriously hindered by several systemic problems, including data privacy, security, bias, fairness, and explainability. In this paper, we propose a novel canonical architecture for the development of AI models in healthcare that addresses these challenges. This system enables the creation and management of AI predictive models throughout all the phases of their life cycle, including data ingestion, model building, and model promotion in production environments. This paper describes this architecture in detail, along with a qualitative evaluation of our experience of using it on real world problems. △ Less

Submitted 5 January, 2021; v1 submitted 24 July, 2020; originally announced July 2020.

Comments: Presented at DSHealth 2020 KDD Workshop on Applied Data Science for Healthcare

arXiv:1205.1853 [pdf]

Goal Directed Relative Skyline Queries in Time Dependent Road Networks

Authors: K. B. Priya Iyer, V. Shanthi

Abstract: The Wireless GIS technology is progressing rapidly in the area of mobile communications. Location-based spatial queries are becoming an integral part of many new mobile applications. The Skyline queries are latest apps under Location-based services. In this paper we introduce Goal Directed Relative Skyline queries on Time dependent (GD-RST) road networks. The algorithm uses travel time as a metric… ▽ More The Wireless GIS technology is progressing rapidly in the area of mobile communications. Location-based spatial queries are becoming an integral part of many new mobile applications. The Skyline queries are latest apps under Location-based services. In this paper we introduce Goal Directed Relative Skyline queries on Time dependent (GD-RST) road networks. The algorithm uses travel time as a metric in finding the data object by considering multiple query points (multi-source skyline) relative to user location and in the user direction of travelling. We design an efficient algorithm based on Filter phase, Heap phase and Refine Skyline phases. At the end, we propose a dynamic skyline caching (DSC) mechanism which helps to reduce the computation cost for future skyline queries. The experimental evaluation reflects the performance of GD-RST algorithm over the traditional branch and bound algorithm for skyline queries in real road networks. △ Less

Submitted 8 May, 2012; originally announced May 2012.

Showing 1–6 of 6 results for author: Iyer, B