
Showing 1–50 of 52 results for author: MacAvaney, S

Searching in archive cs.
  1. arXiv:2405.14589  [pdf, other]

    cs.IR

    Top-Down Partitioning for Efficient List-Wise Ranking

    Authors: Andrew Parry, Sean MacAvaney, Debasis Ganguly

    Abstract: Large Language Models (LLMs) have significantly impacted many facets of natural language processing and information retrieval. Unlike previous encoder-based approaches, the enlarged context window of these generative models allows for ranking multiple documents at once, commonly called list-wise ranking. However, there are still limits to the number of documents that can be ranked in a single infe…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 16 pages, 3 figures, 2 tables

  2. arXiv:2405.01122  [pdf, other]

    cs.IR

    Generative Relevance Feedback and Convergence of Adaptive Re-Ranking: University of Glasgow Terrier Team at TREC DL 2023

    Authors: Andrew Parry, Thomas Jaenich, Sean MacAvaney, Iadh Ounis

    Abstract: This paper describes our participation in the TREC 2023 Deep Learning Track. We submitted runs that apply generative relevance feedback from a large language model in both a zero-shot and pseudo-relevance feedback setting over two sparse retrieval approaches, namely BM25 and SPLADE. We couple this first stage with adaptive re-ranking over a BM25 corpus graph scored using a monoELECTRA cross-encode…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 5 pages, 5 figures, TREC Deep Learning 2023 Notebook

  3. On the Evaluation of Machine-Generated Reports

    Authors: James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

    Abstract: Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of…

    Submitted 9 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 12 pages, 4 figures, accepted at SIGIR 2024 as perspective paper

  4. arXiv:2405.00469  [pdf, other]

    cs.IR

    Exploiting Positional Bias for Query-Agnostic Generative Content in Search

    Authors: Andrew Parry, Sean MacAvaney, Debasis Ganguly

    Abstract: In recent years, neural ranking models (NRMs) have been shown to substantially outperform their lexical counterparts in text retrieval. In traditional search pipelines, a combination of features leads to well-defined behaviour. However, as neural approaches become increasingly prevalent as the final scoring component of engines or as standalone systems, their robustness to malicious text and, more…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 8 pages, 4 main figures, 7 appendix pages, 2 appendix figures

  5. A Reproducibility Study of PLAID

    Authors: Sean MacAvaney, Nicola Tonellotto

    Abstract: The PLAID (Performance-optimized Late Interaction Driver) algorithm for ColBERTv2 uses clustered term representations to retrieve and progressively prune documents for final (exact) document scoring. In this paper, we reproduce and fill in missing gaps from the original work. By studying the parameters PLAID introduces, we find that its Pareto frontier is formed of a careful balance among its thre…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: SIGIR 2024 (reproducibility track)

  6. arXiv:2404.08071  [pdf, other]

    cs.IR

    Overview of the TREC 2023 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the impact of neural approaches to cross-language information retrieval. The track has created four collections, large collections of Chinese, Persian, and Russian newswire and a smaller collection of Chinese scientific abstracts. The principal tasks are ranked retrieval of news in one of the thr…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 27 pages, 17 figures. Part of the TREC 2023 Proceedings

  7. Shallow Cross-Encoders for Low-Latency Retrieval

    Authors: Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald

    Abstract: Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. However, keeping search latencies low is important for user satisfaction and energy usage. In this pape…

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: Accepted by ECIR2024

  8. arXiv:2403.15246  [pdf, other]

    cs.IR cs.CL cs.LG

    FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

    Authors: Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini

    Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, w…

    Submitted 7 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  9. arXiv:2403.07654  [pdf, other]

    cs.IR

    Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models

    Authors: Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen

    Abstract: Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding targe…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 13 pages, 3 figures, Accepted at ECIR 2024 as a Full Paper

  10. arXiv:2403.01981  [pdf, other]

    cs.IR

    Evaluating the Explainability of Neural Rankers

    Authors: Saran Pandian, Debasis Ganguly, Sean MacAvaney

    Abstract: Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of the search models has been able to demonstrate improvements in effectiveness (measured in terms of relevance of top-retrieved result…

    Submitted 4 March, 2024; originally announced March 2024.

  11. arXiv:2401.11198  [pdf, other]

    cs.IR

    A Deep Learning Approach for Selective Relevance Feedback

    Authors: Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene

    Abstract: Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-ba…

    Submitted 20 January, 2024; originally announced January 2024.

  12. On the Effects of Regional Spelling Conventions in Retrieval Models

    Authors: Andreas Chari, Sean MacAvaney, Iadh Ounis

    Abstract: One advantage of neural ranking models is that they are meant to generalise well in situations of synonymity i.e. where two words have similar or identical meanings. In this paper, we investigate and quantify how well various ranking models perform in a clear-cut case of synonymity: when words are simply expressed in different surface forms due to regional differences in spelling conventions (e.g.…

    Submitted 1 August, 2023; originally announced August 2023.

    Comments: 10 pages, 3 tables, short paper published in SIGIR '23

    Journal ref: SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2023, Pages 2220-2224

  13. arXiv:2308.00415  [pdf, other]

    cs.IR

    Generative Query Reformulation for Effective Adhoc Search

    Authors: Xiao Wang, Sean MacAvaney, Craig Macdonald, Iadh Ounis

    Abstract: Performing automatic reformulations of a user's query is a popular paradigm used in information retrieval (IR) for improving effectiveness -- as exemplified by the pseudo-relevance feedback approaches, which expand the query in order to alleviate the vocabulary mismatch problem. Recent advancements in generative language models have demonstrated their ability in generating responses that are relev…

    Submitted 1 August, 2023; originally announced August 2023.

    Comments: Accepted to Gen-IR@SIGIR2023 Workshop

  14. Lexically-Accelerated Dense Retrieval

    Authors: Hrishikesh Kulkarni, Sean MacAvaney, Nazli Goharian, Ophir Frieder

    Abstract: Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user's query (thereby improving recall) is one of their key advantages. However, to actually achieve…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: SIGIR 2023

  15. arXiv:2306.17082  [pdf, other]

    cs.IR

    Adaptive Latent Entity Expansion for Document Retrieval

    Authors: Iain Mackie, Shubham Chatterjee, Sean MacAvaney, Jeffrey Dalton

    Abstract: Despite considerable progress in neural relevance ranking techniques, search engines still struggle to process complex queries effectively - both in terms of precision and recall. Sparse and dense Pseudo-Relevance Feedback (PRF) approaches have the potential to overcome limitations in recall, but are only effective with high precision in the top ranks. In this work, we tackle the problem of search…

    Submitted 4 December, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

  16. arXiv:2306.09657  [pdf, other]

    cs.IR cs.CL

    Online Distillation for Pseudo-Relevance Feedback

    Authors: Sean MacAvaney, Xi Wang

    Abstract: Model distillation has emerged as a prominent technique to improve neural search models. To date, distillation has taken an offline approach, wherein a new neural model is trained to predict relevance scores between arbitrary queries and documents. In this paper, we explore a departure from this offline distillation strategy by investigating whether a model for a specific query can be effectively dist…

    Submitted 16 June, 2023; originally announced June 2023.

  17. The Information Retrieval Experiment Platform

    Authors: Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 11 pages. To be published in the proceedings of SIGIR 2023

  18. arXiv:2305.18494  [pdf, other]

    cs.IR cs.LG

    Adapting Learned Sparse Retrieval for Long Documents

    Authors: Thong Nguyen, Sean MacAvaney, Andrew Yates

    Abstract: Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary. While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents. We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is cruc…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: SIGIR 2023

    Journal ref: SIGIR 2023

  19. arXiv:2304.12367  [pdf, other]

    cs.IR

    Overview of the TREC 2022 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches to cross-language information retrieval. The main task in this year's track was ad hoc ranked retrieval of Chinese, Persian, or Russian newswire documents using queries expressed in English. Topics were developed using standard TREC processes, except that topics developed by an annot…

    Submitted 24 September, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: 22 pages, 13 figures, 10 tables. Part of the Thirty-First Text REtrieval Conference (TREC 2022) Proceedings. Replace the misplaced Russian result table

  20. arXiv:2303.13416  [pdf, other]

    cs.IR

    A Unified Framework for Learned Sparse Retrieval

    Authors: Thong Nguyen, Sean MacAvaney, Andrew Yates

    Abstract: Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been recently introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, many LSR methods show substantial diff…

    Submitted 23 March, 2023; originally announced March 2023.

    Journal ref: ECIR 2023

  21. One-Shot Labeling for Automatic Relevance Estimation

    Authors: Sean MacAvaney, Luca Soldaini

    Abstract: Dealing with unjudged documents ("holes") in relevance assessments is a perennial problem when evaluating search systems with offline experiments. Holes can reduce the apparent effectiveness of retrieval systems during evaluation and introduce biases in models trained with incomplete data. In this work, we explore whether large language models can help us fill such holes to improve offline evaluat…

    Submitted 11 July, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: SIGIR 2023

  22. arXiv:2301.03266  [pdf, other]

    cs.IR

    Doc2Query--: When Less is More

    Authors: Mitko Gospodinov, Sean MacAvaney, Craig Macdonald

    Abstract: Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hal…

    Submitted 27 February, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

    Comments: ECIR 2023

  23. Adaptive Re-Ranking with a Corpus Graph

    Authors: Sean MacAvaney, Nicola Tonellotto, Craig Macdonald

    Abstract: Search systems often employ a re-ranking pipeline, wherein documents (or passages) from an initial pool of candidates are assigned new ranking scores. The process enables the use of highly-effective but expensive scoring functions that are not suitable for use directly in structures like inverted indices or approximate nearest neighbour indices. However, re-ranking pipelines are inherently limited…

    Submitted 18 August, 2022; originally announced August 2022.

    Comments: CIKM 2022

  24. arXiv:2205.04546  [pdf, other]

    cs.IR

    CODEC: Complex Document and Entity Collection

    Authors: Iain Mackie, Paul Owoicho, Carlos Gemmell, Sophie Fischer, Sean MacAvaney, Jeffrey Dalton

    Abstract: CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, i.e. "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert jud…

    Submitted 17 May, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: 10 pages, SIGIR 2022 Preprint

    ACM Class: H.3.3

  25. On Survivorship Bias in MS MARCO

    Authors: Prashansa Gupta, Sean MacAvaney

    Abstract: Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset, given that annotators could not find answers to 38--45% of the queries, leading to these queries being discarded in training and evaluation processes. Although we find t…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: SIGIR 2022

  26. arXiv:2201.08622  [pdf, other]

    cs.IR

    Reproducing Personalised Session Search over the AOL Query Log

    Authors: Sean MacAvaney, Craig Macdonald, Iadh Ounis

    Abstract: Despite its troubled past, the AOL Query Log continues to be an important resource to the research community -- particularly for tasks like search personalisation. When using the query log for these ranking experiments, little attention is usually paid to the document corpus. Recent work typically uses a corpus containing versions of the documents collected long after the log was produced. Given that…

    Submitted 21 January, 2022; originally announced January 2022.

    Comments: ECIR 2022 (reproducibility)

  27. arXiv:2111.13466  [pdf, ps, other]

    cs.IR

    Streamlining Evaluation with ir-measures

    Authors: Sean MacAvaney, Craig Macdonald, Iadh Ounis

    Abstract: We present ir-measures, a new tool that makes it convenient to calculate a diverse set of evaluation measures used in information retrieval. Rather than implementing its own measure calculations, ir-measures provides a common interface to a handful of evaluation tools. The necessary tools are automatically invoked (potentially multiple times) to calculate all the desired metrics, simplifying the e…

    Submitted 26 November, 2021; originally announced November 2021.

    Comments: ECIR 2022 (demo)

  28. arXiv:2108.13810  [pdf, other]

    cs.LG cs.AI stat.ML

    Max-Utility Based Arm Selection Strategy For Sequential Query Recommendations

    Authors: Shameem A. Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith, Sean MacAvaney, Evangelos Zervas

    Abstract: We consider the query recommendation problem in closed loop interactive learning settings like online information gathering and exploratory analytics. The problem can be naturally modelled using the Multi-Armed Bandits (MAB) framework with countably many arms. The standard MAB algorithms for countably many arms begin with selecting a random set of candidate arms and then applying standard MAB algo…

    Submitted 31 August, 2021; originally announced August 2021.

    Report number: 2021

    Journal ref: Asian Conference on Machine Learning 2021

  29. arXiv:2108.04026  [pdf, other]

    cs.IR

    IntenT5: Search Result Diversification using Causal Language Models

    Authors: Sean MacAvaney, Craig Macdonald, Roderick Murray-Smith, Iadh Ounis

    Abstract: Search result diversification is a beneficial approach to overcome under-specified queries, such as those that are ambiguous or multi-faceted. Existing approaches often rely on massive query logs and interaction data to generate a variety of possible query intents, which then can be used to re-rank documents. However, relying on user interaction data is problematic because one first needs a massiv…

    Submitted 9 August, 2021; originally announced August 2021.

  30. arXiv:2105.01044  [pdf, other]

    cs.IR cs.CL

    Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review

    Authors: Eugene Yang, Sean MacAvaney, David D. Lewis, Ophir Frieder

    Abstract: Technology-assisted review (TAR) refers to iterative active learning workflows for document review in high recall retrieval (HRR) tasks. TAR research and most commercial TAR software have applied linear models such as logistic regression to lexical features. Transformer-based models with supervised tuning are known to improve effectiveness on many text classification tasks, suggesting their use in…

    Submitted 19 January, 2022; v1 submitted 3 May, 2021; originally announced May 2021.

    Comments: 6 pages, 1 figure, accepted at ECIR 2022

  31. Simplified Data Wrangling with ir_datasets

    Authors: Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman Cohan, Nazli Goharian

    Abstract: Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and…

    Submitted 10 May, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

    Comments: SIGIR 2021 Resource

  32. arXiv:2103.01328  [pdf, other]

    cs.CL

    ToxCCIn: Toxic Content Classification with Interpretability

    Authors: Tong Xiang, Sean MacAvaney, Eugene Yang, Nazli Goharian

    Abstract: Despite the recent successes of transformer-based models in terms of effectiveness on a variety of tasks, their decisions often remain opaque to humans. Explanations are particularly important for tasks like offensive language or toxicity detection on social media because a manual appeal process is often in place to dispute automatically flagged content. In this work, we propose a technique to imp…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Long paper accepted to WASSA2021@EACL

  33. ABNIRML: Analyzing the Behavior of Neural IR Models

    Authors: Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, Arman Cohan

    Abstract: Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well-understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new…

    Submitted 20 July, 2023; v1 submitted 1 November, 2020; originally announced November 2020.

    Comments: TACL version

  34. arXiv:2010.05987  [pdf, other]

    cs.CL cs.IR

    SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search

    Authors: Sean MacAvaney, Arman Cohan, Nazli Goharian

    Abstract: With worldwide concerns surrounding the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), there is a rapidly growing body of scientific literature on the virus. Clinicians, researchers, and policy-makers need to be able to search these articles effectively. In this work, we present a zero-shot ranking algorithm that adapts to COVID-related scientific literature. Our approach filters tr…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: EMNLP 2020. This article draws heavily from arXiv:2005.02365

  35. arXiv:2008.09093  [pdf, other]

    cs.IR

    PARADE: Passage Representation Aggregation for Document Reranking

    Authors: Canjia Li, Andrew Yates, Sean MacAvaney, Ben He, Yingfei Sun

    Abstract: Pretrained transformer models, such as BERT and T5, have been shown to be highly effective at ad-hoc passage and document ranking. Due to inherent sequence length limits of these models, they need to be run over a document's passages, rather than processing the entire document sequence at once. Although several approaches for aggregating passage-level signals have been proposed, there has yet to be an…

    Submitted 10 June, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

  36. arXiv:2007.14477  [pdf, ps, other]

    cs.CL

    GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language Detection

    Authors: Sajad Sotudeh, Tong Xiang, Hao-Ren Yao, Sean MacAvaney, Eugene Yang, Nazli Goharian, Ophir Frieder

    Abstract: Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our expe…

    Submitted 28 July, 2020; originally announced July 2020.

    Comments: SemEval 2020

  37. arXiv:2005.08805  [pdf, ps, other]

    cs.CL

    Interaction Matching for Long-Tail Multi-Label Classification

    Authors: Sean MacAvaney, Franck Dernoncourt, Walter Chang, Nazli Goharian, Ophir Frieder

    Abstract: We present an elegant and effective approach for addressing limitations in existing multi-label classification models by incorporating interaction matching, a concept shown to be useful for ad-hoc search result ranking. By performing soft n-gram interaction matching, we match labels with natural language descriptions (which are common to have in most multi-labeling tasks). Our approach can be used…

    Submitted 18 May, 2020; originally announced May 2020.

  38. arXiv:2005.02365  [pdf, other]

    cs.IR cs.CL

    SLEDGE: A Simple Yet Effective Baseline for COVID-19 Scientific Knowledge Search

    Authors: Sean MacAvaney, Arman Cohan, Nazli Goharian

    Abstract: With worldwide concerns surrounding the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), there is a rapidly growing body of literature on the virus. Clinicians, researchers, and policy-makers need a way to effectively search these articles. In this work, we present a search system called SLEDGE, which utilizes SciBERT to effectively re-rank articles. We train the model on a general-do…

    Submitted 3 August, 2020; v1 submitted 5 May, 2020; originally announced May 2020.

  39. Training Curricula for Open Domain Answer Re-Ranking

    Authors: Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, Ophir Frieder

    Abstract: In precision-oriented tasks like answer ranking, it is more important to rank many relevant answers highly than to retrieve all relevant answers. It follows that a good ranking strategy would be to learn how to identify the easiest correct answers first (i.e., assign a high ranking score to answers that have characteristics that usually indicate relevance, and a low ranking score to those with cha…

    Submitted 21 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted at SIGIR 2020 (long)

  40. Efficient Document Re-Ranking for Transformers by Precomputing Term Representations

    Authors: Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, Ophir Frieder

    Abstract: Deep pretrained transformer networks are effective at various ranking tasks, such as question answering and ad-hoc document ranking. However, their computational expenses deem them cost-prohibitive in practice. Our proposed approach, called PreTTR (Precomputing Transformer Term Representations), considerably reduces the query-time latency of deep transformer networks (up to a 42x speedup on web do…

    Submitted 26 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted at SIGIR 2020 (long)

  41. Expansion via Prediction of Importance with Contextualization

    Authors: Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, Ophir Frieder

    Abstract: The identification of relevance with little textual context is a primary challenge in passage retrieval. We address this problem with a representation-based ranking approach that: (1) explicitly models the importance of each term using a contextualized language model; (2) performs passage expansion by propagating the importance to similar terms; and (3) grounds the representations in the lexicon,…

    Submitted 20 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted at SIGIR 2020 (short)

  42. Ranking Significant Discrepancies in Clinical Reports

    Authors: Sean MacAvaney, Arman Cohan, Nazli Goharian, Ross Filice

    Abstract: Medical errors are a major public health concern and a leading cause of death worldwide. Many healthcare centers and hospitals use reporting systems where medical practitioners write a preliminary medical report and the report is later reviewed, revised, and finalized by a more experienced physician. The revisions range from stylistic to corrections of critical errors or misinterpretations of the…

    Submitted 18 January, 2020; originally announced January 2020.

    Comments: ECIR 2020 (short)

  43. Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning

    Authors: Sean MacAvaney, Luca Soldaini, Nazli Goharian

    Abstract: While billions of non-English speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of data sets that are suitable to train ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on En…

    Submitted 30 December, 2019; originally announced December 2019.

    Comments: ECIR 2020 (short)

  44. arXiv:1905.05818  [pdf, other]

    cs.CL cs.IR

    Ontology-Aware Clinical Abstractive Summarization

    Authors: Sean MacAvaney, Sajad Sotudeh, Arman Cohan, Nazli Goharian, Ish Talati, Ross W. Filice

    Abstract: Automatically generating accurate summaries from clinical reports could save a clinician's time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and summary generation. We apply our method to a dataset of radiology reports and show that it significantly…

    Submitted 14 May, 2019; originally announced May 2019.

    Comments: 4 pages; SIGIR 2019 Short Paper

  45. CEDR: Contextualized Embeddings for Document Ranking

    Authors: Sean MacAvaney, Andrew Yates, Arman Cohan, Nazli Goharian

    Abstract: Although considerable attention has been given to neural ranking architectures recently, far less attention has been paid to the term representations that are used as input to these models. In this work, we investigate how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking. Through experiments on TREC benchmarks, we find that several existing…

    Submitted 19 August, 2019; v1 submitted 15 April, 2019; originally announced April 2019.

    Comments: Appeared in SIGIR 2019, 4 pages

  46. Overcoming low-utility facets for complex answer retrieval

    Authors: Sean MacAvaney, Andrew Yates, Arman Cohan, Luca Soldaini, Kai Hui, Nazli Goharian, Ophir Frieder

    Abstract: Many questions cannot be answered simply; their answers must include numerous nuanced details and additional context. Complex Answer Retrieval (CAR) is the retrieval of answers to such questions. In their simplest form, these questions are constructed from a topic entity (e.g., 'cheese') and a facet (e.g., 'health effects'). While topic matching has been thoroughly explored, we observe that some f…

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: This is a pre-print of an article published in Information Retrieval Journal. The final authenticated version (including additional experimental results, analysis, etc.) is available online at: https://doi.org/10.1007/s10791-018-9343-0

    Journal ref: Information Retrieval Journal 2018

  47. arXiv:1806.07916  [pdf, other]

    cs.CL

    RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses

    Authors: Sean MacAvaney, Bart Desmet, Arman Cohan, Luca Soldaini, Andrew Yates, Ayah Zirikly, Nazli Goharian

    Abstract: Self-reported diagnosis statements have been widely employed in studying language related to mental health in social media. However, existing research has largely ignored the temporality of mental health diagnoses. In this work, we introduce RSDD-Time: a new dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis.…

    Submitted 20 June, 2018; originally announced June 2018.

    Comments: 6 pages, accepted for publication at the CLPsych workshop at NAACL-HLT 2018

  48. arXiv:1806.05258  [pdf, other]

    cs.CL

    SMHD: A Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions

    Authors: Arman Cohan, Bart Desmet, Andrew Yates, Luca Soldaini, Sean MacAvaney, Nazli Goharian

    Abstract: Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported di…

    Submitted 10 July, 2018; v1 submitted 13 June, 2018; originally announced June 2018.

    Comments: COLING 2018

  49. arXiv:1805.00791  [pdf, other]

    cs.IR

    Characterizing Question Facets for Complex Answer Retrieval

    Authors: Sean MacAvaney, Andrew Yates, Arman Cohan, Luca Soldaini, Kai Hui, Nazli Goharian, Ophir Frieder

    Abstract: Complex answer retrieval (CAR) is the process of retrieving answers to questions that have multifaceted or nuanced answers. In this work, we present two novel approaches for CAR based on the observation that question facets can vary in utility: from structural (facets that can apply to many similar topics, such as 'History') to topical (facets that are specific to the question's topic, such as the…

    Submitted 2 May, 2018; originally announced May 2018.

    Comments: 4 pages; SIGIR 2018 Short Paper

  50. arXiv:1804.05972  [pdf, ps, other]

    cs.CL

    A Deeper Look into Dependency-Based Word Embeddings

    Authors: Sean MacAvaney, Amir Zeldes

    Abstract: We investigate the effect of various dependency-based word embeddings on distinguishing between functional and domain similarity, word similarity rankings, and two downstream tasks in English. Variations include word embeddings trained using context windows from Stanford and Universal dependencies at several levels of enhancement (ranging from unlabeled, to Enhanced++ dependencies). Results are co…

    Submitted 16 April, 2018; originally announced April 2018.

    Comments: 6 pages; to appear at NAACL-SRW 2018