HC3: A Suite of Test Collections for CLIR Evaluation over Informal Text

Dawn Lawrie (HLTCOE, Johns Hopkins University, Baltimore, MD, USA); James Mayfield (HLTCOE, Johns Hopkins University, Baltimore, MD, USA); Douglas W. Oard (University of Maryland, College Park, MD, USA); Eugene Yang (HLTCOE, Johns Hopkins University, Baltimore, MD, USA); Suraj Nair (University of Maryland, College Park, MD, USA); and Petra Galuščáková (Univ. Grenoble Alpes, CNRS, Grenoble INP*, LIG, Grenoble, France)

(2023)

Abstract.

While there are many test collections for Cross-Language Information Retrieval (CLIR), none of the large public test collections focus on short informal text documents. This paper introduces a new pair of CLIR test collections with millions of Chinese or Persian Tweets or Tweet threads as documents, sixty event-motivated topics written both in English and in each of the two document languages, and three-point graded relevance judgments constructed using interactive search and active learning. The design and construction of these new test collections are described, and baseline results are presented that demonstrate the utility of the collections for system evaluation. Shallow pooling is used to assess the efficacy of active learning to select documents for judgment.

Test Collection, Cross-Language Information Retrieval, CLIR, Evaluation, Tweet-based documents

*Institute of Engineering Univ. Grenoble Alpes

Journal year: 2023. Copyright: rights retained. Conference: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. DOI: 10.1145/3539618.3591893. ISBN: 978-1-4503-9408-6/23/07. CCS concepts: Information systems → Information retrieval; Information systems → Evaluation of retrieval results; Information systems → Test collections; Information systems → Multilingual and cross-lingual retrieval; Information systems → Relevance assessment.

1. Introduction

Ranked retrieval using pretrained language models (PLMs) has shown great promise on research test collections (dloverview2022). Two main architectures have emerged: cross-encoders and bi-encoders (lin2022pretrained). Cross-encoders are generally used as rerankers because they must process the query and the document passages together. Often a lexical matcher like BM25 (manning:Intro2IR) is used to retrieve the documents to be reranked, meaning that while a cross-encoder can match queries and documents semantically, it will never be presented with documents that lack an exact query string match.¹

¹ Of course, the query for a lexical matcher can result from automatic query rewriting and thus need not be the same as the query presented to the cross-encoder.
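The retrieve-then-rerank limitation described above can be illustrated with a minimal sketch. Everything here is a stand-in chosen for illustration: `lexical_candidates` is a toy term-overlap matcher rather than BM25, and `score_pair` is a hypothetical callable playing the role of a cross-encoder. The only point is that a document sharing no term with the query never reaches the second stage, however well the reranker might have scored it.

```python
from typing import Callable, Dict, List, Tuple


def lexical_candidates(query: str, docs: Dict[str, str], k: int) -> List[str]:
    """Toy first stage (a stand-in for BM25, not BM25 itself).

    Ranks documents by the number of distinct query terms they contain and
    drops every document that shares no term with the query.
    """
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def rerank(
    query: str,
    candidates: List[str],
    docs: Dict[str, str],
    score_pair: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Second stage: score each (query, document) pair jointly.

    `score_pair` is a hypothetical placeholder for a cross-encoder.
    """
    scored = [(doc_id, score_pair(query, docs[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: -pair[1])
```

For example, with documents `"car repair manual"`, `"automobile maintenance guide"`, and `"cooking with car batteries"` and the query `"car repair"`, the second document is filtered out by the first stage because it contains neither query term, so the reranker is never asked to score it.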

A bi-encoder, by contrast, encodes document passages separately from the query, enabling encoding in an offline indexing phase using GPUs. At query time, only the query needs to be encoded, which, given its short length, can be done quickly using a CPU. A bi-encoder’s dense representations can rank documents that are semantically similar to the query, even without exact string matches. Moreover, if the bi-encoder’s token representations are built from a multilingual PLM, documents in languages other than that of the query can also be ranked. This architecture could enable multilingual retrieval on the web. Bi-encoders thus offer the promise of flexible and effective first-stage ranking.

Because of encoder input-length limitations, bi-encoders normally break documents into passages; a useful heuristic is to use the highest passage score as the document score (dai2019deeper). A bi-encoder can encode a passage using one or many vectors. The single-vector approach, for which the present state of the art is Contriever (izacard2021unsupervised), is efficient and effective in monolingual tasks. The query is also represented as a single vector, and passages are ranked by comparing the query vector to each passage vector. But in multilingual tasks such as Cross-Language Information Retrieval (CLIR), the single-vector approach is outperformed by ColBERT (colbert), our focus in this paper, in which each token is represented by a dense vector. At search time each query token is represented as a vector, and passages are ranked based on the passage tokens that are closest to each query token. ColBERT is currently the state of the art for full-collection (i.e., end-to-end) CLIR (neucliroverview2022) and Multilingual Information Retrieval (MLIR) (lawrie2023neural). PLAID (plaid), a space-efficient implementation of ColBERT, is thus an obvious architecture to consider for a high-volume multilingual document stream.
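The two scoring schemes contrasted above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the ColBERT or PLAID implementation: vectors are assumed to be precomputed by some encoder (not shown), similarity is taken to be a dot product, and all function names are hypothetical. The multi-vector score follows ColBERT's late-interaction idea of summing, over query tokens, the similarity of the closest passage token.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_doc_score(query_vec: Vector, passage_vecs: Sequence[Vector]) -> float:
    """Single-vector bi-encoder with the max-passage heuristic.

    Each passage is one vector; the document score is the best passage score.
    """
    return max(dot(query_vec, p) for p in passage_vecs)


def maxsim_passage_score(
    query_tokens: Sequence[Vector], passage_tokens: Sequence[Vector]
) -> float:
    """Multi-vector (late-interaction) passage score.

    For each query token vector, find the closest passage token vector,
    then sum those per-token maxima.
    """
    return sum(max(dot(q, t) for t in passage_tokens) for q in query_tokens)


def multi_vector_doc_score(
    query_tokens: Sequence[Vector], passages: Sequence[Sequence[Vector]]
) -> float:
    """Apply the same max-passage heuristic on top of per-passage MaxSim."""
    return max(maxsim_passage_score(query_tokens, p) for p in passages)
```

The practical difference is storage and matching granularity: the single-vector variant keeps one vector per passage, while the multi-vector variant keeps one vector per token, which is what makes a space-efficient index important at scale.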

PLAID was designed for batch settings, because it needs access to all (or nearly all) the documents at the start of indexing. That is impractical in streaming settings, where documents are introduced over time. The key blocker to incremental indexing for streaming in the PLAID architecture is its reliance on cluster centroids for term representation. As the vocabulary in the new documents moves away from that in the documents from which the cluster centroids were built, performance degrades precipitously. This paper proposes PLAID SHIRTTT (PLAID Streaming Hierarchical Indexing that Runs on Terabytes of Temporal Text), which uses hierarchical sharding to adapt PLAID to a streaming setting. Its main contributions include the architecture (PLAID SHIRTTT),² evaluation that demonstrates its effectiveness in both monolingual and multilingual settings, and the first known application of a ColBERT variant to a terabyte collection (ClueWeb09 (callan2009clueweb09)).

² Code available at https://github.com/hltcoe/colbert-x.
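Why stale centroids hurt can be seen in a small sketch. It does not reproduce the architecture proposed here or PLAID's actual index format; it only assumes that each token vector is represented relative to its nearest centroid, so the distance to that centroid (the residual) is a rough proxy for how well the fixed centroids still describe incoming text. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid(vec: Vector, centroids: Sequence[Vector]) -> Tuple[int, float]:
    """Return (index, Euclidean distance) of the centroid closest to `vec`."""
    best_id, best_dist = -1, math.inf
    for i, c in enumerate(centroids):
        d = math.dist(vec, c)
        if d < best_dist:
            best_id, best_dist = i, d
    return best_id, best_dist


def mean_residual(vectors: Sequence[Vector], centroids: Sequence[Vector]) -> float:
    """Average distance from each vector to its nearest centroid.

    A growing value suggests the centroids no longer fit the data well.
    """
    return sum(nearest_centroid(v, centroids)[1] for v in vectors) / len(vectors)


# Centroids fit to an early batch of token vectors (toy values).
centroids: List[Vector] = [[0.0, 0.0], [1.0, 1.0]]
early_batch: List[Vector] = [[0.1, 0.0], [1.0, 0.9]]
# A later batch whose vectors have drifted away from both centroids.
later_batch: List[Vector] = [[3.0, 3.0], [4.0, 4.0]]

drift_detected = mean_residual(later_batch, centroids) > mean_residual(early_batch, centroids)
```

In this toy setting the later batch sits far from both centroids, so its mean residual is larger than that of the early batch. A streaming index has to cope with this kind of drift without rebuilding everything from scratch.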