Language Fairness in Multilingual Information Retrieval

Eugene Yang (Johns Hopkins University, Baltimore, Maryland, USA), Thomas Jänich (University of Glasgow, Glasgow, United Kingdom), James Mayfield (Johns Hopkins University, Baltimore, Maryland, USA), and Dawn Lawrie (Johns Hopkins University, Baltimore, Maryland, USA)
(2024)
Abstract.

Multilingual information retrieval (MLIR) considers the problem of ranking documents in several languages for a query expressed in a language that may differ from any of those languages. Recent work has observed that approaches such as combining ranked lists representing a single document language each or using multilingual pretrained language models demonstrate a preference for one language over others. This results in systematic unfair treatment of documents in different languages. This work proposes a language fairness metric to evaluate whether documents across different languages are fairly ranked through statistical equivalence testing using the Kruskal-Wallis test. In contrast to most prior work in group fairness, we do not consider any language to be an unprotected group. Thus our proposed measure, PEER (Probability of Equal Expected Rank), is the first fairness metric specifically designed to capture the language fairness of MLIR systems. We demonstrate the behavior of PEER on artificial ranked lists. We also evaluate real MLIR systems on two publicly available benchmarks and show that the PEER scores align with prior analytical findings on MLIR fairness. Our implementation is compatible with ir-measures and is available at http://github.com/hltcoe/peer_measure.

language fairness, multilingual retrieval, statistical testing
Published in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA. Copyright: rights retained. DOI: 10.1145/3626772.3657943. ISBN: 979-8-4007-0431-4/24/07.
CCS Concepts: Information systems → Multilingual and cross-lingual retrieval; Information systems → Evaluation of retrieval results.

1. Introduction

Multilingual information retrieval searches a multilingual document collection and creates a unified ranked list for a given query (Braschler, 2001, 2002, 2003; Lawrie et al., 2023b; Rahimi et al., 2015; Huang et al., 2023). In tasks like navigational search (Broder, 2002), known item retrieval (Lee et al., 2006; Arguello et al., 2021), and retrieval for question-answering (Chen et al., 2017; Kwon et al., 2018), the user only needs a handful or just one relevant document to satisfy the information need, and the language of that document does not matter. In contrast, users interested in gaining a broad problem understanding prefer seeing how coverage varies across languages.

Analysis of retrieval results has shown that MLIR systems often show a preference for certain languages (Lawrie et al., 2023b; Huang et al., 2023); we call this the MLIR Fairness problem. Such preference can bias a user’s understanding of the topic (White, 2013; Schulz-Hardt et al., 2000). This problem is particularly apparent in models built on top of multilingual pretrained language models (mPLM) (Lawrie et al., 2023b), which inherit bias from the text used to build them (Choudhury and Deshpande, 2021; Kassner et al., 2021). This paper presents a new metric to allow quantitative study of MLIR Fairness.

Prior work in fairness evaluation focuses on either individual or group fairness (Zehlike et al., 2021). Individual fairness ensures that similar documents receive similar treatment; this often corresponds to a Lipschitz condition (Biega et al., 2018; Dwork et al., 2012). Group fairness ensures that a protected group receives treatment at least as favorable as unprotected groups (Zehlike et al., 2017, 2022; Sapiezynski et al., 2019). Group fairness metrics designed for protecting specific groups are not directly applicable to the MLIR fairness problem because the latter has no protected language; we want all languages to be treated equally in a ranked list.

To operationalize our notion of MLIR fairness, we propose the Probability of Equal Expected Rank (PEER) metric. By adopting the Kruskal-Wallis $H$ test, which is a rank-based, non-parametric variance analysis for multiple groups, we measure the probability that documents of a given relevance level for a query are expected to rank at the same position irrespective of language. We compare PEER to previous fairness metrics, and show its effectiveness on synthetic patterned data, on synthetic assignment of language to real retrieval ranked lists, and on system output for the CLEF 2003 and NeuCLIR 2022 MLIR benchmarks.

2. Related Work

There is no universally accepted definition of fairness. This paper views languages as groups within a ranking, and characterizes MLIR Fairness as a group fairness problem.

Existing group fairness metrics fall into two categories: those that assess fairness independent of relevance, and those that take relevance into account. Ranked group fairness, based on statistical parity proposed by Zehlike et al. (2017, 2022), demands equitable representation of protected groups in ranking without explicitly considering relevance through statistical testing.

Attention Weighted Ranked Fairness (AWRF), introduced by Sapiezynski et al. (2019), compares group exposure at certain rank cutoffs against a pre-defined target distribution. It uses the same distribution for both relevant and nonrelevant documents. This means for example that if utility is defined as finding the most relevant documents, a system can gain utility by including more documents from the language with the most relevant documents early in the rankings. In doing so, more nonrelevant documents from that language are placed above relevant documents from the other languages. From a fairness perspective, this should be penalized as unfair. Our proposed metric does not rely on a target distribution, so it does not suffer from this utility/fairness tradeoff.

Among metrics that incorporate relevance, Singh and Joachims (2018) introduced the Disparate Treatment Ratio, which measures the equality of exposure of two groups. This metric is not well suited to MLIR, though, since it handles only two groups. Adjacent to fairness, Clarke et al. (2008) extended Normalized Discounted Cumulative Gain (nDCG) to incorporate diversity. Their metric, $\alpha$-nDCG, assigns document weights based on both relevance and diversity. Diversity, though, applies to user utility, whereas fairness applies to documents (in our case, the languages of the returned documents) (Castillo, 2019). We nonetheless report $\alpha$-nDCG to contextualize our results.

Related work on fairness over sequences of rankings (Morik et al., 2020; Diaz et al., 2020) requires both more evidence and distributional assumptions compared to fairness of a specific ranking. While similar, our method assumes the position of each document is a random variable.

3. Probability of Equal Expected Rank

In this section, we describe the proposed measure – PEER: Probability of Equal Expected Rank. We first introduce our notation and the fairness principle, followed by forming the statistical hypothesis of the system’s fairness across document languages. Finally, we define PEER as the $p$-value of the statistical test.

Let $d_i \in \mathcal{D}$ be the $i$-th document in the collection $\mathcal{D}$ of size $N$. We define the language that $d_i$ is written in as $l_{d_i} \in \{\mathcal{L}_1, \dots, \mathcal{L}_M\}$. For convenience, we define the set $L_j = \{d_i \mid l_{d_i} = \mathcal{L}_j\}$ to be all documents in language $\mathcal{L}_j$.

For a given query $q \in \mathcal{Q}$, we define the degree of document $d_i$ being relevant (or the relevance grade) to the query $q$ as $y^q_i \in \{\mathcal{R}^{(0)}, \mathcal{R}^{(1)}, \dots, \mathcal{R}^{(K)}\}$, where $\mathcal{R}^{(0)}$ indicates not relevant and $\mathcal{R}^{(K)}$ is the most relevant level, i.e., graded-relevance with $K$ levels. Similarly, we define the set $R^{(q,k)} = \{d_i \mid y^q_i = \mathcal{R}^{(k)}\}$ to be all documents at the $\mathcal{R}^{(k)}$ relevance level. Furthermore, we define the documents in $\mathcal{L}_j$ with relevance level $\mathcal{R}^{(k)}$ for a query $q$ as $D_j^{(q,k)} = L_j \cap R^{(q,k)}$.

In this work, we consider a ranking function $\pi: \mathcal{D} \times \mathcal{Q} \rightarrow [1 \dots N]$ that produces the rank $r^q_i \in [1 \dots N]$.

3.1. Fairness through Hypothesis Testing

We define MLIR fairness using the following principle: Documents in different languages with the same relevance level, in expectation, should be presented at the same rank. We measure the satisfaction level of this principle by treating it as a testable hypothesis.

For relevance level $\mathcal{R}^{(k)}$, assuming $r^q_i$ is a random variable over $[1 \dots N]$, we implement the principle using the null hypothesis:

$$H_0: \mathbb{E}_{d_i \in D_a^{(q,k)}}\left[r^q_i\right] = \mathbb{E}_{d_j \in D_b^{(q,k)}}\left[r^q_j\right] \quad \forall \mathcal{L}_a \neq \mathcal{L}_b, \tag{1}$$

which is the equivalence of the expected rank among documents in each language with the given relevance level and given query $q$.

Such null hypotheses can be tested with the Kruskal-Wallis $H$ test (K-W test) (Kruskal and Wallis, 1952). The null hypothesis of this test is that all groups have the same mean (i.e., equivalent mean ranks). The K-W test is a non-parametric analogue of the ANOVA F-test: it tests whether the groups (languages in our case) come from the same distribution. Since the K-W test does not assume any particular underlying distribution, it uses the ranking of the data points to make this determination. Prior work such as Zehlike et al. (2022) assumes a binomial distribution for each document over the groups. In contrast, we make no assumption about the distribution of the query-document scores used for ranking and operate directly on ranks, which yields a robust statistical test.

Conceptually, the test statistic $H$ for the K-W test is the ratio between the sum of group rank variance and the total rank variance. Under the null hypothesis, this variance ratio approximately follows a chi-squared distribution; we use its survival function to derive the $p$-value. Specifically, we can express the test statistic $H$ as

$$H = \left(|R^{(q,k)}| - 1\right) \frac{\sum_{j=1}^{M} \left|D_j^{(q,k)}\right| \left(\bar{r}^{q,k}_j - \bar{r}\right)^2}{\sum_{j=1}^{M} \sum_{d_i \in D_j^{(q,k)}} \left(r^q_i - \bar{r}\right)^2} \tag{2}$$

$$\text{where } \bar{r}^{q,k}_j = \frac{1}{|D_j^{(q,k)}|} \sum_{d_i \in D_j^{(q,k)}} r^q_i \tag{3}$$

$$\text{and } \bar{r} = \frac{1}{|R^{(q,k)}|} \sum_{j=1}^{M} \sum_{d_i \in D_j^{(q,k)}} r^q_i \tag{4}$$

for a given query $q$ and relevance level $\mathcal{R}^{(k)}$. Recall that $R^{(q,k)}$ and $D_j^{(q,k)}$ are sets. For each query $q$ and given relevance level, we report the $p$-value of the K-W test, which is the Probability of documents in all languages with given relevance level having Equal Expected Rank, by comparing the $H$ statistic against a chi-squared distribution with $M-1$ degrees of freedom. The $p$-value provides us with the probability that documents in different languages are ranked fairly within a given relevance level. We denote the $p$-value for a given query $q$ and a relevance level $\mathcal{R}^{(k)}$ as $p^{(q,k)}$.
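As a concrete illustration of Equations 2 to 4, the following Python sketch computes $H$ and the corresponding $p$-value for one query and one relevance level. It is not the released implementation: the names `chi2_sf`, `kw_h`, and `peer_p_value` are hypothetical, the degrees of freedom are taken from the number of languages that have at least one document at the level, and degenerate inputs (fewer than two such languages, or all ranks tied) are assumed to yield a $p$-value of 1.0.

```python
import math


def chi2_sf(x, df):
    """Survival function of a chi-squared distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Recurrence Q(a + 1, z) = Q(a, z) + z^a e^{-z} / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def kw_h(groups):
    """Equation 2 applied to {language: [ranks at one relevance level]}."""
    n = sum(len(r) for r in groups.values())
    grand_mean = sum(sum(r) for r in groups.values()) / n
    between = sum(
        len(r) * (sum(r) / len(r) - grand_mean) ** 2 for r in groups.values()
    )
    total = sum((x - grand_mean) ** 2 for r in groups.values() for x in r)
    return (n - 1) * between / total


def peer_p_value(groups):
    """p-value of H against chi-squared with (non-empty groups - 1) dof."""
    groups = {lang: r for lang, r in groups.items() if r}
    ranks = [x for r in groups.values() for x in r]
    if len(groups) < 2 or len(set(ranks)) < 2:
        return 1.0  # assumption: nothing to compare, so treated as fair
    return chi2_sf(kw_h(groups), len(groups) - 1)


segregated = {"lang_a": [1, 2, 3], "lang_b": [4, 5, 6]}
interleaved = {"lang_a": [1, 3, 5], "lang_b": [2, 4, 6]}
print(peer_p_value(segregated), peer_p_value(interleaved))
```

For these toy lists, $H = 27/7$ when the two languages are fully separated and $H = 3/7$ when they alternate, so the separated list receives the smaller $p$-value (roughly 0.05 versus roughly 0.51).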

Our fairness notion is similar to the one proposed by Diaz et al. (2020). However, we operationalize the principle by treating each document as a sample from a distribution given the language and relevance level, instead of assuming the entire ranked list is a sample from all possible document permutations.

Figure 1. Ranked lists with different fairness patterns between two languages and binary relevance.

3.2. Fairness at Each Relevance Level

The impact of unfairly ranking documents in different languages may differ at each relevance level. Such differences can be linked to a specific user model or application. For example, for an analyst actively seeking information for which each language provides different aspects, ranking nonrelevant documents of a particular language at the top does not degrade fairness; finding disproportionately fewer relevant documents in a certain language, on the other hand, may yield biased analytical conclusions.

In contrast, for a user seeking answers to a specific question who views the language as just the content carrier, reading more nonrelevant documents from a language may degrade that language’s credibility, leading the user eventually to ignore all content in that language. In this case, we do consider the language fairness of the ranking of nonrelevant documents; in the former case we do not.

To accommodate different user models, we define the PEER score as a linear combination of the $p$-value of each relevance level $\mathcal{R}^{(k)}$. Let $w^{(k)} \in [0,1]$ be the weights with $\sum_{k=1}^{K} w^{(k)} = 1$; the overall weighted PEER for query $q$ is

$$\mathit{PEER}^{(q)} = \sum_{k=1}^{K} w^{(k)} p^{(q,k)} \tag{5}$$
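The weighted combination in Equation 5 can be sketched in a few lines of Python. The function name `weighted_peer`, the dictionary layout, and the $p$-values below are illustrative assumptions; in particular, keying the weights by relevance level and allowing a (possibly zero) weight for the nonrelevant level is a choice made for this sketch so that both user models described above can be expressed.

```python
def weighted_peer(p_values, weights):
    """Combine per-level p-values as in Equation 5.

    p_values: {relevance level: p-value}
    weights:  {relevance level: weight in [0, 1]}, summing to 1
    """
    if any(w < 0 or w > 1 for w in weights.values()):
        raise ValueError("weights must lie in [0, 1]")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Levels with zero weight are skipped, so their p-values may be absent.
    return sum(w * p_values[k] for k, w in weights.items() if w > 0)


# Made-up p-values for levels 0 (nonrelevant), 1, and 2, for illustration only.
p_values = {0: 0.10, 1: 0.40, 2: 0.80}

# Analyst model: fairness among nonrelevant documents is ignored.
analyst = weighted_peer(p_values, {0: 0.0, 1: 0.5, 2: 0.5})
# Question-answering model: nonrelevant documents also count.
question_answering = weighted_peer(p_values, {0: 0.5, 1: 0.25, 2: 0.25})
print(analyst, question_answering)
```

With these made-up inputs the analyst weighting gives about 0.6 and the question-answering weighting about 0.35, showing how the same per-level $p$-values lead to different scores under different user models.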

3.3. Rank Cutoff and Aggregation

While a ranking function $\pi$ ranks each document in collection $\mathcal{D}$, in practice, a user only examines results up to a certain cutoff. Some IR effectiveness measures consider elaborate browsing models, such as exponentially decreasing attention in rank-biased precision (RBP) (Moffat and Zobel, 2008) or patience-based attention in expected reciprocal rank (ERR) (Chapelle et al., 2009); user behavior, though, is orthogonal to language fairness, so we consider only a simple cutoff model.

With a rank cutoff $X$, we treat only the top-$X$ documents as the sampling universe for the K-W test. However, since disproportionately omitting documents of a certain relevance level is still considered unfair, before conducting the hypothesis test, we concatenate unretrieved documents (or those ranked below the cutoff) at that relevance level to the ranked list, assigning them a tied rank of $X+1$. This is optimistic, since these documents might rank lower in the actual ranked list. However, this provides a robust penalty for any ranking model that provides only a truncated ranked list. We define the $p$-value as 1.0 when no document is retrieved at a given relevance level in spite of their presence in the collection; from the user perspective, no document at that level is presented, so it is fair (albeit ineffective) across languages. We denote the weighted $p$-value calculated on the top-$X$ documents as $\mathit{PEER}^{(q)}@X$.
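The cutoff handling described above can be sketched as a small preprocessing step that produces the per-language rank lists fed to the hypothesis test. The function name `ranks_at_cutoff`, its arguments, and the toy document identifiers are hypothetical and chosen only for illustration.

```python
def ranks_at_cutoff(ranking, level_docs, doc_lang, cutoff):
    """Group the ranks of one relevance level by language under a cutoff.

    ranking:    document ids in ranked order (rank 1 first)
    level_docs: ids of all documents at the relevance level of interest
    doc_lang:   {document id: language}
    Returns ({language: [ranks]}, whether any such document is in the top X).
    """
    position = {doc: i + 1 for i, doc in enumerate(ranking)}
    groups = {}
    retrieved_any = False
    for doc in sorted(level_docs):
        rank = position.get(doc, cutoff + 1)
        if rank <= cutoff:
            retrieved_any = True
        else:
            # Unretrieved or below the cutoff: tied rank of X + 1.
            rank = cutoff + 1
        groups.setdefault(doc_lang[doc], []).append(rank)
    return groups, retrieved_any


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}  # d9 was never retrieved
doc_lang = {"d1": "zho", "d3": "fas", "d9": "rus"}
groups, retrieved_any = ranks_at_cutoff(ranking, relevant, doc_lang, cutoff=2)
print(groups, retrieved_any)
# {'zho': [1], 'fas': [3], 'rus': [3]} True
```

When `retrieved_any` is false, the $p$-value for that level is set to 1.0 by definition rather than computed from the tied ranks.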

Overall, we report the average weighted PEER over all queries at rank $X$, i.e., $\mathit{PEER}@X = |\mathcal{Q}|^{-1} \sum_{q \in \mathcal{Q}} \mathit{PEER}^{(q)}@X$. Since we treat each document as a random variable of position in a ranked list, what we are measuring is how likely a system is fair between languages instead of how fair each ranked list is. A higher PEER score indicates that the measured MLIR system is more likely to place documents written in different languages but with the same relevance level at similar ranks.

4. Experiments and Results

Table 1. Effectiveness and fairness results. Both AWRF and PEER exclude nonrelevant documents: they are removed from the AWRF calculation, and for PEER their importance weights are set to 0.

| Collection | System | nDCG@20 | $\alpha$-nDCG@20 | AWRF@20 | PEER@20 | Recall@1000 | $\alpha$-nDCG@1000 | AWRF@1000 | PEER@1000 |
|---|---|---|---|---|---|---|---|---|---|
| CLEF 2003 | QT >> BM25 | 0.473 | 0.444 | 0.513 | 0.239 | 0.743 | 0.579 | 0.788 | 0.202 |
| CLEF 2003 | DT >> BM25 | 0.636 | 0.640 | 0.623 | 0.243 | 0.857 | 0.747 | 0.895 | 0.299 |
| CLEF 2003 | DT >> ColBERT | 0.669 | 0.674 | 0.658 | 0.293 | 0.889 | 0.768 | 0.904 | 0.328 |
| CLEF 2003 | ColBERT-X ET | 0.591 | 0.592 | 0.610 | 0.215 | 0.802 | 0.695 | 0.845 | 0.327 |
| CLEF 2003 | ColBERT-X MTT | 0.643 | 0.658 | 0.649 | 0.318 | 0.827 | 0.748 | 0.860 | 0.362 |
| NeuCLIR 2022 | QT >> BM25 | 0.305 | 0.447 | 0.537 | 0.453 | 0.557 | 0.569 | 0.752 | 0.383 |
| NeuCLIR 2022 | DT >> BM25 | 0.338 | 0.448 | 0.542 | 0.497 | 0.633 | 0.580 | 0.809 | 0.421 |
| NeuCLIR 2022 | DT >> ColBERT | 0.403 | 0.539 | 0.635 | 0.449 | 0.708 | 0.652 | 0.842 | 0.426 |
| NeuCLIR 2022 | ColBERT-X ET | 0.299 | 0.447 | 0.578 | 0.458 | 0.487 | 0.561 | 0.745 | 0.421 |
| NeuCLIR 2022 | ColBERT-X MTT | 0.375 | 0.545 | 0.621 | 0.425 | 0.612 | 0.644 | 0.786 | 0.386 |

4.1. Synthetic Data

To demonstrate PEER behavior, we create ranked lists of two languages and binary relevance with four methods, each creating lists from very unfair to very fair. Results are illustrated in Figure 1. Shifting starts with all documents in one language ranking higher than those in the other, and slowly interleaves them until the two are alternating. Figure 1(a) shows for fifty documents that when no documents are interleaved (left), fairness is low, with PEER close to 0. As alternation increases, the PEER score increases. In Moving Single, the ranked list consists entirely of one language except for one document. That single document moves from the top (unfair) to the middle of the ranking (fair). In Figure 1(b) with 99 majority language documents, the PEER scores increase as the singleton moves from the top to the middle. Figure 1(c) shows Interleaving, in which the languages alternate and the number of retrieved documents slowly increases. With odd lengths the highest and the lowest ranked documents are in the same language, giving them the same average and 1.0 PEER scores. With even lengths, one language has a slightly higher rank than the other. The difference shrinks relative to the list length as the ranked list grows, resulting in increased PEER scores. In Increasing Length, 100 retrieved documents comprise first an alternating section followed by all documents in a single language, and the size of the alternating section is gradually increased. This is similar to Shifting, but with overlapping languages at the top instead of in the middle of the ranked list. At the left of Figure 1(d) only the document at rank 1 is minority language, followed by minority language at ranks 1 and 3, and so on. The right of the graph is identical to the right of Figure 1(a). These four patterns demonstrate that PEER scores match our intuition of fairness between languages. The following sections apply PEER to real ranked lists with synthetic language assignments and to real MLIR systems on two MLIR evaluation collections.
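The odd versus even behavior of the Interleaving pattern can be checked with a short Python sketch. The helper `alternating_means` is a hypothetical name; it only computes the mean rank of each language in a strictly alternating list and does not run the hypothesis test itself.

```python
def alternating_means(length):
    """Mean rank of each language when two languages strictly alternate."""
    ranks = range(1, length + 1)
    first = [r for r in ranks if r % 2 == 1]   # language at ranks 1, 3, 5, ...
    second = [r for r in ranks if r % 2 == 0]  # language at ranks 2, 4, 6, ...
    return sum(first) / len(first), sum(second) / len(second)


for length in (5, 6, 99, 100):
    print(length, alternating_means(length))
# 5 (3.0, 3.0)
# 6 (3.0, 4.0)
# 99 (50.0, 50.0)
# 100 (50.0, 51.0)
```

For odd lengths the two mean ranks coincide, which is why those lists reach a PEER score of 1.0. For even lengths the means differ by exactly one position regardless of length, so the gap becomes small relative to the overall spread of ranks as the list grows.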

4.2. Assigning Languages to a Real Ranked List

We used NeuCLIR’22 runs to create new synthetic runs with relevant documents in the same positions, but with languages assigned to the relevant documents either fairly or unfairly. We randomly selected how many relevant documents would be assigned to each language, and created that many language labels. For each label we drew from a normal distribution with either the same mean for the two languages (fair), or different means (unfair). We assigned a drawn number to each label, sorted the labels by that number, and assigned the labels to the relevant documents in the resulting order. We did the same for the nonrelevant documents, ensuring that each language was assigned at least 45% of those documents.
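A minimal Python sketch of the label-assignment step described above follows. Several details are assumptions not stated in the text: the function name `assign_languages`, the standard deviation `sigma`, the ascending sort order, and the fixed random seed. The sketch covers only the draw-and-sort step for one relevance level and does not enforce the 45% constraint applied to nonrelevant documents.

```python
import random


def assign_languages(counts, means, sigma=1.0, seed=0):
    """Return language labels in the order they are assigned down the list.

    counts: {language: number of labels to create}
    means:  {language: mean of the normal distribution drawn per label}
    """
    rng = random.Random(seed)
    draws = [
        (rng.gauss(means[lang], sigma), lang)
        for lang in sorted(counts)
        for _ in range(counts[lang])
    ]
    # Assumption: labels with smaller drawn numbers go to earlier positions.
    draws.sort(key=lambda pair: pair[0])
    return [lang for _, lang in draws]


# Same sampling mean for both languages: labels end up mixed.
fair = assign_languages({"A": 5, "B": 5}, {"A": 1.0, "B": 1.0})
# Very different sampling means: one language occupies all early positions.
unfair = assign_languages({"A": 5, "B": 5}, {"A": 1.0, "B": 50.0})
print(fair)
print(unfair)
```

The returned labels would then be attached, in order, to the relevant (or nonrelevant) documents of an existing run, leaving the positions of those documents unchanged.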

Figures 1(e) and (f) vary the sampling mean of the second language’s relevant and nonrelevant documents, respectively, while keeping a first language sampling mean of 1.0. The figures show that PEER captures fairness independently for each relevance level. Since there are far fewer relevant documents, the evidence for fairness is also weaker, resulting in slower decay when changing the sampling mean for relevant documents.

4.3. Real MLIR Systems

We evaluate five MLIR systems, including query translation (QT) and document translation (DT) with BM25, DT with English ColBERT (Santhanam et al., 2022), and ColBERT-X models trained with English triples (ET) and multilingual translate-train (MTT) (Lawrie et al., 2023b), on CLEF 2003 (German, Spanish, French, and English documents with English queries) and NeuCLIR 2022 (Chinese, Persian, and Russian documents with English queries). For QT, English queries are translated into each document language and monolingual search results from each language are fused by score. We report $\alpha$-nDCG, AWRF (with number of relevant documents as target distribution), and the proposed PEER with rank cutoffs at 20 and 1000. Along with nDCG@20 and Recall@1000, we summarize the results in Table 1.

Logically, merging ranked lists from each monolingual BM25 search with translated queries purely by scores invites unfair treatment, as scores from each language are incompatible with different query lengths and collection statistics (Robertson and Soboroff, 2002). We observed this trend in both PEER and AWRF, while $\alpha$-nDCG strongly correlates with the effectiveness scores and does not distinguish fairness.

Neural MLIR models trained with only English text and transferred zero-shot to MLIR with European languages exhibit a strong language bias compared to those trained with document languages (Lawrie et al., 2023b; Huang et al., 2023). PEER exhibits a similar trend in CLEF 2003, showing ColBERT-X ET is less fair than the MTT counterpart, while AWRF is less sensitive. Lawrie et al. (2023b) show that preference for English documents in the ET model causes this unfair treatment; this suggests that MLIR tasks without the training language (English) in the document collection would not suffer from such discrepancy. In fact, both PEER and AWRF indicate that the MTT model is less fair among the three languages in NeuCLIR 2022, which is likely caused by the quality differences in machine translation (Lawrie et al., 2023a).

AWRF and PEER disagree on the comparison between English ColBERT on translated documents (DT) and ColBERT-X models. While AWRF suggests DT >> ColBERT (the (ET)+ITD setting in Lawrie et al. (2023b)) is fairer than ColBERT-X MTT in CLEF03, DT creates a larger difference among languages (Lawrie et al., 2023b). PEER, in contrast, aligns with prior analysis, giving a lower score to DT >> ColBERT. According to Huang et al. (2023), QT >> BM25 has a similar language bias compared to mDPR (Clark et al., 2020), which was trained with English MS MARCO (Bajaj et al., 2016). PEER suggests a similar conclusion between QT >> BM25 and ColBERT-X ET, while AWRF assigns a larger difference between the two with a rank cutoff of 20.

With a rank cutoff of 1000, AWRF strongly correlates with recall (Pearson $r = 0.93$ over both collections), while PEER does not (Pearson $r = -0.55$). The 0.904 AWRF value (range 0-1) of DT >> ColBERT on CLEF03 suggests a fair system, while the ranked list does not. This strong relationship shows that AWRF, with target distribution being the ratio of relevant documents, is indeed measuring recall instead of fairness. While it is an artifact of the choice of target distribution, the need to define a target distribution reduces the robustness of AWRF in measuring MLIR Fairness.
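The reported correlations can be recomputed from the rank-cutoff-1000 columns of Table 1. The sketch below is a plain-Python Pearson correlation over the ten system and collection pairs; the helper name `pearson` and the list layout are illustrative choices, while the numbers are copied from the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Table 1, cutoff 1000: five CLEF 2003 rows followed by five NeuCLIR 2022 rows.
recall = [0.743, 0.857, 0.889, 0.802, 0.827, 0.557, 0.633, 0.708, 0.487, 0.612]
awrf = [0.788, 0.895, 0.904, 0.845, 0.860, 0.752, 0.809, 0.842, 0.745, 0.786]
peer = [0.202, 0.299, 0.328, 0.327, 0.362, 0.383, 0.421, 0.426, 0.421, 0.386]

print(round(pearson(recall, awrf), 2))
print(round(pearson(recall, peer), 2))
```

Both values should land close to the reported 0.93 and -0.55; small deviations are possible because the table entries are rounded to three decimals.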

5. Summary

We propose measuring the Probability of Equal Expected Rank (PEER) for MLIR fairness. As PEER measures the weighted $p$-value of a non-parametric group hypothesis test, it neither requires a target distribution nor makes distributional assumptions; this makes the metric robust. Through comparison to prior analytical work in MLIR Fairness, we conclude that PEER captures the differences and nuances between systems better than other fairness metrics.

References

  • Arguello et al. (2021) Jaime Arguello, Adam Ferguson, Emery Fine, Bhaskar Mitra, Hamed Zamani, and Fernando Diaz. 2021. Tip of the tongue known-item retrieval: A case study in movie identification. In Proceedings of the 2021 Conference on Human Information Interaction and Retrieval. 5–14.
  • Bajaj et al. (2016) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016).
  • Biega et al. (2018) Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. In Proc. SIGIR. 405–414.
  • Braschler (2001) Martin Braschler. 2001. CLEF 2001—Overview of Results. In Workshop of the Cross-Language Evaluation Forum for European Languages. Springer, 9–26.
  • Braschler (2002) Martin Braschler. 2002. CLEF 2002—Overview of results. In Workshop of the Cross-Language Evaluation Forum for European Languages. Springer, 9–27.
  • Braschler (2003) Martin Braschler. 2003. CLEF 2003–Overview of results. In Workshop of the Cross-Language Evaluation Forum for European Languages. Springer, 44–63.
  • Broder (2002) Andrei Broder. 2002. A taxonomy of web search. In ACM SIGIR Forum, Vol. 36. ACM New York, NY, USA, 3–10.
  • Castillo (2019) Carlos Castillo. 2019. Fairness and transparency in ranking. In ACM SIGIR Forum, Vol. 52. ACM New York, NY, USA, 64–71.
  • Chapelle et al. (2009) Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management. 621–630.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017. Association for Computational Linguistics (ACL), 1870–1879.
  • Choudhury and Deshpande (2021) Monojit Choudhury and Amit Deshpande. 2021. How Linguistically Fair Are Multilingual Pre-Trained Language Models?. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 12710–12718.
  • Clark et al. (2020) Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics (2020).
  • Clarke et al. (2008) Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR.
  • Diaz et al. (2020) Fernando Diaz, Bhaskar Mitra, Michael D Ekstrand, Asia J Biega, and Ben Carterette. 2020. Evaluating stochastic rankings with expected exposure. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 275–284.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference. 214–226.
  • Huang et al. (2023) Zhiqi Huang, Hansi Zeng, Hamed Zamani, and James Allan. 2023. Soft Prompt Decoding for Multilingual Dense Retrieval. arXiv preprint arXiv:2305.09025 (2023).
  • Kassner et al. (2021) Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models. arXiv preprint arXiv:2102.00894 (2021).
  • Kruskal and Wallis (1952) William H Kruskal and W Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47, 260 (1952), 583–621.
  • Kwon et al. (2018) Heeyoung Kwon, Harsh Trivedi, Peter Jansen, Mihai Surdeanu, and Niranjan Balasubramanian. 2018. Controlling information aggregation for complex question answering. In Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings 40. Springer, 750–757.
  • Lawrie et al. (2023a) Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. 2023a. Overview of the TREC 2022 NeuCLIR Track. In Proceedings of the 31st Text REtrieval Conference (Gaithersburg, Maryland). https://arxiv.longhoe.net/abs/2304.12367
  • Lawrie et al. (2023b) Dawn Lawrie, Eugene Yang, Douglas W Oard, and James Mayfield. 2023b. Neural Approaches to Multilingual Information Retrieval. In European Conference on Information Retrieval. Springer, 521–536.
  • Lee et al. (2006) Jin Ha Lee, Allen Renear, and Linda C Smith. 2006. Known-item search: Variations on a concept. Proceedings of the American Society for Information Science and Technology 43, 1 (2006), 1–17.
  • Moffat and Zobel (2008) Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems (TOIS) 27, 1 (2008), 1–27.
  • Morik et al. (2020) Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. Controlling fairness and bias in dynamic learning-to-rank. In Proc. SIGIR. 429–438.
  • Rahimi et al. (2015) Razieh Rahimi, Azadeh Shakery, and Irwin King. 2015. Multilingual information retrieval in the language modeling framework. Information Retrieval Journal 18, 3 (2015), 246–281.
  • Robertson and Soboroff (2002) Stephen E Robertson and Ian Soboroff. 2002. The TREC 2002 Filtering Track Report.. In Proceedings of the Eleventh Text Retrieval Conference.
  • Santhanam et al. (2022) Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 3715–3734.
  • Sapiezynski et al. (2019) Piotr Sapiezynski, Wesley Zeng, Ronald E Robertson, Alan Mislove, and Christo Wilson. 2019. Quantifying the impact of user attention on fair group representation in ranked lists. In Companion Proceedings of WWW. 553–562.
  • Schulz-Hardt et al. (2000) Stefan Schulz-Hardt, Dieter Frey, Carsten Lüthgens, and Serge Moscovici. 2000. Biased information search in group decision making. Journal of Personality and Social Psychology 78, 4 (2000), 655.
  • Singh and Joachims (2018) Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In Proc. KDD.
  • White (2013) Ryen White. 2013. Beliefs and biases in web search. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 3–12.
  • Zehlike et al. (2017) Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. 2017. FA*IR: A fair top-k ranking algorithm. In Proc. of CIKM.
  • Zehlike et al. (2022) Meike Zehlike, Tom Sühr, Ricardo Baeza-Yates, Francesco Bonchi, Carlos Castillo, and Sara Hajian. 2022. Fair Top-k Ranking with multiple protected groups. Information Processing & Management 59, 1 (2022), 102707.
  • Zehlike et al. (2021) Meike Zehlike, Ke Yang, and Julia Stoyanovich. 2021. Fairness in ranking: A survey. arXiv preprint arXiv:2103.14000 (2021).