Structured Conformal Inference for Matrix Completion with Applications to Group Recommender Systems

Ziyi Liang, Tianmin Xie, Xin Tong$^{2}$, Matteo Sesia$^{2}$

$^{1}$Department of Mathematics, University of Southern California, Los Angeles, CA, USA. $^{2}$Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA. The first two authors contributed equally to this work.

Abstract

We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which make inferences for one individual at a time, our method achieves stronger group-level guarantees by carefully assembling a structured calibration data set mimicking the patterns expected among the test group of interest. We propose a generalized weighted conformalization framework to deal with the lack of exchangeability arising from such structured calibration, and in this process we introduce several innovations to overcome computational challenges. The practicality and effectiveness of our method are demonstrated through extensive numerical experiments and an analysis of the MovieLens 100K data set.

Keywords: Collaborative filtering; confidence regions; conformal inference; exchangeability; Laplace’s method; simultaneous inference.

1 Introduction

1.1 Background and Motivation

Many data-driven decision problems require a simultaneous understanding of several related unknowns, motivating the use of joint confidence (or prediction) regions. This explains why the concept of joint (or simultaneous) inference has a rich history, with roots tracing back to seminal works by \citetscheffe1953method and others \citeproy1953simultaneous,goodman1965simultaneous. Despite its established roots in classical statistics, this topic remains relatively under-explored in the context of distribution-free predictive inference, or conformal inference \citepvovk2005algorithmic,lei2018distribution. In fact, conformal methods have so far primarily focused on predicting individual outcomes one by one, typically under a data exchangeability assumption. As we shall see, this traditional approach can make it hard to aggregate individual inferences efficiently. This is especially a challenge in situations where the goal is to simultaneously predict several quantities exhibiting potentially complicated dependencies, and a straightforward Bonferroni correction \citepvovk2015transductive applied to individual-level predictions would be too conservative. We begin to fill this gap in the literature by developing a novel simultaneous conformal inference \citepvovk2005algorithmic method for matrix completion, practically motivated by collaborative filtering for group recommendations.

Group recommender algorithms provide suggestions for collective decisions in diverse contexts, such as movies \citepquijano2011happy, music \citepmccarthy1998musicfx, travel \citepherzog2018tourrec, restaurants \citepmccarthy2002pocket, and hiring \citepbaskin2009preference. For example, think of a group of friends faced with the daunting task of selecting a movie that caters to everyone’s preferences. By utilizing data on each individual’s past streaming history and previously indicated movie preferences, a group recommender system can streamline the decision-making process, enhancing the likelihood of an inclusive and gratifying movie-watching experience for all involved. This explains why the topic is gaining increased traction \citepjameson2007recommendation,felfernig2018group,dara2020survey, fueled by increasing usage of mobile devices and social networks that connect users and gather relevant data.

While previous research on recommender systems concentrated on algorithmic aspects, uncertainty estimation is an essential ingredient in principled decision-making \citepzukerman2001predictive,adomavicius2007towards,zhang2016prediction,himabindu2018conformal,coscrato2023estimating. Its significance is underscored by the inherent limitations of inferences based on sparse data, subjective ratings, diverse user behaviors, and potential inaccuracies due to noise \citeplam2006you. In fact, uncertainty estimation can enhance transparency \citepherlocker2000explaining and plays a crucial role in identifying flawed predictions and reinforcing user trust \citepmcnee2003confidence. Moreover, it can provide valuable insights for developing innovative algorithmic strategies \citepprice2005optimal, potentially entailing the exclusion of less confident recommendations. It is also beneficial for internal company processes, such as comparing different algorithms \citepgunawardana2009survey and offering guidance on future data collection requirements \citepchen2021exploration. In the context of group recommendations, uncertainty estimation becomes even more important \citepde2009managing,sacharidis2019modeling,ismailoglu2022aggregating because it can play a critical role in the aggregation of potentially conflicting individual preferences \citepdara2020survey,xiao2017fairness,polylens-grs-2001.

To illustrate this idea, consider a recommendation algorithm that needs to suggest a movie for a group of friends, Alice, Ben, and Chris. All three users are predicted to give 5/5 ratings to “2001: A Space Odyssey”. However, the algorithm is only confident about Alice’s and Ben’s ratings, while Chris’s predicted rating is close to a random guess, due for example to limited data related to his enjoyment of science fiction. Moreover, the algorithm predicts that all three users would give a 4/5 rating to “The Godfather”, and these predictions are all highly confident. Understanding the uncertainty in both cases would facilitate the recommendation process. For instance, it would lead to the selection of “The Godfather” if the system follows the “least misery” principle, which assumes that the group’s overall satisfaction is determined by the least satisfied member \citeppolylens-grs-2001, and to “2001: A Space Odyssey” under the opposite “most pleasure” principle.

With this motivation, we consider the problem of jointly predicting (or estimating) multiple ratings corresponding to different users. We focus on collaborative filtering via matrix completion—the task of approximating the missing entries in a sparsely observed ratings matrix \citepramlatchan2018survey, where rows represent users and columns depict products. If the matrix exhibits certain patterns, such as a low-rank structure, it becomes feasible to impute the missing entries \citeprennie2005fast, candes_exact_2008, candes_noise_2010. The underlying concept is quite intuitive. Individuals often seek recommendations from their peers, placing more trust in suggestions from people whose preferences align closely with their own. Thus, the structure of the preference matrix captures valuable similarities between diverse users and products.

While conformal inference has already been recognized as providing a useful uncertainty estimation framework in the context of collaborative filtering and matrix completion \citephimabindu2018conformal,gui2023conformalized,shao2023distribution, we will explain below that the task of simultaneously estimating uncertainty in a group setting is both novel and particularly challenging, necessitating a creative approach.

1.2 Preview of Our Contributions

We study the problem of constructing a confidence region for $K$ missing entries within the same column of a partially observed matrix, for different values of $K$ . Focusing on groups of entries within the same column is helpful for concreteness, but the main elements of our solution could be adapted to also address different related questions.

Figure 1 previews the performance of our method applied to the MovieLens 100K data \citepmovielens100k, analyzing a ratings matrix with 800 rows (users) and 1000 columns (movies). Approximately 94% of the entries are missing, and the goal is to construct a joint 90% confidence region (in the shape of a hyper-cube) for the unobserved ratings assigned to a random movie by a group of $K$ users. We refer to Section 5.2 for more details on this data analysis. In this context, an unadjusted application of the conformalized matrix completion method of \citetgui2023conformalized would produce individual confidence intervals for one user/movie pair at a time but would not achieve group-level coverage. Therefore, we compare our method to a benchmark that seeks simultaneous group validity by applying a straightforward Bonferroni adjustment to the individual confidence intervals. Our method outperforms this Bonferroni benchmark, producing narrower and hence more informative confidence regions thanks to its ability to automatically adapt to the complex dependency structures arising from these data. We refer to Appendix A1 for a review of the method of \citetgui2023conformalized and further details on the unadjusted and Bonferroni baseline approaches.

Figure 1: Preview of the performance of our conformal method for simultaneous group-level matrix completion on the MovieLens 100K data, as a function of the group size. The results in different columns are obtained using a (convex optimization) matrix completion algorithm based on different hypothesized matrix ranks. The baseline approach applies a Bonferroni correction to the individual-level conformalized matrix completion method of \citetgui2023conformalized. The nominal group-level coverage level is 90%, and our method outputs narrower (more informative) confidence regions.

Our method can be intuitively explained as follows. Conformal inference generally aims to transform the output of a machine learning model into a confidence set with a tunable parameter that controls the margin of error. This parameter is calibrated, using hold-out data assumed to be exchangeable with the test point, to minimize the size of the confidence set while guaranteeing a suitable notion of average coverage. To address our joint estimation problem, we extend the standard conformal strategy by constructing and leveraging a structured calibration sample consisting of groups of user/product observations that mimic the patterns expected at test time.

Yet, it is not easy to translate this high-level idea into a concrete method. The main obstacle is that our structured calibration data are inconsistent with exchangeability, preventing the use of standard conformal techniques. Our solution draws inspiration from the weighted exchangeability framework of \citettibshirani-covariate-shift-2019, which was developed to address a different covariate shift problem. However, this approach gives rise to significant computational challenges in our setting, as it involves summing over exponentially many permutations to account for the lack of exchangeability. We overcome these challenges by fusing the Gumbel-max trick \citepgumbel1954statistical with a new extension of the classical Laplace method \citeplaplace1774memoire. The Gumbel-max trick converts the intractable sum into a more manageable although still analytically unfeasible integral, and the Laplace method provides an accurate approximation of this integral. Although our integration technique is not exact, we demonstrate its reliability by proving that it is asymptotically consistent in the large-sample limit and then we verify the empirical validity of our results numerically.

1.3 Related Works

While matrix completion algorithms are well-studied \citeprennie2005fast, candes_exact_2008, candes_noise_2010, pmf_2007, bpmf_2008, the problem of quantifying their uncertainty has received attention only more recently. \citetchen_inference_2019,xia_statistical_2021,farias_uncertainty_2022 obtained asymptotic confidence intervals for missing entries estimated using convex optimization algorithms, under some modeling assumptions for the underlying matrix. \citetgui2023conformalized and \citetshao2023distribution proposed alternative approaches based on conformal inference \citepvovk1999machine,vovk2005algorithmic, requiring fewer assumptions and providing finite-sample inferences. Our method builds upon the conformal inference framework, but is not in competition with the parametric techniques of \citetchen_inference_2019,xia_statistical_2021,farias_uncertainty_2022; on the contrary, in the future these opposite points of view could be combined, possibly drawing strength from the convex optimization analysis to obtain even more informative conformal inferences.

\citetgui2023conformalized and \citetshao2023distribution both seek to construct individual confidence intervals for the unobserved entries in a partially observed matrix, one at a time, although each tackles that problem from a different angle. The view of \citetgui2023conformalized is more similar to ours, as they assume the matrix is fixed while the missingness is random. In \citetshao2023distribution, the matrix is random and the missingness is fixed. Our work departs from that of \citetgui2023conformalized as we seek simultaneous confidence regions for structured groups of missing entries; this is a more challenging problem that requires different modeling choices and involves several methodological innovations due to lack of exchangeability.

While this paper focuses on matrix completion, many components of our method are likely to have broader relevance, potentially enabling structured conformal inferences in different contexts. Many prior works focused on regression \citeplei2014distribution,lei2018distribution,romano2019conformalized,sesia2021conformal, classification \citeplei2013distribution,romano2020classification,liang2024icp, or outlier detection \citepsmith2015conformal,bates2021testing,marandon2022machine, but the applicability of conformal inference extends to numerous other tasks, including functional data analysis \citeplei2015conformal, causal inference \citeplei2021conformal, data sketching \citepsesia2023conformal, and forecasting \citepgibbs2021adaptive,zhou2024conformalized; see also \citetangelopoulos2021gentle for a beginner-level review. In general, conformal inference is easier when dealing with exchangeable data, but there have been many efforts to deal with more difficult situations \citeptibshirani-covariate-shift-2019,podkopaev2021distribution,barber_beyond_2022,einbinder2022conformal,sesia2023adaptive. Building on the weighted exchangeability framework introduced by \citettibshirani-covariate-shift-2019, we focus on the challenges arising from structured calibration.

1.4 Outline of the Paper

Section 2 states our assumptions and goals. Section 3 presents our method. Section 4 delves into essential implementation aspects. The empirical performance of our method is investigated in Section 5. Section 6 concludes with a discussion and some ideas for future research. Additional content is in the appendices. Appendix A1 summarizes technical background information and details the baseline approaches. Appendix A2 explains additional implementation details, Appendix A3 summarizes further empirical results, and Appendix A4 contains all mathematical proofs.

2 Setup and Problem Statement

Let $\bm{M}\in\mathbb{R}^{n_{r}\times n_{c}}$ be a fixed matrix with $n_{r}$ rows and $n_{c}$ columns. One may consider that the rows of $\bm{M}$ correspond to users and its columns to products, while each entry $M_{r,c}$ is the rating assigned by user $r\in[n_{r}]:=\{1,\ldots,n_{r}\}$ to product $c\in[n_{c}]:=\{1,\ldots,n_{c}\}$ . Although it is common to assume $\bm{M}$ to have a particular (e.g., low-rank) structure, possibly including some independent additive random noise, we follow a different approach. Inspired by \citetgui2023conformalized, we allow $\bm{M}$ to be deterministic and arbitrary, and we model instead the randomness in the data observation (or missingness) process.

Consider a fixed number of observations, $n_{\mathrm{obs}}$ , and let $\bm{M}_{\bm{X}_{\mathrm{obs}}}$ denote the observed portion of $\bm{M}$ , indexed by $\bm{X}_{\mathrm{obs}}=(X_{1},\ldots,X_{n_{\mathrm{obs}}})$ with each $X_{i}\in[n_{r}]\times[n_{c}]$ . We assume that $\bm{X}_{\mathrm{obs}}$ is randomly sampled without replacement from $[n_{r}]\times[n_{c}]$ . For increased flexibility, we allow this sampling process to be weighted according to parameters $\bm{w}=(w_{r,c})_{(r,c)\in[n_{r}]\times[n_{c}]}$ , which we assume to be known for now. See Appendix A2.4 for details on how these weights may be estimated in practice. In a compact notation, we write our sampling model as:

\begin{equation}
\bm{X}_{\mathrm{obs}}\sim\Psi(n_{\mathrm{obs}},[n_{r}]\times[n_{c}],\bm{w}). \tag{1}
\end{equation}

In general, $\Psi(m,\mathcal{X},\bm{w})$ denotes weighted random sampling without replacement of $m\geq 1$ distinct elements (“balls”) from a finite dictionary (“urn”) denoted as $\mathcal{X}$, with $|\mathcal{X}|\geq m$. More precisely, $\bm{X}\sim\Psi(m,\mathcal{X},\bm{w})$ corresponds to saying that, for any $\bm{x}\in\mathcal{X}^{m}$ with distinct elements,

\[
\mathbb{P}\left[\bm{X}=\bm{x}\right]=\frac{w_{x_{1}}}{\sum_{x\in\mathcal{X}}w_{x}}\cdot\frac{w_{x_{2}}}{\sum_{x\in\mathcal{X}}w_{x}-w_{x_{1}}}\cdot\ldots\cdot\frac{w_{x_{m}}}{\sum_{x\in\mathcal{X}}w_{x}-\sum_{i=1}^{m-1}w_{x_{i}}}.
\]

Therefore, the weights $\bm{w}$ do not need to be normalized. This model is a special case of the multivariate Wallenius’ noncentral hypergeometric distribution \citepwallenius1963biased,chesson1976non, and it reduces to random sampling without replacement if all $w_{r,c}$ are the same.
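As a minimal Python sketch of the sampling model $\Psi$ (an illustration, not the authors' implementation), the following code draws elements one at a time with probability proportional to the weights of the elements still in the urn, and evaluates the product formula above. The function names `sample_psi` and `psi_probability`, the toy $2\times 2$ urn, and the weight values are all assumptions made for this example.

```python
import itertools
import random

def sample_psi(m, urn, weights, rng=random):
    """Draw m distinct elements from `urn`, one at a time, each with
    probability proportional to its weight among the elements still left."""
    remaining = list(urn)
    draws = []
    for _ in range(m):
        w = [weights[x] for x in remaining]
        pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        draws.append(remaining.pop(pick))
    return tuple(draws)

def psi_probability(x, urn, weights):
    """Probability of the ordered draw x under Psi(len(x), urn, weights)."""
    if len(set(x)) != len(x):
        return 0.0  # repeated elements cannot occur without replacement
    total = sum(weights[u] for u in urn)
    prob = 1.0
    for xi in x:
        prob *= weights[xi] / total
        total -= weights[xi]  # remove the drawn element's weight
    return prob

# Toy 2 x 2 matrix: indices are (row, column) pairs; weights are unnormalized.
urn = [(1, 1), (1, 2), (2, 1), (2, 2)]
weights = {(1, 1): 1.0, (1, 2): 2.0, (2, 1): 3.0, (2, 2): 4.0}

# Sum of probabilities over all ordered pairs of distinct indices.
total_mass = sum(psi_probability(x, urn, weights)
                 for x in itertools.permutations(urn, 2))

# Rescaling all weights by a constant should not change any probability.
scaled = {k: 10.0 * v for k, v in weights.items()}
pair = ((1, 1), (1, 2))
p_pair = psi_probability(pair, urn, weights)  # (1/10) * (2/9)
p_pair_scaled = psi_probability(pair, urn, scaled)
```

In this toy example the probabilities of all ordered pairs sum to one, and `p_pair` agrees with `p_pair_scaled` up to rounding, consistent with the remark that $\bm{w}$ need not be normalized.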

We denote as $\mathcal{D}_{\mathrm{obs}}\subset[n_{r}]\times[n_{c}]$ the unordered collection of indices corresponding to the observed matrix entries, and call its complement $\mathcal{D}_{\mathrm{miss}}$ ; i.e.,

\[
\mathcal{D}_{\mathrm{obs}}:=\{X_{1},\ldots,X_{n_{\mathrm{obs}}}\},\qquad\mathcal{D}_{\mathrm{miss}}:=[n_{r}]\times[n_{c}]\setminus\mathcal{D}_{\mathrm{obs}}.
\]

Our goal is to construct a joint confidence region for a group of $K\geq 1$ unobserved matrix entries, indexed by $\bm{X}^{*}=(X^{*}_{1},X^{*}_{2},\ldots,X^{*}_{K})$ , where each $X^{*}_{k}\in\mathcal{D}_{\mathrm{miss}}$ and the group size $K$ is fixed. The indices represented by $\bm{X}^{*}$ are assumed to be sampled without replacement from $\mathcal{D}_{\mathrm{miss}}$ , subject to the constraint that they must be in the same matrix column. Moreover, the sampling of $\bm{X}^{*}$ from $\mathcal{D}_{\mathrm{miss}}$ is guided by a distinct set of weights $\bm{w}^{*}=(w^{*}_{r,c})_{(r,c)\in[n_{r}]\times[n_{c}]}$ , which are also assumed to be known for now. We will discuss at the end of this section the implications of the choice of $\bm{w}^{*}$ .

To ensure that the sampling model for $\bm{X}^{*}$ is fully well-defined, we must account for the possibility that some matrix columns may have fewer than $K$ missing entries. Therefore, we consider a pruned set of missing indices $\bar{\mathcal{D}}_{\mathrm{miss}}\subseteq\mathcal{D}_{\mathrm{miss}}$ , defined as:

\begin{equation}
\bar{\mathcal{D}}_{\mathrm{miss}}=\{(r,c)\in\mathcal{D}_{\mathrm{miss}}:n^{c}_{\mathrm{miss}}\geq K\}, \tag{2}
\end{equation}

where $n^{c}_{\mathrm{miss}}=\lvert\{(r^{\prime},c^{\prime})\in\mathcal{D}_{\mathrm{miss}}:c^{\prime}=c\}\rvert$ is the number of missing entries in column $c$. Then, we assume that $\bm{X}^{*}$ is sampled according to

\begin{equation}
\bm{X}^{*}\mid\mathcal{D}_{\mathrm{obs}},\mathcal{D}_{\mathrm{miss}}\sim\Psi^{\text{col}}(K,\bar{\mathcal{D}}_{\mathrm{miss}},\bm{w}^{*}), \tag{3}
\end{equation}

where $\Psi^{\text{col}}$ is a constrained version of $\Psi$ that samples a group of indices belonging to the same column. This distribution is equivalent to the following sequential sampling procedure:

\begin{equation}
\begin{split}
X^{*}_{1}\mid\mathcal{D}_{\mathrm{obs}},\mathcal{D}_{\mathrm{miss}}&\sim\Psi(1,\bar{\mathcal{D}}_{\mathrm{miss}},\bm{w}^{*}),\\
X^{*}_{2}\mid\mathcal{D}_{\mathrm{obs}},\mathcal{D}_{\mathrm{miss}},X^{*}_{1}&\sim\Psi(1,\bar{\mathcal{D}}_{\mathrm{miss}}\setminus\{X^{*}_{1}\},\widetilde{\bm{w}}^{*}),\\
&\;\;\vdots\\
X^{*}_{K}\mid\mathcal{D}_{\mathrm{obs}},\mathcal{D}_{\mathrm{miss}},X^{*}_{1},\ldots,X^{*}_{K-1}&\sim\Psi(1,\bar{\mathcal{D}}_{\mathrm{miss}}\setminus\{X^{*}_{1},\ldots,X^{*}_{K-1}\},\widetilde{\bm{w}}^{*}),
\end{split}
\tag{4}
\end{equation}

where the weights $\widetilde{\bm{w}}^{*}$ are given by $\widetilde{w}^{*}_{r,c}=w^{*}_{r,c}\mathbb{I}\left[c=X^{*}_{1,2}\right]$ and $X^{*}_{1,2}$ is the column of $X^{*}_{1}$ .
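The pruning step in (2) and the sequential procedure in (4) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `prune_missing` and `sample_psi_col` are hypothetical, the test weights are assumed strictly positive, and the toy set of missing indices is made up for the example.

```python
import random
from collections import Counter

def prune_missing(d_miss, K):
    """Keep the missing indices whose column has at least K missing entries, as in (2)."""
    counts = Counter(c for _, c in d_miss)
    return [(r, c) for (r, c) in d_miss if counts[c] >= K]

def sample_psi_col(K, d_miss, w_star, rng=random):
    """Sample K distinct missing indices from one column, following (4)."""
    pruned = prune_missing(d_miss, K)
    if not pruned:
        raise ValueError("no column has at least K missing entries")
    # First draw: weighted sampling over all pruned missing indices.
    first = rng.choices(pruned, weights=[w_star[x] for x in pruned], k=1)[0]
    group = [first]
    # Later draws: the weights are zero outside the column of the first draw,
    # so it suffices to sample among the remaining indices of that column.
    pool = [x for x in pruned if x[1] == first[1] and x != first]
    while len(group) < K:
        w = [w_star[x] for x in pool]
        pick = rng.choices(range(len(pool)), weights=w, k=1)[0]
        group.append(pool.pop(pick))
    return tuple(group)

# Toy example: (row, column) indices of missing entries in a 3 x 3 matrix.
d_miss = [(1, 1), (2, 1), (3, 1), (1, 2), (2, 3), (3, 3)]
w_star = {x: 1.0 for x in d_miss}  # uniform test weights, for illustration only
pruned = prune_missing(d_miss, K=2)  # drops (1, 2): column 2 has one missing entry
group = sample_psi_col(2, d_miss, w_star, random.Random(1))
```

Restricting the pool to the column of the first draw is equivalent to using the zeroed-out weights $\widetilde{\bm{w}}^{*}$, since indices with zero weight are never selected.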

Note that the sequential sampling procedure in (4) requires the existence of at least one column with at least $K$ unobserved entries. This is always the case as long as $n_{\mathrm{obs}}<n_{c}(n_{r}-K+1)$, which is a reasonable assumption in applications where $\bm{M}$ is only sparsely observed.
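To see why this condition suffices, one can argue by contraposition; the short derivation below is a sketch using the notation $n^{c}_{\mathrm{miss}}$ from (2). If every column had fewer than $K$ unobserved entries, that is, $n^{c}_{\mathrm{miss}}\leq K-1$ for all $c\in[n_{c}]$, then counting the observed entries column by column would give

```latex
\[
n_{\mathrm{obs}}
=\sum_{c=1}^{n_{c}}\left(n_{r}-n^{c}_{\mathrm{miss}}\right)
\geq\sum_{c=1}^{n_{c}}\left(n_{r}-K+1\right)
=n_{c}\,(n_{r}-K+1),
\]
```

which contradicts $n_{\mathrm{obs}}<n_{c}(n_{r}-K+1)$. Hence some column must contain $K$ or more unobserved entries.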

This premise allows us to state our goal formally. For a given nominal level $\alpha\in(0,1)$, corresponding to a target coverage of $1-\alpha$, we seek a joint confidence region, denoted as $\widehat{\bm{C}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)\subseteq\mathbb{R}^{K}$, for the $K$ missing matrix entries indexed by $\bm{X}^{*}$. Crucially, this confidence region should be informative (i.e., not too wide) and guarantee finite-sample simultaneous coverage, in the sense that

\begin{equation}
\mathbb{P}\left[M_{r,c}\in\widehat{C}_{r,c}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha),\;\forall(r,c)\in\{X^{*}_{1},\ldots,X^{*}_{K}\}\right]\geq 1-\alpha. \tag{5}
\end{equation}

Note that the probability in (5) is taken with respect to both $\bm{X}_{\mathrm{obs}}$ , which is sampled according to (1), and $\bm{X}^{*}$ , which is sampled according to (4), while $\bm{M}$ is fixed.

We conclude this section by discussing the important distinction between the sampling weights $\bm{w}$ in (1) and the test weights $\bm{w}^{*}$ in (3), which have different interpretations and purposes. Intuitively, the role of $\bm{w}$ is to model situations in which the matrix is not observed uniformly at random. For example, in a collaborative filtering context, some types of users may be more engaged and certain movies tend to receive more ratings. Such patterns can be captured by the model in (1) using heterogeneous weights. Further, non-uniform missingness patterns are often apparent from the observed data, making it possible to estimate $\bm{w}$ empirically \citepgui2023conformalized, as explained in Appendix A2.4.

By contrast, $\bm{w}^{*}$ may be independently fixed by the practitioner, and its role is to control the interpretation of the coverage guarantee in (5). To understand this, consider the following two examples. If all $w^{*}_{r,c}=1$ , Equation (5) offers coverage only in a marginal sense, for $\bm{X}^{*}$ drawn uniformly at random from the missing portion of the matrix \citepgui2023conformalized. If $w^{*}_{r,c}=1$ if and only if $c\in\mathcal{A}$ , for the subset $\mathcal{A}\subset[n_{c}]$ corresponding to “action movies”, Equation (5) can be interpreted as ensuring coverage specifically for action movies. The latter is a stronger type of conditional guarantee \citepromano2020malice, which may be appealing if one indeed cares especially about action movies. Thus, $\bm{w}^{*}$ generally allows one to place more or less emphasis on certain unobserved portions of the matrix, interpolating between marginal and conditional views. Of course, there is a trade-off. As we will see empirically, the price of stronger theoretical guarantees obtained with more concentrated weight vectors $\bm{w}^{*}$ tends to take the form of wider (less informative) confidence regions \citepfoygel2021limits, which is why some flexibility in the choice of $\bm{w}^{*}$ is desirable.

3 Methods

This section describes the key components of our method, which we call Structured Conformalized Matrix Completion (SCMC). Section 3.1 gives a high-level overview of SCMC and outlines it in Algorithm 1. Section 3.2 details the construction of the calibration set utilized by SCMC. Section 3.3 presents a generalized quantile inflation lemma that provides the main theoretical building block for our simultaneous coverage results. Section 3.4 characterizes precisely the conformalization weights needed to apply our quantile inflation lemma in the context of SCMC. Section 3.5 establishes our lower and upper simultaneous coverage bounds. Important computational shortcuts pertaining to the evaluation of our conformalization weights are postponed to Section 4.

3.1 Method Outline

Having observed the matrix entries indexed by $\mathcal{D}_{\mathrm{obs}}\subset[n_{r}]\times[n_{c}]$, SCMC partitions $\mathcal{D}_{\mathrm{obs}}$ into two disjoint subsets: a training set $\mathcal{D}_{\mathrm{train}}$ and a calibration set $\mathcal{D}_{\mathrm{cal}}$, so that $\mathcal{D}_{\mathrm{obs}}=\mathcal{D}_{\mathrm{train}}\cup\mathcal{D}_{\mathrm{cal}}$. However, departing from the standard approach in (split) conformal inference, we do not partition the data completely at random. On the contrary, since we want the calibration set to exhibit a structure similar to that of the target group $\bm{X}^{*}$, we form $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{cal}}$ using a more sophisticated approach, the details of which are explained later in Section 3.2.

After appropriately partitioning the observations into $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{cal}}$ , SCMC trains a matrix completion algorithm using only the data in $\mathcal{D}_{\mathrm{train}}$ , producing a point estimate $\widehat{\bm{M}}(\mathcal{D}_{\mathrm{train}})$ of the full matrix $\bm{M}$ . Any matrix completion algorithm can be applied for this purpose. For example, if $\bm{M}$ is suspected to have an underlying low-rank structure, it may be reasonable to follow a classical convex nuclear norm minimization approach \citepcandes_exact_2008, computing

\[
\widehat{\bm{M}}(\mathcal{D}_{\mathrm{train}})=\underset{\bm{Z}\in\mathbb{R}^{n_{r}\times n_{c}}}{\arg\min}\lVert\bm{Z}\rVert_{*}\qquad\text{subject to}\qquad\mathcal{P}_{\mathcal{D}_{\mathrm{train}}}(\bm{Z})=\mathcal{P}_{\mathcal{D}_{\mathrm{train}}}(\bm{M}),
\]

where $\lVert\cdot\rVert_{*}$ denotes the nuclear norm and $\mathcal{P}_{\mathcal{D}_{\mathrm{train}}}(\bm{M})$ is the orthogonal projection of $\bm{M}$ onto the subspace of matrices that vanish outside the index set $\mathcal{D}_{\mathrm{train}}$ .

Beyond convex optimization, our method can be combined with any matrix completion algorithm, including those based on non-convex factorization \citepsun2016guaranteed or deep learning \citepsedhain2015autorec,fan2017deep. While SCMC tends to produce more informative confidence regions if $\widehat{\bm{M}}(\mathcal{D}_{\mathrm{train}})$ estimates $\bm{M}$ more accurately, its coverage guarantee will require no assumptions on how $\widehat{\bm{M}}$ is derived from $\mathcal{D}_{\mathrm{train}}$ .

Our method translates any black-box estimate $\widehat{\bm{M}}(\mathcal{D}_{\mathrm{train}})$ into confidence regions for the missing entries as follows. Let $\mathcal{C}$ be a pre-specified set-valued function, termed prediction rule, that takes as input $\widehat{\bm{M}}$ , a list of $K$ target indices $\bm{x}^{*}=(x^{*}_{1},\ldots,x^{*}_{K})$ , and a parameter $\tau\in[0,1]$ , and outputs $\mathcal{C}(\bm{x}^{*},\tau,\widehat{\bm{M}})\subseteq\mathbb{R}^{K}$ . (We will often make the dependence of $\widehat{\bm{M}}(\mathcal{D}_{\mathrm{train}})$ on $\mathcal{D}_{\mathrm{train}}$ implicit.) Our method is flexible in the choice of the prediction rule, but we generally require that this function be monotone increasing in $\tau$ , in the sense that

\begin{equation}
\mathcal{C}(\bm{x}^{*},\tau_{1},\widehat{\bm{M}})\subseteq\mathcal{C}(\bm{x}^{*},\tau_{2},\widehat{\bm{M}}),\quad\text{almost-surely if }\tau_{1}<\tau_{2}, \tag{6}
\end{equation}

and satisfies the following boundary conditions almost-surely:

\begin{equation}
\mathcal{C}(\bm{x}^{*},0,\widehat{\bm{M}})=\left\{\left(\widehat{\bm{M}}_{x^{*}_{1}},\ldots,\widehat{\bm{M}}_{x^{*}_{K}}\right)\right\},\qquad\lim_{\tau\to 1^{-}}\mathcal{C}(\bm{x}^{*},\tau,\widehat{\bm{M}})=\mathbb{R}^{K}. \tag{7}
\end{equation}

Intuitively, $\tau=0$ corresponds to placing absolute confidence in the accuracy of $\widehat{\bm{M}}$ , while approaching $\tau=1$ suggests that the point estimate carries no information about $\bm{M}$ .

For example, a simple prediction rule that satisfies the aforementioned requirements is

\begin{equation}
\mathcal{C}(\bm{x}^{*},\tau,\widehat{\bm{M}})=\left(\widehat{\bm{M}}_{x^{*}_{1}}\pm\frac{\tau}{1-\tau},\ldots,\widehat{\bm{M}}_{x^{*}_{K}}\pm\frac{\tau}{1-\tau}\right), \tag{8}
\end{equation}

which produces regions in the shape of a hyper-cube. This approach will be utilized in our numerical experiments due to its ease of interpretation, but it is of course not unique. See Appendix A2.1 for further details and additional examples of alternative prediction rules.
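As a quick check, offered as a sketch rather than a formal proof, the hyper-cube rule in (8) satisfies both (6) and (7) when each $\pm$ is read as a closed interval. Its half-width $h(\tau)=\tau/(1-\tau)$, where the symbol $h$ is introduced only for this illustration, obeys

```latex
\[
h(0)=0,
\qquad
h'(\tau)=\frac{1}{(1-\tau)^{2}}>0\ \text{ for }\tau\in[0,1),
\qquad
\lim_{\tau\to 1^{-}}h(\tau)=+\infty .
\]
```

Thus the region collapses to the point estimate at $\tau=0$, the intervals are nested as $\tau$ grows, and they expand to cover all of $\mathbb{R}^{K}$ as $\tau\to 1^{-}$.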

The purpose of the observations indexed by $\mathcal{D}_{\mathrm{cal}}$, which were not used to train $\widehat{\bm{M}}$, is to find the smallest possible $\tau$ needed to achieve simultaneous coverage (5). As detailed in Section 3.2, SCMC carefully constructs $\mathcal{D}_{\mathrm{cal}}$ so that it gives us a set of $n$ calibration groups $\{\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}\}$, where each $\bm{X}^{\mathrm{cal}}_{i}$ consists of $K$ observed matrix entries within the same column; i.e., $\bm{X}^{\mathrm{cal}}_{i}=(X^{\mathrm{cal}}_{i,1},\ldots,X^{\mathrm{cal}}_{i,K})$. As explained in the next section, $n$ can be fixed arbitrarily, although it should be small compared to the total number of observed matrix entries and typically larger than 100 to avoid excessively high variance in the results \citepvovk2012conditional,sesia2020comparison. Intuitively, these calibration groups are constructed in such a way as to (approximately) simulate the structure of $\bm{X}^{*}$.

For each calibration group, we compute a conformity score $S_{i}=S(\bm{X}^{\mathrm{cal}}_{i})$ , defined as the smallest value of $\tau$ for which the candidate confidence region covers all $K$ entries of $\bm{M}_{\bm{X}^{\mathrm{cal}}_{i}}$ :

\begin{equation}
S_{i}:=\inf\left\{\tau\in\mathbb{R}:\bm{M}_{\bm{X}^{\mathrm{cal}}_{i}}\in\mathcal{C}(\bm{X}^{\mathrm{cal}}_{i},\tau,\widehat{\bm{M}})\right\}. \tag{9}
\end{equation}
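For the hyper-cube rule in (8), the infimum in (9) has a simple closed form: if $r$ denotes the largest absolute residual within the group, then solving $r=\tau/(1-\tau)$ gives $S_{i}=r/(1+r)$. The Python sketch below illustrates this under the assumption that the intervals in (8) are closed; the function names and the toy ratings are hypothetical.

```python
def hypercube_region(m_hat_group, tau):
    """Intervals of the hyper-cube rule (8) around the point estimates."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    half_width = tau / (1.0 - tau)
    return [(m - half_width, m + half_width) for m in m_hat_group]

def hypercube_score(m_true_group, m_hat_group):
    """Conformity score (9) for rule (8): the smallest tau whose hyper-cube
    covers every true entry in the group."""
    r = max(abs(a - b) for a, b in zip(m_true_group, m_hat_group))
    return r / (1.0 + r)  # inverse of tau -> tau / (1 - tau)

def covers(region, values):
    """True if every value lies in its (closed) interval."""
    return all(lo <= v <= hi for (lo, hi), v in zip(region, values))

# Toy calibration group with K = 3 observed ratings and their estimates.
m_true = [4.0, 5.0, 3.0]
m_hat = [3.5, 4.0, 3.2]
score = hypercube_score(m_true, m_hat)  # largest residual is 1.0, so score is 0.5
region = hypercube_region(m_hat, score)
```

Because the score is driven by the worst residual in the group, a single poorly predicted entry is enough to make the whole group's score large, which is what allows the calibrated region to cover all $K$ entries simultaneously.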

Then, the calibrated value $\widehat{\tau}_{\alpha,K}$ of $\tau$ is obtained by evaluating the following weighted quantile \citeptibshirani-covariate-shift-2019 of the empirical distribution of the calibration scores:

\begin{equation}
\widehat{\tau}_{\alpha,K}=Q\Big(1-\alpha;\sum_{i=1}^{n}p_{i}\delta_{S_{i}}+p_{n+1}\delta_{\infty}\Big). \tag{10}
\end{equation}

Above, $Q(1-\alpha;F)$ denotes the $1-\alpha$ quantile of a distribution $F$ on the augmented real line $\mathbb{R}\cup\{\infty\}$; that is, for $S\sim F$, $Q(\beta;F)=\inf{\left\{s\in\mathbb{R}:\mathbb{P}\left(S\leq s\right)\geq\beta\right\}}$. The distribution in (10) places a point mass $p_{i}$ on each observed value of $S_{i}$ and an additional point mass $p_{n+1}$ at $+\infty$. The expression of the weights $p_{i}$ and $p_{n+1}$ will be given in Section 3.4. These weights generally depend on $\bm{X}^{*}$ and on all $\bm{X}^{\mathrm{cal}}_{i}$, although this dependence is kept implicit here for simplicity.
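The weighted quantile in (10) can be evaluated by sorting the scores and accumulating their weights, as in the following Python sketch. The function name `weighted_quantile` is hypothetical, and the uniform weights in the example are placeholders chosen only to make the arithmetic easy to follow; the actual conformalization weights are those of Section 3.4.

```python
import math

def weighted_quantile(scores, p, p_inf, beta, tol=1e-12):
    """Quantile Q(beta; F) of F = sum_i p[i] * delta_{scores[i]} + p_inf * delta_{inf}.

    Returns the smallest score s with P(S <= s) >= beta (up to a small
    floating-point tolerance), or +inf if the finite scores carry less than
    beta of the total mass.
    """
    if len(scores) != len(p):
        raise ValueError("scores and p must have the same length")
    if abs(sum(p) + p_inf - 1.0) > 1e-9:
        raise ValueError("the weights must sum to one")
    cumulative = 0.0
    for s, w in sorted(zip(scores, p)):
        cumulative += w
        if cumulative >= beta - tol:
            return s
    return math.inf

# Toy example with n = 4 calibration scores and placeholder uniform weights.
scores = [0.2, 0.5, 0.3, 0.9]
p = [0.2, 0.2, 0.2, 0.2]
p_inf = 0.2
tau_50 = weighted_quantile(scores, p, p_inf, beta=0.5)  # 0.5
tau_80 = weighted_quantile(scores, p, p_inf, beta=0.8)  # 0.9
tau_90 = weighted_quantile(scores, p, p_inf, beta=0.9)  # inf
```

The last call shows the role of the point mass at $+\infty$: when the finite scores do not carry enough mass to reach level $\beta$, the quantile is infinite and the resulting region is all of $\mathbb{R}^{K}$.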

Finally, the calibrated parameter $\widehat{\tau}_{\alpha,K}$ is utilized to construct a joint confidence region

\begin{equation}
\widehat{\bm{C}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)=\mathcal{C}(\bm{X}^{*},\widehat{\tau}_{\alpha,K},\widehat{\bm{M}}). \tag{11}
\end{equation}

This will be proved in Section 3.5 to have valid simultaneous coverage (5), as long as $\bm{X}_{\mathrm{obs}}$ is sampled from (1) and $\bm{X}^{*}$ from (4). The overall procedure is summarized by Algorithm 1, while all missing details will be carefully explained in the subsequent sections.

Algorithm 1 Simultaneous Conformalized Matrix Completion (SCMC)

1: Input: partially observed matrix $\bm{M}_{\bm{X}_{\mathrm{obs}}}$, with unordered list of observed indices $\mathcal{D}_{\mathrm{obs}}$;
2: Input: test group $\bm{X}^{*}$; nominal coverage level $\alpha\in(0,1)$;
3: Input: any matrix completion algorithm producing point estimates;
4: Input: any prediction rule $\mathcal{C}$ satisfying (6) and (7);
5: Input: desired number $n$ of calibration groups.
6: Apply Algorithm 2 to obtain $\mathcal{D}_{\mathrm{train}}$ and the calibration groups $(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n})$ in $\mathcal{D}_{\mathrm{cal}}$;
7: Compute a point estimate $\widehat{\bm{M}}$, looking only at the observations in $\mathcal{D}_{\mathrm{train}}$.
8: Compute the conformity scores $S_{i}$, for all $i\in[n]$, with Equation (9).
9: Compute $\widehat{\tau}_{\alpha,K}$ in (10), based on the weights $p_{i}$ given by (19) in Section 3.4.
10: Output: Joint confidence region $\widehat{\bm{C}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ given by Equation (11).

3.2 Assembling the Structured Calibration Set

This section explains how to partition $\mathcal{D}_{\mathrm{obs}}$ into a training set $\mathcal{D}_{\mathrm{train}}$ and a collection of calibration groups $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n}$ that approximately mimic the structure of $\bm{X}^{*}$ . To begin, we note that the number $n$ of calibration groups cannot exceed $\lfloor n_{\mathrm{obs}}/K\rfloor$ and, further,

\displaystyle n\leq\sum_{c=1}^{n_{c}}\left\lfloor\frac{n^{c}_{\mathrm{obs}}}{K}\right\rfloor\eqqcolon\xi_{\mathrm{obs}},

(12)

where $n^{c}_{\mathrm{obs}}$ is the number of observed entries in column $c\in[n_{c}]$ , which is a function of $\bm{X}_{\mathrm{obs}}$ in (1). To satisfy these constraints, as a practical rule-of-thumb one may set $n=\min\{1000,\lfloor\xi_{\mathrm{obs}}/2\rfloor\}$ . In the following, we will assume that $n$ is a fixed parameter (e.g., $n=1000$ ) guaranteed to satisfy the upper bound in (12). This simplification streamlines the analysis of SCMC without much loss of generality. In principle, it would also be possible to set $n$ in a data-independent way so that (12) holds with high probability, as long as $K$ is not too large compared to $n_{\mathrm{obs}}$ and $n_{r}n_{c}$ .
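The bound (12) and the rule of thumb above amount to a few integer operations. The sketch below is illustrative only; the function name `rule_of_thumb_n` and the toy column counts are hypothetical.

```python
def rule_of_thumb_n(obs_per_column, K, cap=1000):
    """Return (xi_obs, n), with xi_obs as in (12) and n = min(cap, floor(xi_obs / 2))."""
    xi_obs = sum(count // K for count in obs_per_column)
    return xi_obs, min(cap, xi_obs // 2)

# Four columns with 7, 12, 3, and 20 observed entries, and groups of size K = 5.
xi_obs, n = rule_of_thumb_n([7, 12, 3, 20], K=5)
```

Here the columns can host 1, 2, 0, and 4 disjoint groups respectively, so $\xi_{\mathrm{obs}}=7$ and the rule of thumb gives $n=3$.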

For any given $n$ satisfying (12), we partition $\mathcal{D}_{\mathrm{obs}}$ into a training set $\mathcal{D}_{\mathrm{train}}$ and a collection of $n$ calibration groups $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n}$ as detailed in Algorithm 2. After initializing an empty $\mathcal{D}_{\mathrm{train}}$ , we iterate over each column $c$ and assign to $\mathcal{D}_{\mathrm{train}}$ a random subset of $m^{c}:=n^{c}_{\mathrm{obs}}\bmod K$ observations from that column, where $n^{c}_{\mathrm{obs}}$ is the total number of observations in column $c$ . This preliminary step ensures that the remaining number of observations in column $c$ is a multiple of $K$ (possibly zero). Then, for each $i\in[n]$ , $\bm{X}^{\mathrm{cal}}_{i}$ is obtained by sampling $K$ observations uniformly without replacement from a randomly chosen matrix column. Finally, all remaining observations are assigned to $\mathcal{D}_{\mathrm{train}}$ .

Algorithm 2 Assembling the structured calibration set for Algorithm 1

1: Input: Set $\mathcal{D}_{\mathrm{obs}}$ of $n_{\mathrm{obs}}$ observed entries; number $n$ of calibration groups; group size $K$.
2: Initialize an empty set of matrix indices, $\mathcal{D}_{\mathrm{prune}}=\emptyset$.
3: for all columns $c\in[n_{c}]$ do
4: Define $m^{c}:=n^{c}_{\mathrm{obs}}\bmod K$.
5: if $m^{c}\neq 0$ then
6: Sample $m^{c}$ indices $\left(I_{1},\dots,I_{m^{c}}\right)\sim\Psi(m^{c},\mathcal{D}_{\mathrm{obs}}\cap([n_{r}]\times\{c\}),\bm{1})$.
7: Add the entry indices $\left\{I_{1},\dots,I_{m^{c}}\right\}$ to $\mathcal{D}_{\mathrm{prune}}$.
8: end if
9: end for
10: Initialize a set of available observed matrix indices, $\mathcal{D}_{\mathrm{avail}}=\mathcal{D}_{\mathrm{obs}}\setminus\mathcal{D}_{\mathrm{prune}}$.
11: Initialize an empty set of matrix index groups, $\mathcal{D}_{\mathrm{cal}}=\emptyset$.
12: for $i\in[n]$ do
13: Sample $\bm{X}^{\mathrm{cal}}_{i}=(X^{\mathrm{cal}}_{i,1},\dots,X^{\mathrm{cal}}_{i,K})\sim\Psi^{\text{col}}(K,\mathcal{D}_{\mathrm{avail}},\bm{1})$, with $\Psi^{\text{col}}$ defined as in (3).
14: Insert $\bm{X}^{\mathrm{cal}}_{i}$ into $\mathcal{D}_{\mathrm{cal}}$. Remove $\{X^{\mathrm{cal}}_{i,1},\dots,X^{\mathrm{cal}}_{i,K}\}$ from $\mathcal{D}_{\mathrm{avail}}$.
15: end for
16: Define: $\mathcal{D}_{\mathrm{train}}=\mathcal{D}_{\mathrm{prune}}\cup\mathcal{D}_{\mathrm{avail}}$.
17: Output: Set of calibration groups $\mathcal{D}_{\mathrm{cal}}=\{\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}\}$; training set $\mathcal{D}_{\mathrm{train}}\subset\mathcal{D}_{\mathrm{obs}}$.

Algorithm 2 intuitively mimics the sampling model for $\bm{X}^{*}$ defined in (3), with the key difference that it samples the calibration groups from $\mathcal{D}_{\mathrm{obs}}$ instead of $\mathcal{D}_{\mathrm{miss}}$. This unavoidable discrepancy is delicate because it implies that $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n}$ are neither exchangeable nor weighted exchangeable \citep{tibshirani-covariate-shift-2019} with the test group $\bm{X}^{*}$. Therefore, an innovative approach is needed to translate these calibration groups into valid simultaneous confidence regions, as explained in the next section.
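The partitioning performed by Algorithm 2 can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it assumes uniform weights (the $\bm{1}$ argument above) and assumes that $\Psi^{\text{col}}$ draws the first index from all available entries and the remaining $K-1$ from the same column, which under uniform weights is equivalent to choosing a column with probability proportional to its number of available entries. The function name `assemble_calibration` is hypothetical.

```python
import numpy as np

def assemble_calibration(obs_indices, n, K, seed=0):
    """Split observed (row, col) indices into a training set and n calibration
    groups of K entries each from a single column (uniform weights assumed)."""
    rng = np.random.default_rng(seed)
    by_col = {}
    for r, c in obs_indices:
        by_col.setdefault(c, []).append((r, c))
    # The bound (12) must hold for the requested number of groups.
    assert n <= sum(len(v) // K for v in by_col.values())

    train = []
    for c, entries in by_col.items():
        rng.shuffle(entries)                     # random order within the column
        m_c = len(entries) % K
        train.extend(entries[:m_c])              # pruned entries go to training
        by_col[c] = entries[m_c:]                # remaining count is a multiple of K

    groups = []
    for _ in range(n):
        cols = [c for c, v in by_col.items() if v]
        sizes = np.array([len(by_col[c]) for c in cols], dtype=float)
        # Column chosen with probability proportional to its available entries.
        c = cols[rng.choice(len(cols), p=sizes / sizes.sum())]
        groups.append(by_col[c][:K])             # entries are already shuffled
        by_col[c] = by_col[c][K:]

    for v in by_col.values():
        train.extend(v)                          # leftovers join the training set
    return train, groups

# Deterministic toy observation pattern on a 20 x 6 matrix (80 observed entries).
obs = [(r, c) for r in range(20) for c in range(6) if (r + 2 * c) % 3 != 0]
train, groups = assemble_calibration(obs, n=5, K=3)
```

Because pruning leaves a multiple of $K$ entries in every column, any column that still has available entries can always supply a full group.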

3.3 A General Quantile Inflation Lemma

Consider a conformity score $S^{*}$ , defined similarly to the scores $S_{i}$ in (9),

\displaystyle S^{*}:=\inf\left\{\tau\in\mathbb{R}:\bm{M}_{\bm{X}^{*}}\in\mathcal{C}(\bm{X}^{*},\tau,\widehat{\bm{M}})\right\}.

(13)

In words, $S^{*}$ is the smallest $\tau$ for which $\mathcal{C}(\bm{X}^{*},\tau,\widehat{\bm{M}})$ covers all $K$ entries of $\bm{M}_{\bm{X}^{*}}$ . Although this score cannot be observed because the matrix entries indexed by $\bm{X}^{*}$ are latent, it is a well-defined and useful quantity. It allows us to write the probability that the confidence region output by Algorithm 1 simultaneously covers all elements of $\bm{M}_{\bm{X}^{*}}$ as:

\displaystyle\mathbb{P}\left[\bm{M}_{\bm{X}^{*}}\in\mathcal{C}(\bm{X}^{*},\widehat{\tau}_{\alpha,K},\widehat{\bm{M}})\right]=\mathbb{P}\left[S^{*}\leq\widehat{\tau}_{\alpha,K}\right].

(14)

To establish that Algorithm 1 achieves simultaneous coverage (5), the right-hand side of (14) must be bounded from below by $1-\alpha$, for a suitable (and practical) choice of the weights $p_{i}$ and $p_{n+1}$ used to compute $\widehat{\tau}_{\alpha,K}$ in (10). This is not straightforward because the scores $S_{1},\ldots,S_{n},S^{*}$ are neither exchangeable nor weighted exchangeable, as they respectively depend on $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n}$ and $\bm{X}^{*}$. A solution is provided by the following lemma due to \citet{tibshirani-covariate-shift-2019}.

Lemma 1 (from \citet{tibshirani-covariate-shift-2019}).

Let $Z_{1},\ldots,Z_{n+1}$ be random variables with joint law $f$ . For any fixed function $s$ and $i\in[n+1]$ , define $V_{i}=s(Z_{i},Z_{-i})$ , where $Z_{-i}=\{Z_{1},\ldots,Z_{n+1}\}\setminus\{Z_{i}\}$ . Assume that $V_{1},\ldots,V_{n+1}$ are distinct almost surely. Define also

p^{f}_{i}(z_{1},\ldots,z_{n+1}):=\frac{\sum_{\sigma\in\mathcal{S}:\sigma(n+1)=i}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}{\sum_{\sigma\in\mathcal{S}}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})},\quad\forall i\in[n+1],

(15)

where $\mathcal{S}$ is the set of all permutations of $[n+1]$ . Then, for any $\beta\in(0,1)$ ,

\mathbb{P}\left[V_{n+1}\leq Q\bigg(\beta;\,\sum_{i=1}^{n}p^{f}_{i}(Z_{1},\ldots,Z_{n+1})\delta_{V_{i}}+p^{f}_{n+1}(Z_{1},\ldots,Z_{n+1})\delta_{\infty}\bigg)\right]\geq\beta.

Translating Lemma 1 into a practical method requires evaluating the weights $p^{f}_{i}$ defined in (15), which generally involves a sum over all $(n+1)!$ permutations and is therefore computationally infeasible. If the distribution $f$ satisfies a symmetry condition called “weighted exchangeability”, it was shown by \citet{tibshirani-covariate-shift-2019} that the expression in (15) simplifies greatly, but this is not helpful in our case because $(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*})$ do not enjoy such a property. Further, it is unclear how Algorithm 2 may be modified to achieve weighted exchangeability.

Fortunately, our groups satisfy a “leave-one-out exchangeability” property that still enables an efficient computation of the conformalization weights in (15). Intuitively, the joint distribution of $\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*}$ is invariant to the reordering of the first $n$ variables.

Proposition 1.

Let $\mathcal{D}_{\mathrm{obs}}$ and $\mathcal{D}_{\mathrm{miss}}$ be subsets of observed and missing matrix entries, respectively, sampled according to (1). Let $\bm{X}^{*}$ be a test group sampled according to (3) conditional on $\mathcal{D}_{\mathrm{obs}}$. Suppose $\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}$ are the calibration groups output by Algorithm 2, while $\mathcal{D}_{\mathrm{prune}}$ and $\mathcal{D}_{\mathrm{train}}$ are the corresponding pruned and training observation sets. Then, for any permutation $\sigma$ of $[n]$,

\displaystyle(\bm{X}^{\mathrm{cal}}_{1},\bm{X}^{\mathrm{cal}}_{2},\ldots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*})\overset{d}{=}(\bm{X}^{\mathrm{cal}}_{\sigma(1)},\bm{X}^{\mathrm{cal}}_{\sigma(2)},\ldots,\bm{X}^{\mathrm{cal}}_{\sigma(n)},\bm{X}^{*})\quad\mid\mathcal{D}_{\mathrm{prune}},\mathcal{D}_{\mathrm{train}}.

The proof of Proposition 1 is in Appendix A4.2. Intuitively, this is established by deriving the joint distribution of $(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*})$ conditional on $\mathcal{D}_{\mathrm{prune}}$ and $\mathcal{D}_{\mathrm{train}}$ . The usefulness of this result becomes clear in the light of the following specialized version of Lemma 1.

Lemma 2.

Let $Z_{1},\ldots,Z_{n+1}$ be leave-one-out exchangeable random variables, so that there exists a permutation-invariant function $g$ such that their joint law $f$ can be factorized as

\displaystyle f(Z_{1},\ldots,Z_{n+1})=g(\{Z_{1},\ldots,Z_{n+1}\})\cdot\bar{h}(\{Z_{1},\ldots,Z_{n}\},Z_{n+1}),

(16)

for some function $\bar{h}$ taking as first input an unordered set of $n$ elements. For any fixed function $s:\mathbb{R}^{n+1}\mapsto\mathbb{R}$ and any $i\in[n+1]$ , define $V_{i}=s(Z_{i},Z_{-i})$ , where $Z_{-i}=\{Z_{1},\ldots,Z_{n+1}\}\setminus\{Z_{i}\}$ . Assume that $V_{1},\ldots,V_{n+1}$ are almost-surely distinct. Then, $\forall\beta\in(0,1)$ ,

\displaystyle\mathbb{P}\bigg\{V_{n+1}\leq Q\bigg(\beta;\,\sum_{i=1}^{n}p_{i}(Z_{1},\ldots,Z_{n+1})\delta_{V_{i}}+p_{n+1}(Z_{1},\ldots,Z_{n+1})\delta_{\infty}\bigg)\bigg\}\geq\beta,

(17)

where

\displaystyle p_{i}(Z_{1},\ldots,Z_{n+1})=\frac{\bar{h}(Z_{-i},Z_{i})}{\sum_{j=1}^{n+1}\bar{h}(Z_{-j},Z_{j})}.

(18)

The proof of Lemma 2 is in Appendix A4.1. Note that the weights $p_{i}$ in (18) are relatively easy to compute because they involve a sum over only $n+1$ instead of $(n+1)!$ terms.
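To make the cost comparison concrete, the normalization in (18) can be sketched as below. The function name `loo_weights` and the toy choice of $\bar{h}$ are hypothetical and serve only to show that $n+1$ evaluations of $\bar{h}$ suffice.

```python
def loo_weights(z, h_bar):
    """Weights (18): p_i proportional to h_bar(Z_{-i}, Z_i), for i = 1, ..., n+1."""
    raw = [h_bar(z[:i] + z[i + 1:], z[i]) for i in range(len(z))]
    total = sum(raw)
    return [v / total for v in raw]

# Hypothetical h_bar depending only on its second argument, chosen purely to
# keep the arithmetic easy to follow.
p = loo_weights([1.0, 2.0, 3.0, 4.0], lambda rest, last: last)
```

With this toy choice the weights are simply proportional to the values themselves, that is, 0.1, 0.2, 0.3, and 0.4.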

3.4 Characterization of the Conformalization Weights

We now characterize explicitly the conformalization weights needed to apply Lemma 2 to our problem. The following notation is useful for this purpose.

Denote by $\bar{\mathcal{D}}_{\mathrm{obs}}$ an augmented version of $\mathcal{D}_{\mathrm{obs}}$ that also includes the unordered set of indices corresponding to the test group $\bm{X}^{*}=(X^{*}_{1},X^{*}_{2},\ldots,X^{*}_{K})$ ; i.e.,

\displaystyle\bar{\mathcal{D}}_{\mathrm{obs}}\coloneqq\mathcal{D}_{\mathrm{obs}}\cup\left(\cup_{k\in[K]}\{X^{*}_{k}\}\right).

Similarly, let $D_{\mathrm{obs}}$ and $\bar{D}_{\mathrm{obs}}$ denote possible realizations of $\mathcal{D}_{\mathrm{obs}}$ and $\bar{\mathcal{D}}_{\mathrm{obs}}$, respectively. Then, for any $i\in[n]$, denote by $D_{\mathrm{obs};i}$ the imaginary set of observations obtained by replacing the indices corresponding to the calibration group $\bm{X}^{\mathrm{cal}}_{i}$ with those corresponding to the test group $\bm{X}^{*}$. Further, let $D_{\mathrm{obs};n+1}\coloneqq D_{\mathrm{obs}}$ denote the original observation set. In summary,

\displaystyle D_{\mathrm{obs};i}\coloneqq\begin{cases}\bar{D}_{\mathrm{obs}}\setminus\cup_{k\in[K]}\{x_{i,k}\},&{\mathrm{for}\;i\in[n]},\\ D_{\mathrm{obs}},&{\mathrm{for}\;i=n+1},\end{cases}

where $\bm{x}_{i}=(x_{i,1},\ldots,x_{i,K})$ is a realization of $\bm{X}^{\mathrm{cal}}_{i}$ and $\bm{x}_{n+1}$ is a realization of $\bm{X}^{*}$ .

Next, let $n_{\mathrm{obs}}^{c}$ denote the number of observations in column $c$ from the set $D_{\mathrm{obs}}$. Define also $\bar{n}^{c}_{\mathrm{obs}}:=n^{c}_{\mathrm{obs}}-(n^{c}_{\mathrm{obs}}\bmod K)$, the corresponding number of observations remaining in column $c$ after the random pruning step of Algorithm 2. For any $i\in[n+1]$, let $c_{i}$ denote the column to which $\bm{x}_{i}$ belongs; i.e., $c_{i}\coloneqq x_{i,k,2},\forall k\in[K]$, where $x_{i,k,2}$ is the column of the $k$-th entry in $\bm{x}_{i}$. Further, let $\bar{D}_{\mathrm{miss}}$ denote a realization of $\bar{\mathcal{D}}_{\mathrm{miss}}$ in (2). With slight abuse of notation, we denote the set of missing indices in column $c_{n+1}$ excluding those in the group $\bm{x}_{n+1}$ as $D^{c_{n+1}}_{\mathrm{miss}}\setminus\bm{x}_{n+1}\coloneqq D^{c_{n+1}}_{\mathrm{miss}}\setminus\{x_{n+1,k}\}_{k=1}^{K}$. We are now ready to state how Lemma 2 applies in our setting, with an explicit expression for the conformalization weights in (18).

Lemma 3.

Under the setting of Proposition 1, let $Z_{1},\ldots,Z_{n},Z_{n+1}$ denote $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*}$ , and $V_{1},\ldots,V_{n},V_{n+1}$ represent the corresponding scores $S_{1},\ldots,S_{n},S^{*}$ given by (9) and (13), respectively, based on a matrix estimate $\widehat{\bm{M}}$ computed based on the observations in $\mathcal{D}_{\mathrm{train}}$ . Then, Equation (17) from Lemma 2 applies conditional on $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{prune}}$ , with weights

\displaystyle\begin{split}p_{i}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1})&% \propto\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{obs};i}% \right)\cdot\left(\widetilde{w}^{*}_{x_{i,1}}\prod\limits_{k=2}^{K}\widetilde{% w}^{*}_{x_{i,k}}\right)\\ &\quad\cdot\left[\cfrac{\mbinom{n^{c_{i}}_{\mathrm{obs}}}{\bar{n}^{c_{i}}_{% \mathrm{obs}}}}{\mbinom{n^{c_{i}}_{\mathrm{obs}}-K}{\bar{n}^{c_{i}}_{\mathrm{% obs}}-K}}\cdot\cfrac{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}}{\bar{n}^{c_{n+1}}_{% \mathrm{obs}}}}{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}+K}{\bar{n}^{c_{n+1}}_{% \mathrm{obs}}+K}}\cdot\prod\limits_{k=1}^{K-1}\frac{\bar{n}^{c_{i}}_{\mathrm{% obs}}-k}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K-k}\right]^{\mathbb{I}\left[c_{i}% \neq c_{n+1}\right]}.\end{split}

(19)

Above, $\widetilde{w}^{*}_{x_{i,1}}$ and $\widetilde{w}^{*}_{x_{i,k}}$ have explicit expressions that depend on the weights $\bm{w}^{*}$ in (3); i.e.,

\displaystyle\widetilde{w}^{*}_{x_{i,1}}=\frac{w^{*}_{x_{i,1}}}{\sum_{(r,c)\in% \bar{D}_{\mathrm{miss}}}w^{*}_{r,c}-\sum\limits_{k=1}^{K}\left(w^{*}_{x_{n+1,k% }}-w^{*}_{x_{i,k}}\right)+u^{*}_{x_{i,1}}},

(20)

with

\displaystyle u^{*}_{x_{i,1}}=\mathbb{I}\left[c_{i}\neq c_{n+1}\right]\mathbb{% I}\left[n^{c_{i}}_{\mathrm{miss}}<K\right]\left(\sum\limits_{(r,c)\in D^{c_{i}% }_{\mathrm{miss}}}w^{*}_{r,c}-\mathbb{I}\left[n^{c_{n+1}}_{\mathrm{miss}}<2K% \right]\sum\limits_{(r,c)\in D^{c_{n+1}}_{\mathrm{miss}}\setminus\bm{x}_{n+1}}% w^{*}_{r,c}\right),

and, for all $k\in\{2,\ldots,K\}$ ,

\displaystyle\widetilde{w}^{*}_{x_{i,k}}=\frac{w^{*}_{x_{i,k}}}{\sum\limits_{(% r,c)\in D^{c_{i}}_{\mathrm{miss}}}w^{*}_{r,c}+\sum\limits_{k^{\prime}=k}^{K}w^% {*}_{x_{i,k^{\prime}}}-\mathbb{I}\left[c_{i}=c_{n+1}\right]\sum\limits_{k^{% \prime}=1}^{K}w^{*}_{x_{n+1,k^{\prime}}}}.

(21)

The main challenge in the computation of (19) arises from the term $\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{obs};i}\right)$ , which is the probability of observing the matrix entries in $D_{\mathrm{obs};i}$ and depends on the sampling weights $\bm{w}$ in (1). Although this probability cannot be evaluated analytically, it can be approximated with an efficient algorithm, which makes it possible to compute the conformalization weights in (19) at cost $\mathcal{O}(n_{r}n_{c}+nK)$ , as explained in Section 4.

3.5 Finite-Sample Coverage Bounds

The following theorem states formally that Algorithm 1 produces joint confidence regions with simultaneous coverage for random groups $\bm{X}^{*}$ sampled according to the model defined in (3). This result follows by integrating Proposition 1, Lemma 2, and Equation (19).

Theorem 1.

Suppose $\mathcal{D}_{\mathrm{obs}}$ and $\mathcal{D}_{\mathrm{miss}}$ are sampled according to (1). Let $\bm{X}^{*}$ be a test group sampled according to (3) conditional on $\mathcal{D}_{\mathrm{obs}}$ . Then, for any fixed level $\alpha\in(0,1)$ , the joint confidence region output by Algorithm 1 satisfies (5) conditional on $\mathcal{D}_{\mathrm{train}},\mathcal{D}_{\mathrm{prune}}$ :

\displaystyle\mathbb{P}\left[\bm{M}_{\bm{X}^{*}}\in\widehat{\bm{C}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)\mid\mathcal{D}_{\mathrm{train}},\mathcal{D}_{\mathrm{prune}}\right]\geq 1-\alpha.

Note that the probability in Theorem 1 is taken over the randomness in $\{\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}\}$ and $\bm{X}^{*}$ , while $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{prune}}$ can be considered fixed. Therefore, this result implies the simultaneous coverage property stated earlier in (5). Further, it is also possible to bound our simultaneous coverage from above.

Theorem 2.

Under the same setting as Theorem 1,

\displaystyle\mathbb{P}\left[\bm{M}_{\bm{X}^{*}}\in\widehat{\bm{C}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)\right]\leq 1-\alpha+\mathbb{E}\left[{\max_{i\in[n+1]}p_{i}(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*})}\right],

(22)

where the conformalization weights $p_{i}$ are given by (19) and the expectations can also be taken conditional on $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{prune}}$ , as in Theorem 1.

Theorem 2 is proved in Appendix A4.4. A numerical investigation of the expected value on the right-hand-side of (22), conducted in Appendix A3.2, demonstrates that in practice the upper bound in (22) converges to $1-\alpha$ as $n$ increases. This is consistent with our empirical observations that Algorithm 1 is not too conservative, as previewed in Figure 1.

4 Computational Shortcuts and Cost Analysis

4.1 Efficient Evaluation of the Conformalization Weights

We now explain how to efficiently approximate the conformalization weights $p_{i}$ in (19), for all $i\in[n+1]$. The main challenge is to evaluate $\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{obs};i}\right)$ according to the missingness model defined in (1). In fact, it suffices to relate this probability, which depends on the index $i\in[n+1]$, to $\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{obs}}\right)$, which is constant and can thus be ignored when computing (19). In this section, we demonstrate that their ratio can be expressed in a much more tractable form, one whose computational complexity does not increase with the matrix dimensions.

We begin by expressing $\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{obs}}\right)$ and $\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs};i})$, for any $i\in[n+1]$, as closed-form integrals. Let $\delta=\sum_{(r,c)\in D_{\mathrm{miss}}}w_{r,c}$ denote the cumulative weight of all missing indices and, for any positive scaling parameter $h>0$, define the function $\Phi(\tau;h)$ of $\tau\in(0,1]$ as

\displaystyle\Phi(\tau;h):=h\delta\tau^{h\delta-1}\prod_{(r,c)\in D_{\mathrm{obs}}}\left(1-\tau^{hw_{r,c}}\right),\quad\quad\phi(\tau;h):=\log\Phi(\tau;h).

(23)

Further, define $d_{i}:=\sum_{k=1}^{K}(w_{x_{i,k}}-w_{x_{n+1,k}})$ for all $i\in[n+1]$.

Proposition 2.

For any fixed $n_{\mathrm{obs}}<n_{r}n_{c}$ , scaling parameter $h>0$ , and $i\in[n+1]$ ,

\displaystyle\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs};i})=\int_{0}^{1}\Phi(\tau;h)\cdot\eta_{i}(\tau;h)\,d\tau,

(24)

where, for any $\tau\in(0,1)$ ,

\displaystyle\eta_{i}(\tau;h):=\frac{\tau^{hd_{i}}(\delta+d_{i})}{\delta}\cdot\left(\prod\limits_{k=1}^{K}\frac{1-\tau^{hw_{x_{n+1,k}}}}{1-\tau^{hw_{x_{i,k}}}}\right).

(25)

Note that $\eta_{n+1}(\tau;h)=1$ for all $\tau$, because $d_{n+1}=0$ and every factor in the product equals one; in that case, Proposition 2 recovers a classical result by \citet{wallenius1963biased}. See the proof of Proposition 2 in Appendix A4.5 for further details. Furthermore, the function $\eta_{i}$ in (25) is a product of only $K$ simple functions of $\tau$, and therefore it is straightforward to evaluate even for large matrices.
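A direct evaluation of (25) takes only a few lines. The sketch below is illustrative; the function name `eta` and the numerical inputs are hypothetical, and the sampling weights of the two groups are assumed to be available as plain arrays.

```python
import numpy as np

def eta(tau, h, delta, w_cal_group, w_test_group):
    """Evaluate eta_i(tau; h) in (25) from the sampling weights of the K entries
    in calibration group i and of the K entries in the test group."""
    w_cal = np.asarray(w_cal_group, dtype=float)
    w_test = np.asarray(w_test_group, dtype=float)
    d_i = np.sum(w_cal - w_test)
    ratio = np.prod((1.0 - tau ** (h * w_test)) / (1.0 - tau ** (h * w_cal)))
    return float(tau ** (h * d_i) * (delta + d_i) / delta * ratio)

# Identical weights in both groups give eta = 1, as for i = n + 1.
same = eta(0.5, h=2.0, delta=10.0, w_cal_group=[0.4, 0.7], w_test_group=[0.4, 0.7])
# Heavier calibration weights than test weights give a value below one here.
diff = eta(0.5, h=2.0, delta=10.0, w_cal_group=[1.0, 1.0], w_test_group=[0.5, 0.5])
```

The cost is $\mathcal{O}(K)$ per group and does not involve the matrix dimensions.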

Proposition 2 provides the foundation for evaluating the conformalization weights $p_{i}$ in (19). The remaining difficulty is that (24) has no analytical solution. Fortunately, the function $\Phi(\tau;h)$ satisfies some properties that make it feasible to approximate this integral accurately.

Lemma 4.

If $h>1/\delta$, the function $\Phi(\tau;h)$ defined in (23) has a unique stationary point with respect to $\tau$ at some value $\tau_{h}\in(0,1)$. Further, $\tau_{h}$ is a global maximum.

See Figure 2 for a visualization of $\Phi(\tau;h)$ and $\eta_{i}(\tau;h)$, in two examples where the sampling weights $\bm{w}$ in (1) are independent and uniformly distributed on $[0,1]$. These results show that $\Phi(\tau;h)$ becomes increasingly concentrated around its unique maximum for larger sample sizes, while $\eta_{i}(\tau;h)$ remains relatively smooth (or flat) at that point. Therefore, it makes sense to approximate this integral through a careful extension of Laplace’s method \citep{laplace1774memoire}. This is explained below.

The first step to approximate the integral in (24) with a generalized Laplace method (justified later in Section 4.2) is to modify the integrand in such a way as to move the peak away from the integration boundary. To this end, define $\tau_{h}$ as

\displaystyle\tau_{h}:=\operatorname*{arg\,max}_{\tau\in(0,1)}\Phi(\tau;h),

(26)

and recall that $h>0$ is a parameter that we are free to choose. Therefore, we will tune $h$ in such a way as to center the peak within the integration domain; that is, we pick a value $h$ such that $\tau_{h}=1/2$ . Fortunately, Lemma 4 tells us that the function $\Phi(\tau;h)$ has a unique global maximum at $\tau_{h}\in(0,1)$ when $h>1/\delta$ , and a suitable value of $h>1/\delta$ such that $\tau_{h}=1/2$ can be found by applying the Newton-Raphson iterative algorithm; see Appendix A2.2 for further details.
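To make the condition $\tau_{h}=1/2$ concrete, one can differentiate $\phi(\tau;h)$ in (23) with respect to $\tau$. The short derivation below is included only as an illustration and is not taken from Appendix A2.2, which may organize the computation differently.

\displaystyle\frac{\partial\phi(\tau;h)}{\partial\tau}=\frac{h\delta-1}{\tau}-\sum_{(r,c)\in D_{\mathrm{obs}}}\frac{hw_{r,c}\,\tau^{hw_{r,c}-1}}{1-\tau^{hw_{r,c}}}.

Setting this derivative to zero at $\tau=1/2$, where $\tau^{hw-1}/(1-\tau^{hw})=2/(2^{hw}-1)$, gives the scalar equation

\displaystyle h\delta-1=\sum_{(r,c)\in D_{\mathrm{obs}}}\frac{hw_{r,c}}{2^{hw_{r,c}}-1}.

The left-hand side increases linearly in $h$, while each term on the right-hand side is decreasing in $h$, because $x\mapsto x/(2^{x}-1)$ is decreasing for $x>0$. Since the left-hand side vanishes at $h=1/\delta$ and the right-hand side is positive there, the equation has a unique root, and that root satisfies $h>1/\delta$.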

Having fixed $h$ such that $\tau_{h}=1/2$, a Laplace approximation can be obtained as follows. The key intuition is that, as the number of observations grows, the peak of the function $\Phi(\tau;h)$ increasingly dominates the integral. In particular, a second-order Taylor expansion shows that the integral is primarily determined by the value of $\eta_{i}(\tau;h)\cdot\Phi(\tau;h)$ at $\tau=\tau_{h}$ and by the curvature of $\log\Phi(\tau;h)$ at the peak, namely $\phi^{\prime\prime}(\tau_{h};h)$. This leads to the following approximation,

\displaystyle\begin{split}\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs};i})\approx\eta_{i}(\tau_{h};h)\cdot\Phi(\tau_{h};h)\sqrt{\frac{-2\pi}{\phi^{\prime\prime}(\tau_{h};h)}}\approx\eta_{i}(\tau_{h};h)\cdot\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs}}).\end{split}

(27)

As explained below, this approximation becomes very accurate in the large-sample limit, and it is useful because it allows us to approximate the ratio $\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs};i})/\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs}})$ with a quantity, $\eta_{i}(\tau_{h};h)$, that is straightforward to calculate. For example, if the sampling weights $\bm{w}$ in (1) are all equal, then $d_{i}=0$ and every factor in the product in (25) equals one, so that $\eta_{i}(\tau;h)\equiv 1$ for all $i\in[n+1]$ and any $\tau\in(0,1)$.
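A quick numerical sanity check of the Laplace step in (27) is possible in the constant-weight case, where (under uniform sampling without replacement) the probability of any fixed set of $n_{\mathrm{obs}}$ observed entries out of $N=n_{r}n_{c}$ is $1/\binom{N}{n_{\mathrm{obs}}}$. The sketch below makes several assumptions: the helper names `solve_h` and `log_laplace` are hypothetical, bisection is used in place of Newton-Raphson for simplicity, and the expression for $\phi^{\prime\prime}$ is obtained by differentiating (23) twice.

```python
import math
import numpy as np

def solve_h(w_obs, delta, iters=200):
    """Find h with tau_h = 1/2 by bisection on the stationarity condition
    h*delta - 1 = sum_j h*w_j / (2**(h*w_j) - 1), whose two sides cross once."""
    g = lambda h: h * delta - 1.0 - np.sum(h * w_obs / np.expm1(h * w_obs * math.log(2.0)))
    lo, hi = 1.0 / delta, 2.0 / delta            # g(lo) < 0 by construction
    while g(hi) < 0:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def log_laplace(w_obs, delta):
    """Log of Phi(tau_h; h) * sqrt(-2*pi / phi''(tau_h; h)) with tau_h = 1/2."""
    tau = 0.5
    h = solve_h(w_obs, delta)
    a = h * w_obs
    t = tau ** a
    phi = math.log(h * delta) + (h * delta - 1.0) * math.log(tau) + np.sum(np.log1p(-t))
    # Second derivative of phi with respect to tau, evaluated at tau = 1/2.
    phi2 = -(h * delta - 1.0) / tau**2 - np.sum(a * t / tau**2 * (a - 1.0 + t) / (1.0 - t) ** 2)
    return float(phi + 0.5 * math.log(2.0 * math.pi / -phi2))

# Constant weights w = 1 on a matrix with N = 900 entries, 300 of them observed.
N, n_obs = 900, 300
approx = log_laplace(np.ones(n_obs), delta=float(N - n_obs))
exact = -(math.lgamma(N + 1) - math.lgamma(n_obs + 1) - math.lgamma(N - n_obs + 1))
```

Working on the log scale avoids underflow, since the probabilities involved are astronomically small even for moderately sized matrices.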

By combining (27) with (19), it follows that, for each $i\in[n+1]$ , the conformalization weight $p_{i}$ can be approximately rewritten in the large-sample limit as

\displaystyle p_{i}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1})\approx\frac{\bar{p}_{i}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1})}{\sum_{j=1}^{n+1}\bar{p}_{j}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1})},

(28)

with the un-normalized weight $\bar{p}_{i}$ given by:

\displaystyle\begin{split}\bar{p}_{i}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1}% )&=\eta_{i}(\tau_{h})\cdot\left(\widetilde{w}^{*}_{x_{i,1}}\prod\limits_{k=2}^% {K}\widetilde{w}^{*}_{x_{i,k}}\right)\\ &\quad\cdot\left[\cfrac{\mbinom{n^{c_{i}}_{\mathrm{obs}}}{\bar{n}^{c_{i}}_{% \mathrm{obs}}}}{\mbinom{n^{c_{i}}_{\mathrm{obs}}-K}{\bar{n}^{c_{i}}_{\mathrm{% obs}}-K}}\cdot\cfrac{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}}{\bar{n}^{c_{n+1}}_{% \mathrm{obs}}}}{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}+K}{\bar{n}^{c_{n+1}}_{% \mathrm{obs}}+K}}\cdot\prod\limits_{k=1}^{K-1}\frac{\bar{n}^{c_{i}}_{\mathrm{% obs}}-k}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K-k}\right]^{\mathbb{I}\left[c_{i}% \neq c_{n+1}\right]}.\end{split}

(29)

This finally makes Algorithm 1 practical because evaluating $\bar{p}_{i}$ in (29) only involves simple arithmetic operations and can be carried out very efficiently, as explained in Section 4.3.

4.2 Consistency of the Generalized Laplace Approximation

It is important to emphasize that our approximation in (27) is not obtained from a standard application of the Laplace method, since the latter is typically restricted to handling integrals of simpler functions; see Appendix A1.4. Yet, the Taylor approximation ideas underlying the Laplace method are versatile enough to be extended to our setting, as demonstrated by the following theorem. This novel result provides a rigorous justification for the generalized Laplace approximation in (27). For simplicity, but without much loss of generality, this theorem relies on some additional technical assumptions, which will be justified in our context towards the end of this section. This result is presented informally here for simplicity, but a formal statement can be found in Appendix A4.6, along with its proof.

Theorem 3 (Informal statement of Theorem A4).

Let $\{v_{i}\}_{i=1}^{\infty}$ denote a sequence of i.i.d. random variables from some distribution $F$ supported on $(0,1)$ , and $\{x_{i}\}_{i=1}^{\infty}$ a sequence of independent Bernoulli random variables, with $x_{i}\sim\text{Bernoulli}(v_{i})$ . Define $\delta_{m}=\sum_{i=1}^{m}(1-x_{i})v_{i}$ and let

\displaystyle\Phi_{m}(\tau):=h_{m}\delta_{m}\tau^{h_{m}\delta_{m}-1}\prod_{i=1}^{m}\left(1-\tau^{h_{m}v_{i}}\right)^{x_{i}},

(30)

where $h_{m}$ is such that $\tau_{h}:=\operatorname*{arg\,max}_{\tau\in(0,1)}\Phi_{m}(\tau)=1/2$ . Define also $\phi_{m}(\tau)\coloneqq\log\Phi_{m}(\tau)$ . Then, for a sequence of functions $f_{m}$ bounded away from 0 and satisfying certain smoothness conditions,

\displaystyle\lim_{m\to\infty}\frac{\int_{0}^{1}f_{m}(\tau)\Phi_{m}(\tau)\,d\tau}{f_{m}\left(\tau_{h}\right)\cdot\Phi_{m}\left(\tau_{h}\right)\sqrt{-2\pi/\phi_{m}^{\prime\prime}(1/2)}}=1\quad\text{ almost surely.}

(31)

To relate this result to the Laplace approximations described in Section 4.1, let us compare the function $\Phi(\tau;h)$ in (23) with the function $\Phi_{m}(\tau)$ in (30). Given a mapping from the sequence to the matrix entries $\sigma:[n_{r}n_{c}]\to[n_{r}]\times[n_{c}]$, we can express $\Phi(\tau;h)$ as:

\displaystyle\Phi(\tau;h)=h\delta\tau^{h\delta-1}\prod_{(r,c)\in D_{\mathrm{obs}}}\left(1-\tau^{hw_{r,c}}\right)=h\delta\tau^{h\delta-1}\prod_{i=1}^{n_{r}n_{c}}\left(1-\tau^{hw_{\sigma(i)}}\right)^{\mathbbm{1}\{\sigma(i)\in D_{\mathrm{obs}}\}}.

(32)

Therefore, the discrepancy between $\Phi(\tau;h)$ and $\Phi_{n_{r}n_{c}}(\tau)$ can be traced to the different sampling models describing the distributions of our matrix observations and the $x_{i}$ variables in Theorem 3. In Theorem 3, the observations follow independent Bernoulli distributions, whereas the matrix entries in our model (3) are sampled without replacement. These views can be reconciled as follows. Sampling without replacement is a natural modelling choice for the simultaneous inference problem studied in this paper, but it would make the proof of Theorem A4 too complicated. Nevertheless, these two models are qualitatively consistent. Suppose the sampling weights $\bm{w}$ in (1) are constant; then, in that special case our model corresponds to that of Theorem 3 with $v_{i}\equiv v$ , for some constant $v\in(0,1)$ , after conditioning on the observed number of entries $n_{\text{obs}}$ .

4.3 Computational Complexity

The SCMC method described in this paper can be implemented efficiently and is able to handle completion tasks involving large matrices. Its practicality is demonstrated in this section, which summarizes the results of an analysis of the computational complexity of different components of Algorithm 1. We refer to Appendix A2.3 for the details behind this analysis and an explanation of the underlying computational shortcuts with which all redundant operations are streamlined.

In summary, the cost of producing a joint confidence region for a test group $\bm{X}^{*}$ of size $K$ using Algorithm 1 is $\mathcal{O}(T+n_{r}n_{c}+n(K+\log n))$ , where $T$ denotes the fixed cost of training the black-box matrix completion model based on $\bm{M}_{\bm{X}_{\mathrm{obs}}}$ and $n$ is the number of calibration groups. Further, it is possible to recycle redundant calculations when constructing simultaneous confidence regions for $m$ distinct test groups $\bm{X}^{*}$ , as explained in Appendix A2.3. Therefore, the overall cost of obtaining $m$ distinct confidence regions for $m$ different groups is only $\mathcal{O}(T+n_{r}n_{c}+n(mK+\log n))$ . See Table 1 for a summary of these results.

Table 1: Computational analysis of different components of SCMC.

Module	Cost for one test group	Cost for $m$ test groups
Overall (Algorithm 1)	$\mathcal{O}(T+n_{r}n_{c}+n(K+\log n))$	$\mathcal{O}(T+n_{r}n_{c}+n(mK+\log n))$
Algorithm 2	$\mathcal{O}(n_{c}n_{r}+nK)$	$\mathcal{O}(n_{c}n_{r}+nK)$

5 Empirical Demonstrations

We apply SCMC to simulated and real data, comparing its performance to those of the unadjusted and Bonferroni baselines. This section is organized as follows. Section 5.1 describes experiments based on simulated data, with Section 5.1.1 focusing on (known) uniform sampling weights, and Section 5.1.2 allowing the sampling weights for the observed data to be heterogeneous (although still known exactly). Section 5.2 describes more realistic experiments involving the MovieLens data, considering estimated sampling weights. The results of additional experiments are presented in the Appendices. Appendix A3.1 describes experiments with synthetic data involving heterogeneous test weights. Appendix A3.2 investigates the tightness of the theoretical coverage upper bounds derived in Section 3.5.

5.1 Numerical Experiments with Synthetic Data

5.1.1 Uniform Sampling Weights

We begin with a simple scenario in which the observation pattern in (3) is completely random and the test weights in (4) are uniform: $w_{r,c}=w^{*}_{r,c}=1$ for all $(r,c)\in[n_{r}]\times[n_{c}]$. A matrix $\bm{M}$ with $n_{r}=200$ rows and $n_{c}=200$ columns is generated based on a “signal plus noise” model that exhibits both a low-rank structure and column-wise dependencies. (For example, in the Netflix data set, users may tend to agree on the quality of certain movies, leading to positive dependency among the columns of the rating matrix.) This design is motivated by the intuition that column-wise dependencies make our simultaneous inference task especially challenging, helping us better understand the settings under which our method brings larger practical advantages relative to the baselines.

The ground truth matrix $\bm{M}$ is obtained as $\bm{M}=0.5\cdot\bar{\bm{M}}+0.5\cdot\bm{N}$ , where $\bar{\bm{M}}\in\mathbb{R}^{n_{r}\times n_{c}}$ is low-rank while $\bm{N}\in\mathbb{R}^{n_{r}\times n_{c}}$ is a noise matrix exhibiting column-wise dependencies whose strength can be tuned as a control parameter, as detailed below.

$\bar{\bm{M}}\in\mathbb{R}^{n_{r}\times n_{c}}$ is given by a random factorization model with rank $l=5$ ; i.e., $\bar{\bm{M}}=\bar{\bm{U}}(\bar{\bm{V}})^{\top}$ , where $\bar{\bm{U}}=(U_{r,c})_{r\in[n_{r}],c\in[l]}$ and $\bar{\bm{V}}=(V_{r^{\prime},c^{\prime}})_{r^{\prime}\in[n_{c}],c^{\prime}\in[l]}$ are such that

$$U_{r,c}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1),\quad V_{r^{\prime},c^{\prime}}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1).\tag{33}$$

$\bm{N}=0.1\cdot\bm{\epsilon}+0.9\cdot\bm{1}\widetilde{\bm{\epsilon}}^{\top}$ , where $\bm{1}\in\mathbb{R}^{n_{r}\times 1}$ is a vector of ones, $\bm{\epsilon}\in\mathbb{R}^{n_{r}\times n_{c}}$ has i.i.d. standard normal components, and $\widetilde{\bm{\epsilon}}\in\mathbb{R}^{n_{c}\times 1}$ is such that, for all $c\in[n_{c}]$ ,

$$\widetilde{\epsilon}_{c}\overset{\mathrm{i.i.d.}}{\sim}\left(1-\gamma\right)\cdot\mathcal{N}(0,1)+\gamma\cdot\mathcal{N}(\mu,0.1),\tag{34}$$

for suitable parameters $\gamma\in(0,1)$ and $\mu\in\mathbb{R}$ . Thus, $\bm{1}\widetilde{\bm{\epsilon}}^{\top}\in\mathbb{R}^{n_{r}\times n_{c}}$ has constant columns, and larger values of $\mu\in\mathbb{R}$ result in stronger column-wise dependencies compared to the background i.i.d. noise described by the matrix $\bm{\epsilon}$ . In the following, the value of $\mu$ is varied as a control parameter, while we fix $\gamma=\alpha/2$ .
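The data-generating model above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `generate_matrix`, its arguments, and the default value of `mu` are hypothetical, and the second parameter of $\mathcal{N}(\mu,0.1)$ in (34) is assumed here to be a variance.

```python
import numpy as np

def generate_matrix(n_r=200, n_c=200, rank=5, mu=10.0, alpha=0.1, seed=0):
    """Sketch of M = 0.5 * M_bar + 0.5 * N, following (33) and (34)."""
    rng = np.random.default_rng(seed)
    gamma = alpha / 2  # mixture proportion, fixed to alpha / 2 as in the text

    # Low-rank signal (33): M_bar = U V^T with i.i.d. standard normal factors.
    U = rng.standard_normal((n_r, rank))
    V = rng.standard_normal((n_c, rank))
    M_bar = U @ V.T

    # Column effects (34): two-component Gaussian mixture, one draw per column.
    # Assumption: the 0.1 in N(mu, 0.1) is interpreted as a variance.
    shifted = rng.random(n_c) < gamma
    eps_tilde = np.where(
        shifted,
        rng.normal(mu, np.sqrt(0.1), size=n_c),
        rng.standard_normal(n_c),
    )

    # Noise: i.i.d. background plus the constant-column term 1 * eps_tilde^T.
    eps = rng.standard_normal((n_r, n_c))
    N = 0.1 * eps + 0.9 * np.outer(np.ones(n_r), eps_tilde)

    return 0.5 * M_bar + 0.5 * N, M_bar, N
```

Increasing `mu` in this sketch shifts a fraction of roughly `gamma` of the columns by a common amount, which is one way to see how the column-wise dependencies grow stronger relative to the background noise.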

For a given ground truth matrix $\bm{M}$ generated as described above, we observe $n_{\mathrm{obs}}=8000$ entries, randomly sampled according to the model defined in (1) with $w_{r,c}=1$ for all $(r,c)\in[n_{r}]\times[n_{c}]$. Let $\mathcal{D}_{\mathrm{obs}}$ denote the unordered collection of these observed indices. Then, 100 test groups $\bm{X}^{*}$ of size $K$, where $K\geq 2$ is a control parameter, are sampled without replacement from $\mathcal{D}_{\mathrm{miss}}=[n_{r}]\times[n_{c}]\setminus\mathcal{D}_{\mathrm{obs}}$, according to the model defined in (3) with $w^{*}_{r,c}=1$ for all $(r,c)\in\mathcal{D}_{\mathrm{miss}}$.

The simultaneous confidence region for a test group $\bm{X}^{*}$ is constructed by applying Algorithm 1 with $n=\min\{1000,\lfloor\xi_{\text{obs}}/2\rfloor\}$ calibration groups, where $\xi_{\text{obs}}$, defined in (12), denotes the maximum possible number of such groups. Note that the matrix completion algorithm leveraged by our method can thus be trained using $n_{\mathrm{train}}=n_{\mathrm{obs}}-Kn$ observed entries of $\bm{M}$, indexed by $\mathcal{D}_{\mathrm{train}}$.

While SCMC can leverage any matrix completion algorithm producing point predictions, here we employ the alternating least squares approach of \citethu_cf_2008, which is designed to recover low-rank signals. For simplicity, we apply this algorithm with a hypothesized rank of 5, which matches the true rank of $\bar{\bm{M}}$. It is worth repeating, however, that the validity of the SCMC confidence regions is independent of both the true $\bm{M}$ and the matrix completion model.

Our method is compared to the two baselines introduced in Section 1.2. Recall that the first one is a naive unadjusted heuristic that ignores the multiple testing aspect of our simultaneous inference problem and essentially applies Algorithm 1 with $K=1$ repeatedly for every individual entry in $\bm{X}^{*}$ . This ensures valid coverage for each entry in $\bm{X}^{*}$ separately, but does not guarantee simultaneous coverage for groups with $K\geq 2$ . By contrast, the second Bonferroni baseline relies on a crude and overly conservative multiple testing adjustment to achieve simultaneous coverage, essentially applying Algorithm 1 with $K=1$ at level $\alpha/K$ instead of $\alpha$ . Both baseline approaches are applied using the same matrix completion model leveraged by our method, and their predictions are calibrated using a calibration set containing $Kn$ observed matrix entries.

Figure 3 summarizes the results of these experiments as a function of $K$ and for different values of the noise parameter $\mu$ . Each method is assessed in terms of the average width of the output confidence regions, at level $\alpha=10\%$ , and of the empirical simultaneous coverage for the 100 test groups. All results are averaged over 300 independent experiments. Our method always achieves the desired 90% simultaneous coverage, as predicted by the theory, while the unadjusted baseline becomes increasingly anti-conservative for larger values of $K$ . Further, our method leads to more informative confidence regions compared to the Bonferroni baseline, which becomes increasingly conservative with larger values of $K$ and $\mu$ . See Figure A10 in Appendix A3 for a different view of these results, highlighting the behavior of all methods as a function of $\mu$ , for different values of $K$ .

5.1.2 Heterogeneous Sampling Weights

Moving beyond the setting of data missing completely at random, we now consider similar experiments in which the sampling weights $\bm{w}$ of the observation model (3) are heterogeneous, while the matrix $\bm{M}$ has a simple low-rank structure. In particular, $\bm{M}$ is generated according to the random factorization model defined in (33), so that $\bm{M}=\bar{\bm{U}}(\bar{\bm{V}})^{\top}$ with rank $l=8$ and $n_{c}=n_{r}=400$. The sampling weights $\bm{w}$ are chosen so as to introduce an interesting spatial missingness pattern, with some rows and columns being more densely observed than others. Precisely, we set

$$w_{r,c}=(n_{r}(c-1)+r)^{s},\qquad\forall r\in[n_{r}],\;c\in[n_{c}],\tag{35}$$

where $s\geq 0$ controls the degree of heterogeneity. If $s=0$, the missingness is uniform, whereas larger values of $s$ result in columns with higher indices being more densely observed.
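To make the effect of $s$ concrete, the weights in (35) and a weighted draw of observed entries can be sketched as follows. The helper names `sampling_weights` and `sample_observed` are hypothetical. NumPy's weighted sampling without replacement is used here as a stand-in for the observation model, which is defined elsewhere in the paper and may differ from it in its details.

```python
import numpy as np

def sampling_weights(n_r, n_c, s):
    """Weights w[r, c] = (n_r * (c - 1) + r) ** s from (35), with 1-based r and c."""
    r = np.arange(1, n_r + 1, dtype=float)[:, None]
    c = np.arange(1, n_c + 1, dtype=float)[None, :]
    return (n_r * (c - 1) + r) ** s

def sample_observed(w, n_obs, rng):
    """Draw n_obs distinct matrix entries with probabilities proportional to w.

    Assumption: NumPy's weighted sampling without replacement approximates
    the paper's observation model; it is not taken from the authors' code.
    """
    n_r, n_c = w.shape
    p = (w / w.sum()).ravel()
    flat = rng.choice(n_r * n_c, size=n_obs, replace=False, p=p)
    rows, cols = np.unravel_index(flat, (n_r, n_c))
    return rows, cols

# Example with the dimensions used in this section and an illustrative s.
rng = np.random.default_rng(0)
rows, cols = sample_observed(sampling_weights(400, 400, s=1.0), 48_000, rng)
print("mean observed column index (0-based):", cols.mean())
```

With `s=0` all weights equal one and the draw is uniform; with larger `s` the mean observed column index moves toward the right edge of the matrix.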

Based on this model, we randomly sample without replacement $n_{\mathrm{obs}}=48,000$ matrix entries (from a total of 160,000) and then apply Algorithm 1 similarly to the previous section, using $n=\min\{2000,\lfloor\xi_{\text{obs}}/2\rfloor\}$ calibration groups and allocating the remaining $n_{\mathrm{train}}=n_{\mathrm{obs}}-Kn$ observations to train the matrix completion model. For the latter, we rely on the same alternating least squares algorithm \citephu_cf_2008 as in the previous section, with hypothesized rank 8. The two baseline approaches are applied analogously to Section 5.1.1.

All methods are evaluated on a test set of 100 test groups sampled without replacement according to the model defined in (3), with uniform weights $w^{*}_{r,c}=1$ for all $(r,c)\in[400]\times[400]$ . The $\alpha$ level is $10\%$ . All results are averaged over 300 independent experiments.

Figure 4 reports on the results of these experiments as a function of the parameter $s$ in (35), for different values of $K$ . As predicted by the theory, our method always achieves valid simultaneous coverage, unlike the unadjusted baseline. Further, our method produces relatively informative confidence regions compared to the Bonferroni approach, as the latter becomes more conservative for larger values of $s$ . This can be understood as follows. As the matrix completion model naturally finds it easier to recover more accurately the missing entries belonging to more densely observed columns, the heterogeneous sampling model tends to introduce spatial dependencies in the residual matrix $\widehat{\bm{M}}-\bm{M}$ . These dependencies, which intuitively become stronger for larger values of $s$ , make our simultaneous inference task intrinsically more challenging, resulting in wider confidence regions for all methods, but have a disproportionate adverse effect on the Bonferroni approach (which implicitly but incorrectly assumes the miscoverage events corresponding to different entries to be mutually independent). See Figure A11 in Appendix A3 for a different view of these results, highlighting the behavior of all methods as a function of $K$ , for different values of $s$ .

5.2 Numerical Experiments with MovieLens Data

We now apply our method to the MovieLens 100K \citepmovielens100k data and compare its performance to those of the unadjusted and Bonferroni baselines. This data set contains 100,000 ratings (on a scale from 1 to 5) provided by 943 users for 1682 movies. Therefore, approximately 94% of all possible ratings are missing. To reduce the memory requirements of the matrix completion algorithm utilized to compute $\widehat{\bm{M}}$ , we reduce the matrix size by half, focusing on a smaller rating matrix $\bm{M}\in\mathbb{R}^{800\times 1000}$ , corresponding to a random subset of 800 users and 1000 movies.

As usual, we denote the set of indices for the observed matrix entries as $\mathcal{D}_{\text{obs}}$ and its complement as $\mathcal{D}_{\text{miss}}=[800]\times[1000]\setminus\mathcal{D}_{\text{obs}}$ . Since the true sampling weights $\bm{w}$ are unknown in this application, we compute estimated weights $\widehat{\bm{w}}$ with a data-driven approach inspired by \citetgui2023conformalized, as described in Appendix A2.4. Algorithm 1 is then applied with $\widehat{\bm{w}}$ instead of $\bm{w}$ , to construct simultaneous confidence regions for the unobserved ratings of 100 random test groups $\bm{X}^{*}$ . We utilize $n=\min\{1000,\lfloor\xi_{\text{obs}}/2\rfloor\}$ calibration groups and vary the group size $K$ as a control parameter. The test groups are randomly sampled without replacement from $\mathcal{D}_{\text{miss}}$ according to the model defined in (3), with uniform weights $\bm{w}^{*}$ . The matrix completion algorithm is trained as described in the previous sections, applying the alternating least squares approach of \citethu_cf_2008 based on $n_{\mathrm{train}}=n_{\mathrm{obs}}-Kn$ observations. The hypothesized rank of $\bm{M}$ utilized by this model to obtain $\widehat{\bm{M}}$ is varied as an additional control parameter. As before, the baseline approaches are also applied based on the same matrix completion model, to facilitate the comparison with our method.

Figure 1, previewed earlier in Section 1.2, reports on the results of these experiments as a function of the group size $K$ and of the hypothesized rank utilized by the matrix completion model. The confidence regions are assessed based on their average width alone, since it is impossible to measure the empirical coverage given that the ground truth is unknown. The results show that SCMC produces more informative (narrower) confidence regions compared to the Bonferroni approach, consistently with the results of our previous experiments based on synthetic data. Figure 1 displays only the performance of the Bonferroni baseline because the unadjusted baseline is not intended to provide valid simultaneous coverage, making it less suitable for comparisons lacking a verifiable ground truth. Nevertheless, Figure A12 in Appendix A3.5 includes a comparison with both baselines, demonstrating that our simultaneous confidence regions are not much wider than those produced by the unadjusted baseline. Further, our method’s higher reliability compared to the unadjusted baseline is supported by the following additional experiments, conducted using the same data but under a more artificial setting in which the ground truth is known.

To evaluate the coverage on the MovieLens data, we carry out similar but more closely controlled experiments in which the test groups are drawn not from $\mathcal{D}_{\text{miss}}$ (for which the ground truth is unknown) but from a hold-out subset $\mathcal{D}_{\text{hout}}$ containing 20% of the observed matrix indices in $\mathcal{D}_{\text{obs}}$ . Algorithm 1 is then applied to construct confidence regions for the unobserved ratings of 100 random test groups $\bm{X}^{*}$ sampled from $\mathcal{D}_{\text{hout}}$ , proceeding as described before but utilizing only the observed data in $\mathcal{D}_{\text{obs}}\setminus\mathcal{D}_{\text{hout}}$ instead of $\mathcal{D}_{\text{obs}}$ .

Since the estimation of $\widehat{\bm{w}}$ acknowledges the existence of an unobserved set of entries $\mathcal{D}_{\text{miss}}$ , in this setting our method is essentially aiming to achieve simultaneous coverage for test groups $\bm{X}^{*}$ sampled from $\mathcal{D}_{\text{miss}}\setminus\mathcal{D}_{\text{hout}}$ instead of $\mathcal{D}_{\text{hout}}$ . Of course, we can only evaluate the empirical coverage for test groups sampled from $\mathcal{D}_{\text{hout}}$ , and this is why these experiments are useful to understand the robustness of our inferences to possible distribution shifts between $\mathcal{D}_{\text{miss}}\setminus\mathcal{D}_{\text{hout}}$ and $\mathcal{D}_{\text{hout}}$ .

Figure 5 compares the performances of each method under this hybrid setting, focusing on test groups sampled from the hold-out data in $\mathcal{D}_{\text{hout}}$ . These results are reported as a function of $K$ , for different values of the hypothesized rank in the matrix completion model. Consistently with the previous results, our method leads to more informative inferences compared to the Bonferroni approach, and it nearly achieves the desired 90% simultaneous coverage for the test groups sampled from $\mathcal{D}_{\text{hout}}$ , even though in theory one would only expect it to have valid coverage on average over all test groups sampled from $\mathcal{D}_{\text{miss}}\setminus\mathcal{D}_{\text{hout}}$ . The nearly valid coverage also demonstrates the robustness of our method towards possible misspecification of the sampling weights.

6 Discussion

This paper introduces a principled and effective method for simultaneous conformal inference in matrix completion. Although primarily motivated by the challenges of uncertainty estimation for group recommender systems, our approach is sufficiently modular and flexible to be potentially relevant beyond our initial focus. In particular, the core idea of leveraging a structured calibration set to approximately replicate the patterns expected at test time could be adapted to obtain joint inferences beyond the task of predicting multiple user ratings for the same product. Moreover, our newly introduced notion of leave-one-out exchangeability and the related conformalization techniques extend the existing framework for conformal inference under covariate shift proposed by \citettibshirani-covariate-shift-2019, and these advances may be useful in other applications of conformal inference.

A related direction for future research may involve extending our method to accommodate the jackknife+ framework of \citetbarber_cv+_2021. The data-splitting approach adopted in this paper may not be fully satisfactory in situations where the observations are very limited. In fact, a scarcity of training data may result in less accurate point estimates, thereby reducing the informativeness of our inferences, and a scarcity of calibration data generally leads to more unstable outputs. In contrast, cross-validation can make a more efficient use of the limited data, although at the price of increased theoretical challenges and more expensive computations.

Software Availability

A software package implementing the methods and numerical experiments described in this paper is available at https://github.com/ZiyiLiang/simultaneous-matrix-completion.

Acknowledgements

M. S. was partly supported by NSF grant DMS 2210637.

\printbibliography

Appendix A1 Additional Technical Background

A1.1 Review of Individual-Level Conformalized Matrix Completion

This section reviews the conformalized matrix completion method proposed by \citetgui2023conformalized, which is designed to produce confidence intervals for one missing entry at a time.

The setup of \citetgui2023conformalized is similar to ours as they also treat $\bm{M}$ as fixed and assume the randomness in the matrix completion problem comes from the observation process or, equivalently, the missingness mechanism. However, their modeling choices do not match exactly with ours. Specifically, they assume that each matrix entry in row $r$ and column $c$ is independently observed with some (known) probability $p_{r,c}$ , which roughly corresponds to our sampling weights $w_{r,c}$ in (1); i.e., $\mathcal{D}_{\mathrm{obs}}:=\{(r,c)\in[n_{r}]\times[n_{c}]:Z_{r,c}=1\}$ , where

$$Z_{r,c}=\mathbb{I}\left[(r,c)\text{ is observed}\right]\overset{\mathrm{ind.}}{\sim}\mathrm{Bernoulli}(p_{r,c}),\qquad\forall(r,c)\in[n_{r}]\times[n_{c}].\tag{A36}$$

Therefore, the total number of observed entries is a random variable in \citetgui2023conformalized, whereas we can allow $n_{\mathrm{obs}}$ to be fixed within the sampling without replacement model defined in (1). As shown in this paper, our modeling choice is natural when aiming to construct group-level simultaneous inferences. The model assumed by \citetgui2023conformalized also differs from ours in its requirement that all sampling weights must be strictly positive, i.e., $p_{r,c}>0$ for all $(r,c)\in[n_{r}]\times[n_{c}]$ in (A36). Further, the approach of \citetgui2023conformalized differs from ours in that they assume the missing matrix index of interest, namely $\mathcal{I}^{*}\in[n_{r}]\times[n_{c}]$, to be sampled uniformly at random from $\mathcal{D}_{\mathrm{miss}}$, that is, $\mathcal{I}^{*}\sim\mathrm{Uniform}(\mathcal{D}_{\mathrm{miss}})$, where $\mathcal{D}_{\mathrm{miss}}:=[n_{r}]\times[n_{c}]\setminus\mathcal{D}_{\mathrm{obs}}$. By contrast, our sampling model for the test groups, defined in (3), can accommodate heterogeneous weights $\bm{w}^{*}$.

The method proposed by \citetgui2023conformalized constructs conformal confidence intervals for individual missing entries as follows. First, $\mathcal{D}_{\mathrm{obs}}=\{(r,c)\in[n_{r}]\times[n_{c}]:Z_{r,c}=1\}$ is partitioned into a training set $\mathcal{D}_{\mathrm{train}}$ and a disjoint calibration set $\mathcal{D}_{\mathrm{cal}}$ by randomly sampling $\widetilde{Z}_{r,c}\sim\mathrm{Bernoulli}(q)$ independently for all $(r,c)\in[n_{r}]\times[n_{c}]$ , for some fixed parameter $q\in(0,1)$ , and then defining

$$\mathcal{D}_{\mathrm{train}}\coloneqq\{(r,c)\in\mathcal{D}_{\mathrm{obs}}:\widetilde{Z}_{r,c}=1\},\qquad\mathcal{D}_{\mathrm{cal}}\coloneqq\{(r,c)\in\mathcal{D}_{\mathrm{obs}}:\widetilde{Z}_{r,c}=0\}.\tag{A37}$$

Similar to us, \citetgui2023conformalized utilize $\mathcal{D}_{\mathrm{train}}$ to compute $\widehat{\bm{M}}$ , leveraging any black-box algorithm, and then evaluate conformity scores on the calibration data as explained below.
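The Bernoulli observation model (A36) and the random split (A37) can be mimicked with boolean masks. The following is a minimal sketch with a hypothetical helper name, `bernoulli_split`, and arbitrary illustrative values for the probabilities; it is not taken from the implementation of \citetgui2023conformalized.

```python
import numpy as np

def bernoulli_split(p, q, rng):
    """Sample the observation mask as in (A36) and split it as in (A37).

    p : array of observation probabilities p[r, c]
    q : probability that an observed entry is assigned to the training set
    """
    Z = rng.random(p.shape) < p          # Z[r, c] ~ Bernoulli(p[r, c]), independently
    Z_tilde = rng.random(p.shape) < q    # independent Bernoulli(q) split indicators
    train = Z & Z_tilde                  # observed entries with Z_tilde = 1
    cal = Z & ~Z_tilde                   # observed entries with Z_tilde = 0
    return Z, train, cal

# Illustrative values only: a 100 x 80 matrix with p = 0.3 everywhere and q = 0.8.
rng = np.random.default_rng(0)
Z, train, cal = bernoulli_split(np.full((100, 80), 0.3), q=0.8, rng=rng)
print(Z.sum(), train.sum(), cal.sum())
```

Note that the sizes of both sets are random in this model, in contrast with the fixed $n_{\mathrm{obs}}$ allowed by the sampling without replacement model used in this paper.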

Let $\mathcal{C}(\cdot,\tau,\widehat{\bm{M}})$ denote a pre-specified prediction rule for a single matrix entry, which should be monotonically increasing in the tuning parameter $\tau\in[0,1]$ as explained in Section 3.1; for example, this could correspond to the prediction rule defined in (8) in the special case of $K=1$. For any $I\in[n_{r}]\times[n_{c}]$, let $S(I)$ denote the conformity score corresponding to $\mathcal{C}(\cdot,\tau,\widehat{\bm{M}})$, as in (9). Imagining that the calibration set contains the indices of $n$ matrix entries, namely $\mathcal{D}_{\mathrm{cal}}=\{\mathcal{I}_{1},\dots,\mathcal{I}_{n}\}$, the method of \citetgui2023conformalized evaluates $S_{i}=S(I_{i})$ for all $i\in[n]$ and then calibrates the tuning parameter $\tau$ by computing

$$\widehat{\tau}^{\mathrm{indiv}}_{\alpha,1}=Q\left(1-\alpha;\sum_{i=1}^{n}p_{i}^{\mathrm{indiv}}\delta_{S_{i}}+p_{n+1}^{\mathrm{indiv}}\delta_{\infty}\right),\tag{A38}$$

where the conformalization weights $p_{i}^{\mathrm{indiv}}$ are given by

$$p_{i}^{\mathrm{indiv}}(I_{1},\dots,I_{n+1})=\frac{R_{I_{i}}}{\sum_{j=1}^{n+1}R_{I_{j}}},\qquad R_{I_{i}}=\frac{1-p_{I_{i}}}{p_{I_{i}}},\tag{A39}$$

for all $i\in[n+1]$ , with the convention that $I_{n+1}=I^{*}$ . Finally, the confidence interval for the latent value of $\bm{M}$ at index $I^{*}$ is given by:

$$\widehat{\bm{C}}^{\mathrm{indiv}}({\mathcal{I}}^{*};\bm{M}_{\mathcal{D}_{\mathrm{obs}}},\alpha)=\mathcal{C}({\mathcal{I}}^{*},\widehat{\tau}^{\mathrm{indiv}}_{\alpha,1},\widehat{\bm{M}}).\tag{A40}$$
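The calibration step in (A38) and (A39) amounts to a weighted empirical quantile with an extra point mass at $+\infty$. A minimal sketch is given below, assuming the conformity scores and the observation probabilities are already available as arrays. The function name `weighted_conformal_threshold` is hypothetical, and the conformity scores themselves, defined in (9), are not computed here.

```python
import numpy as np

def weighted_conformal_threshold(scores, p_cal, p_test, alpha):
    """Weighted (1 - alpha) quantile of (A38), with weights given by (A39).

    scores : conformity scores S_1, ..., S_n of the calibration entries
    p_cal  : observation probabilities of the calibration entries
    p_test : observation probability of the test entry
    """
    scores = np.asarray(scores, dtype=float)
    p_all = np.append(np.asarray(p_cal, dtype=float), p_test)

    # Odds R_I = (1 - p_I) / p_I, normalized over the n + 1 indices.
    odds = (1.0 - p_all) / p_all
    weights = odds / odds.sum()

    # Smallest score whose cumulative calibration weight reaches 1 - alpha.
    order = np.argsort(scores)
    cum = np.cumsum(weights[:-1][order])
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    if idx >= len(scores):
        return np.inf  # the quantile falls on the point mass at +infinity
    return scores[order][idx]
```

When all probabilities are equal, the weights reduce to $1/(n+1)$ and the threshold coincides with the usual split-conformal quantile. An infinite threshold signals that the calibration set is too small, or the test weight too large, for the requested level.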

The following result establishes that the confidence intervals defined in (A40) have guaranteed marginal coverage at level $1-\alpha$ .

Proposition A3 (from \citetgui2023conformalized).

Suppose $\mathcal{D}_{\mathrm{obs}}$ is sampled according to (A36) and ${\mathcal{I}}^{*}\sim\mathrm{Uniform}(\mathcal{D}_{\mathrm{miss}})$. Then, for any $\alpha\in(0,1)$, $\widehat{\bm{C}}^{\mathrm{indiv}}({\mathcal{I}}^{*};\bm{M}_{\mathcal{D}_{\mathrm{obs}}},\alpha)$ in (A40) satisfies

$$\mathbb{P}\left[\bm{M}_{\mathcal{I}^{*}}\in\widehat{\bm{C}}^{\mathrm{indiv}}({\mathcal{I}}^{*};\bm{M}_{\mathcal{D}_{\mathrm{obs}}},\alpha)\mid\mathcal{D}_{\mathrm{train}}\right]\geq 1-\alpha.$$

Proof.

This result follows directly from the proof of Theorem 3.2 in \citetgui2023conformalized. Alternatively, the following proof can be obtained by applying our Lemma 2. Conditioning on $n$ , such that $\mathcal{D}_{\mathrm{cal}}=\{\mathcal{I}_{1},\dots,\mathcal{I}_{n}\}$ , note that the joint distribution of $\mathcal{I}_{1},\dots,\mathcal{I}_{n},{\mathcal{I}}^{*}$ trivially satisfies the leave-one-out exchangeability condition defined in Lemma 2. Specifically, let $I_{1},\dots,I_{n},I_{n+1}$ be a realization of $\mathcal{I}_{1},\dots,\mathcal{I}_{n},{\mathcal{I}}^{*}$ , so that $D_{\mathrm{cal}}\coloneqq\{I_{1},\dots,I_{n}\}$ is a realization of $\mathcal{D}_{\mathrm{cal}}$ sampled according to (A37). Then,

$$\begin{aligned}
&\mathbb{P}\left({\mathcal{I}}_{1}=I_{1},\dots,{\mathcal{I}}_{n}=I_{n},{\mathcal{I}}^{*}=I_{n+1}\mid\mathcal{D}_{\mathrm{train}},\left\lvert\mathcal{D}_{\mathrm{obs}}\right\rvert=n_{\mathrm{obs}},\left\lvert\mathcal{D}_{\mathrm{cal}}\right\rvert=n\right)\\
&\quad=\mathbb{P}\left(\mathcal{D}_{\mathrm{cal}}=D_{\mathrm{cal}},{\mathcal{I}}^{*}=I_{n+1}\mid\mathcal{D}_{\mathrm{train}},\left\lvert\mathcal{D}_{\mathrm{obs}}\right\rvert=n_{\mathrm{obs}},\left\lvert\mathcal{D}_{\mathrm{cal}}\right\rvert=n\right)\\
&\quad=g(D_{\mathrm{cal}}\cup\left\{I_{n+1}\right\})\cdot\prod_{(r,c)\in D_{\mathrm{cal}}}\frac{p_{r,c}}{1-p_{r,c}},
\end{aligned}$$

where the second equality follows from Lemma 3.1 in \citetgui2023conformalized, for a suitable function $g$ that is invariant to any permutation of its input. Further, it follows that

$$\begin{aligned}
&\mathbb{P}\left({\mathcal{I}}_{1}=I_{1},\dots,{\mathcal{I}}_{n}=I_{n},{\mathcal{I}}^{*}=I_{n+1}\mid\mathcal{D}_{\mathrm{train}},\left\lvert\mathcal{D}_{\mathrm{obs}}\right\rvert=n_{\mathrm{obs}},\left\lvert\mathcal{D}_{\mathrm{cal}}\right\rvert=n\right)\\
&\quad=\left[g(D_{\mathrm{cal}}\cup\left\{I_{n+1}\right\})\cdot\prod_{(r,c)\in D_{\mathrm{cal}}\cup\left\{I_{n+1}\right\}}\frac{p_{r,c}}{1-p_{r,c}}\right]\cdot\frac{1-p_{I_{n+1}}}{p_{I_{n+1}}}\\
&\quad=\left[g(D_{\mathrm{cal}}\cup\left\{I_{n+1}\right\})\cdot\prod_{(r,c)\in D_{\mathrm{cal}}\cup\left\{I_{n+1}\right\}}\frac{p_{r,c}}{1-p_{r,c}}\right]\cdot R_{I_{n+1}},
\end{aligned}$$

with $R_{I_{n+1}}$ defined as in (A39). This proves that $\mathcal{I}_{1},\dots,\mathcal{I}_{n},{\mathcal{I}}^{*}$ are leave-one-out exchangeable random variables by the definition in (16), with $\bar{h}(D_{\mathrm{cal}},I_{n+1})=R_{I_{n+1}}$ . Therefore, the coverage guarantee of Proposition A3 follows directly from Lemma 2. ∎

A1.2 Limitations of the Unadjusted and Bonferroni Baselines

It is not easy to construct informative simultaneous confidence regions satisfying (5) and, to the best of our knowledge, there are no satisfactory alternatives to the method proposed in this paper. In fact, standard conformal methods are designed to deal with one test point at a time, and directly aggregating separate prediction intervals into a joint confidence region is neither precise nor efficient in our context, as explained in more detail below.

Recall that the conformalized matrix completion method of \citetgui2023conformalized, reviewed in Appendix A1.1, is designed to construct a confidence interval $\widehat{C}^{\mathrm{indiv}}(X^{*}_{1};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ for one missing entry at a time, denoted as $X^{*}_{1}$ , such that

$$\mathbb{P}[M_{X^{*}_{1}}\in\widehat{C}^{\mathrm{indiv}}(X^{*}_{1};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)]\geq 1-\alpha\tag{A41}$$

under a suitable sampling model for $\bm{X}_{\mathrm{obs}}$ and $X^{*}_{1}$ . The model for $\bm{X}_{\mathrm{obs}}$ and $X^{*}_{1}$ considered by \citetgui2023conformalized is different from ours, as they treat $n_{\mathrm{obs}}$ as random, rely on independent Bernoulli observations instead of sampling without replacement, and do not consider the possibility that the sampling weights $\bm{w}^{*}$ in (3) may be non-uniform. However, a similar idea can be adapted to construct confidence intervals $\widehat{C}^{\mathrm{indiv}}(X^{*}_{1};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ for a single matrix entry $X^{*}_{1}$ under our sampling model (1)–(3), as explained in Appendix A1.3. In any case, regardless of these modeling details, the limitations of the baseline approaches within our simultaneous inference context can already be understood as follows.

If the goal is to make joint predictions for a group of $K$ matrix entries, concatenating individual-level predictions clearly does not guarantee simultaneous coverage in the sense of (5), as the errors across different coordinates tend to accumulate. This may be seen as an instance of the prototypical multiple testing problem. The unadjusted baseline approach essentially computes:

$$\widehat{\bm{C}}^{\mathrm{Unadj}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha):=\left(\widehat{C}^{\mathrm{indiv}}(X^{*}_{1};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha),\ldots,\widehat{C}^{\mathrm{indiv}}(X^{*}_{K};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)\right).\tag{A42}$$

As demonstrated by Figure 5 and by the synthetic experiments in Section 5.1, this approach often leads to low simultaneous coverage.

Figure 1 previewed the performance of a second baseline approach that relies on a simple but inefficient Bonferroni correction to approximately ensure simultaneous coverage. Intuitively, this tries to (conservatively) account for the multiplicity of the problem by applying (A42) at level $\alpha/K$ instead of $\alpha$ , computing

$$\widehat{\bm{C}}^{\mathrm{Bonf}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha):=\left(\widehat{C}^{\mathrm{indiv}}\left(X^{*}_{1};\bm{M}_{\bm{X}_{\mathrm{obs}}},\frac{\alpha}{K}\right),\ldots,\widehat{C}^{\mathrm{indiv}}\left(X^{*}_{K};\bm{M}_{\bm{X}_{\mathrm{obs}}},\frac{\alpha}{K}\right)\right).\tag{A43}$$

Although a Bonferroni correction may seem reasonable at first sight, it is still unsatisfactory for at least two reasons. Firstly, it is not rigorous because we know the $K$ missing entries indexed by $\bm{X}^{*}$ must belong to the same column, but this constraint cannot be easily taken into account by individual-level predictions. Secondly, and even more crucially, the Bonferroni correction tends to be overly conservative in practice because the coverage events $M_{X^{*}_{k}}\in\widehat{C}^{\mathrm{indiv}}(X^{*}_{k};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ for different values of $k\in[K]$ are mutually dependent, since they are all affected by the same observations $\bm{X}_{\mathrm{obs}}$. These dependencies, however, are potentially very complex.
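A back-of-the-envelope calculation helps to see both failure modes. The sketch below evaluates the joint coverage of the two baselines under two idealized dependence regimes, neither of which is claimed to hold in the paper's setting: fully independent miscoverage events, and perfectly dependent ones (all $K$ intervals miss together). It further assumes that each individual interval has exact marginal coverage. The function name `joint_coverage` is hypothetical.

```python
def joint_coverage(alpha, K, bonferroni, dependence):
    """Joint coverage of K intervals with exact marginal coverage 1 - level.

    Assumption: only two stylized regimes are considered, 'independent'
    miscoverage events or 'perfect' dependence (all events coincide).
    """
    level = alpha / K if bonferroni else alpha
    if dependence == "independent":
        return (1.0 - level) ** K
    if dependence == "perfect":
        return 1.0 - level
    raise ValueError("dependence must be 'independent' or 'perfect'")

alpha = 0.1
for K in (2, 5, 8):
    print(
        K,
        round(joint_coverage(alpha, K, False, "independent"), 4),  # unadjusted
        round(joint_coverage(alpha, K, True, "independent"), 4),   # Bonferroni
        round(joint_coverage(alpha, K, True, "perfect"), 4),       # Bonferroni
    )
```

Under independence, the unadjusted region falls well below the nominal level as $K$ grows, while the Bonferroni region is only slightly above it. Under strong positive dependence, the Bonferroni region covers with probability close to $1-\alpha/K$, far above the target $1-\alpha$, which is the source of the excess width.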

A1.3 Implementation Details for the Baselines

To facilitate the empirical comparison with our method, which relies on the sampling model for $\bm{X}_{\mathrm{obs}}$ and $\bm{X}^{*}$ defined in (1)–(3), in this paper we apply the unadjusted and Bonferroni baseline approaches described in Appendix A1.2 based on individual-level conformal prediction intervals $\widehat{C}^{\mathrm{indiv}}$ obtained as follows. Instead of directly applying the conformalized matrix completion method of \citetgui2023conformalized, we repeatedly apply our own method separately for each element $X^{*}_{k}$ in $\bm{X}^{*}=(X^{*}_{1},X^{*}_{2},\ldots,X^{*}_{K})$, imagining each time that we are dealing with a trivial group of size 1. This provides us with individual-level prediction intervals $\widehat{C}^{\mathrm{indiv}}$ that are similar in spirit to those of \citetgui2023conformalized but whose construction more faithfully mirrors the sampling model assumed in this paper (although they still ignore the constraint that all elements of $\bm{X}^{*}$ must belong to the same column). In summary, the implementation of the unadjusted and Bonferroni baseline approaches applied in this paper is outlined by Algorithms A3 and A4, respectively.

Algorithm A3 Unadjusted confidence region for multiple missing matrix entries

1: Input: partially observed matrix $\bm{M}_{\bm{X}_{\mathrm{obs}}}$, with unordered list of observed indices $\mathcal{D}_{\mathrm{obs}}$;
2: Input: test group $\bm{X}^{*}$; nominal coverage level $\alpha\in(0,1)$;
3: Input: any matrix completion algorithm producing point estimates;
4: Input: any prediction rule $\mathcal{C}$ satisfying (6) and (7);
5: Input: desired number $n$ of calibration entries.
6: Apply Algorithm 2 with group size $K=1$ to obtain $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{cal}}$;
7: Compute a point estimate $\widehat{\bm{M}}$, looking only at the observations in $\mathcal{D}_{\mathrm{train}}$;
8: Compute the conformity scores $S_{i}$, for all $i\in[n]$, with Equation (9).
9: for all $k\in[K]$ do
10: Compute $\widehat{\tau}_{\alpha,1}$ in (10), based on the weights $p_{i}$ given by (19) with $\bm{x}_{n+1}=X^{*}_{k}$ in Section 3.4.
11: Compute $\widehat{C}^{\mathrm{indiv}}(X^{*}_{k};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)=\widehat{\bm{C}}(X^{*}_{k};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ given by (11).
12: end for
13: Output: Joint confidence region $\widehat{\bm{C}}^{\mathrm{Unadj}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ given by Equation (A42).

Algorithm A4 Bonferroni-style confidence region for multiple missing matrix entries

1: Input: partially observed matrix $\bm{M}_{\bm{X}_{\mathrm{obs}}}$, with unordered list of observed indices $\mathcal{D}_{\mathrm{obs}}$;
2: Input: test group $\bm{X}^{*}$; nominal coverage level $\alpha\in(0,1)$;
3: Input: any matrix completion algorithm producing point estimates;
4: Input: any prediction rule $\mathcal{C}$ satisfying (6) and (7);
5: Input: desired number $n$ of calibration entries.
6: Apply Algorithm 2 with group size $K=1$ to obtain $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{cal}}$;
7: Compute a point estimate $\widehat{\bm{M}}$, looking only at the observations in $\mathcal{D}_{\mathrm{train}}$;
8: Compute the conformity scores $S_{i}$, for all $i\in[n]$, with Equation (9).
9: for all $k\in[K]$ do
10: Compute $\widehat{\tau}_{\frac{\alpha}{K},1}$ in (10), based on the weights $p_{i}$ given by (19) with $\bm{x}_{n+1}=X^{*}_{k}$ in Section 3.4.
11: Compute $\widehat{C}^{\mathrm{indiv}}(X^{*}_{k};\bm{M}_{\bm{X}_{\mathrm{obs}}},\frac{\alpha}{K})=\widehat{\bm{C}}(X^{*}_{k};\bm{M}_{\bm{X}_{\mathrm{obs}}},\frac{\alpha}{K})$ given by (11).
12: end for
13: Output: Joint confidence region $\widehat{\bm{C}}^{\mathrm{Bonf}}(\bm{X}^{*};\bm{M}_{\bm{X}_{\mathrm{obs}}},\alpha)$ given by Equation (A43).

A1.4 Review of the Classical Laplace Method

This section provides a concise review of the classical version of Laplace’s method, as detailed for example in \citetbutler2007saddlepoint. This method is a powerful tool for approximating analytically intractable integrals of the form $\int_{a}^{b}e^{nf(x)}h(x)dx$ , where the function $f$ is sufficiently well-behaved and smooth, with a unique global maximum at an interior point $x_{0}\in(a,b)$ , the function $h$ is positive and does not vary significantly near $x_{0}$ , and $n$ is a relatively large constant. The method hinges on the principle that this integral’s value is predominantly determined by a small region around the point where $f$ achieves its maximum. This idea is explained in more detail and motivated precisely below.

Let $f(x)$ be a twice continuously differentiable function on an interval $(a,b)$ , and assume there exists a unique global maximum at an interior point $x_{0}\in(a,b)$ , such that $f(x_{0})=\max_{x\in(a,b)}f(x)$ and $f^{\prime\prime}(x_{0})<0$ . Suppose $h$ is a function that varies slowly around $x_{0}$ and is such that $h(x)>0$ for all $x\in(a,b)$ . Then, Laplace’s approximation of the integral $\int_{a}^{b}e^{nf(x)}h(x)\,dx$ is given by

\int_{a}^{b}e^{nf(x)}h(x)\,dx\approx e^{nf(x_{0})}h(x_{0})\sqrt{\frac{2\pi}{-nf^{\prime\prime}(x_{0})}}.

(A44)

A standard mathematical justification for this approximation starts by proving that, under suitable technical assumptions on $f$ and $h$ in the spirit of the intuitive conditions outlined above,

\lim_{n\to\infty}\frac{\int_{a}^{b}e^{nf(x)}h(x)\,dx}{e^{nf(x_{0})}h(x_{0})\sqrt{\frac{2\pi}{n\left(-f^{\prime\prime}(x_{0})\right)}}}=1.

(A45)

The classical proof of (A45) consists of three high-level steps:

1.

Local second-order approximation: Approximate $f(x)$ near $x_{0}$ using a second-order Taylor expansion: $f(x)\approx f(x_{0})+\frac{1}{2}f^{\prime\prime}(x_{0})(x-x_{0})^{2}$ .
2.

Integral transformation: Standardize the quadratic term in the integral to apply results from Gaussian integral analysis.
3.

Asymptotic evaluation: Assess the integral in the standardized coordinates to achieve the asymptotic equivalence in (A45).
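The limit in (A45) can be checked numerically on a toy example. The sketch below is only illustrative: the functions $f(x)=\log(4x(1-x))$ and $h(x)=1+x$ on $(0,1)$ are assumptions chosen for convenience (they do not appear elsewhere in the paper), and the integral is evaluated with a plain midpoint rule. For this choice, $x_{0}=1/2$ , $f(x_{0})=0$ and $f^{\prime\prime}(x_{0})=-8$ .

```python
import math

def f(x):
    # Toy exponent (an assumption for illustration): maximized at x0 = 1/2,
    # with f(1/2) = 0 and f''(1/2) = -8.
    return math.log(4.0 * x * (1.0 - x))

def h(x):
    # Toy positive, slowly varying factor (also an assumption).
    return 1.0 + x

def midpoint_integral(n, num=20_000):
    """Midpoint-rule value of the integral of exp(n f(x)) h(x) over (0, 1)."""
    step = 1.0 / num
    return step * sum(
        math.exp(n * f((k + 0.5) * step)) * h((k + 0.5) * step)
        for k in range(num)
    )

def laplace_value(n, x0=0.5, f2_x0=-8.0):
    """Right-hand side of (A44) for the toy functions above."""
    return math.exp(n * f(x0)) * h(x0) * math.sqrt(2.0 * math.pi / (-n * f2_x0))

def laplace_ratio(n):
    """Ratio appearing inside the limit in (A45)."""
    return midpoint_integral(n) / laplace_value(n)
```

Evaluating `laplace_ratio(n)` for increasing values of `n` should give ratios that move closer to one, in line with (A45).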

The proof of Theorem A4, presented in Appendix A4.6, follows a similar high-level strategy, although its details are significantly more involved because our integral of interest in (24) cannot be directly written as $\int_{a}^{b}e^{nf(x)}h(x)\,dx$ for some functions $f,h$ .

Appendix A2 Additional Methodological Details

A2.1 Practical Computation of the Conformity Scores

As detailed in Section 3.1, our method allows flexibility in the choice of the prediction rule $\mathcal{C}$ , which uniquely determines the conformity scores. In this section, we explore three practical options for the prediction rules and their respective conformity scores.

A2.1.1 Hyper-Cubic Confidence Regions

An intuitive prediction rule, introduced in Section 3.1, is:

\mathcal{C}(\bm{x}^{*},\tau,\widehat{\bm{M}})=\left(\widehat{\bm{M}}_{x^{*}_{1}}\pm\frac{\tau}{1-\tau},\ldots,\widehat{\bm{M}}_{x^{*}_{K}}\pm\frac{\tau}{1-\tau}\right),

with the parameter $\tau$ taking values in $[0,1]$ . Note that this rule leads to hyper-cubic confidence regions, with constant widths for all users in a group.

The conformity scores corresponding to this rule can be written explicitly, for any $i\in[n]$ , as:

\displaystyle S_{i}

\displaystyle=\max\left\{\frac{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,1}}-\bm{M}_{X^{\mathrm{cal}}_{i,1}}\rvert}{1+\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,1}}-\bm{M}_{X^{\mathrm{cal}}_{i,1}}\rvert},\ldots,\frac{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,K}}-\bm{M}_{X^{\mathrm{cal}}_{i,K}}\rvert}{1+\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,K}}-\bm{M}_{X^{\mathrm{cal}}_{i,K}}\rvert}\right\}.

Remark. The function $x\mapsto x/(1+x)$ is strictly increasing for $x\geq 0$ . Therefore, we can equivalently define the prediction set as follows. Let

\displaystyle\tilde{S}_{i}=\max\left\{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,1}}-\bm{M}_{X^{\mathrm{cal}}_{i,1}}\rvert,\ldots,\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,K}}-\bm{M}_{X^{\mathrm{cal}}_{i,K}}\rvert\right\}

(A46)

and define the alternate confidence set as

\mathcal{C}^{\prime}(\bm{x}^{*},\tau,\widehat{\bm{M}})=\left(\widehat{\bm{M}}_{x^{*}_{1}}\pm\tau,\ldots,\widehat{\bm{M}}_{x^{*}_{K}}\pm\tau\right),

with $\tau$ taking value in $[0,+\infty)$ . The expression in (A46) is more closely related to the typical notation in the conformal inference literature; e.g., see \citetlei2018distribution.

A2.1.2 Hyper-Rectangular Confidence Regions

An alternative type of prediction rule, yielding intervals of varying lengths for different users, involves scaling the hyper-cube defined in Section A2.1.1. This modification may be particularly useful in applications involving count data with wide ranges, where the variance may be expected to increase in proportion to the observed values. We define this linearly-scaled prediction rule as

\displaystyle\mathcal{C}(\bm{x}^{*},\tau,\widehat{\bm{M}})=\left(\widehat{\bm{M}}_{x^{*}_{1}}\pm\lvert\widehat{\bm{M}}_{x^{*}_{1}}\rvert\tau,\ldots,\widehat{\bm{M}}_{x^{*}_{K}}\pm\lvert\widehat{\bm{M}}_{x^{*}_{K}}\rvert\tau\right),

(A47)

which leads to confidence regions in the shape of a hyper-rectangle. The corresponding scores are:

\displaystyle S_{i}

\displaystyle=\max\left\{\frac{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,1}}-\bm{M}_{X^{\mathrm{cal}}_{i,1}}\rvert}{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,1}}\rvert},\ldots,\frac{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,K}}-\bm{M}_{X^{\mathrm{cal}}_{i,K}}\rvert}{\lvert\widehat{\bm{M}}_{X^{\mathrm{cal}}_{i,K}}\rvert}\right\}.

A2.1.3 Hyper-Spherical Confidence Regions

The prediction rules described above all result in confidence regions with a hyper-rectangular shape. Alternatively, one can construct a confidence region with a hyper-spherical shape using the following prediction rule, where $\|\cdot\|_{2}$ represents the Euclidean norm:

\displaystyle\mathcal{C}(\bm{x}^{*},\tau,\widehat{\bm{M}})=\left\{\bm{x}\in\mathbb{R}^{K}:\|\bm{x}-\widehat{\bm{M}}_{\bm{x}^{*}}\|_{2}\leq\tau\right\}.

(A48)

The corresponding conformity scores are

\displaystyle S_{i}\coloneqq\inf\left\{\tau\in\mathbb{R}:\|\bm{M}_{\bm{X}^{\mathrm{cal}}_{i}}-\widehat{\bm{M}}_{\bm{X}^{\mathrm{cal}}_{i}}\|_{2}\leq\tau\right\}=\|\bm{M}_{\bm{X}^{\mathrm{cal}}_{i}}-\widehat{\bm{M}}_{\bm{X}^{\mathrm{cal}}_{i}}\|_{2}.

Note that replacing the Euclidean norm with the max norm in (A48) recovers the hyper-cubic prediction rule.

The concept of a hyper-spherical confidence region is quite rare in the conformal inference literature, where the majority of existing methods focus on constructing confidence sets for a single test point individually. However, when aiming to provide simultaneous coverage for multiple entries, it becomes possible to develop confidence regions of varying geometric shapes.
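To make the three options concrete, the following minimal sketch computes the corresponding conformity score for a single calibration group, given the $K$ point estimates and the $K$ observed values in that group. The function names and the array-based interface are assumptions made for illustration and are not part of the paper's software.

```python
import numpy as np

def hypercube_score(m_hat, m_true):
    """Score for the hyper-cubic rule: max_k r_k / (1 + r_k), r_k = |m_hat_k - m_k|."""
    r = np.abs(np.asarray(m_hat, float) - np.asarray(m_true, float))
    return float(np.max(r / (1.0 + r)))

def hyperrectangle_score(m_hat, m_true):
    """Score for the linearly scaled rule (A47): max_k r_k / |m_hat_k|.
    Assumes every point estimate in the group is non-zero."""
    m_hat = np.asarray(m_hat, float)
    r = np.abs(m_hat - np.asarray(m_true, float))
    return float(np.max(r / np.abs(m_hat)))

def hypersphere_score(m_hat, m_true):
    """Score for the hyper-spherical rule (A48): Euclidean norm of the residuals."""
    diff = np.asarray(m_hat, float) - np.asarray(m_true, float)
    return float(np.linalg.norm(diff))
```

Each function takes one group at a time, so evaluating the scores for all $n$ calibration groups amounts to a loop (or a vectorized call) over the groups.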

A2.2 Efficient Evaluation of the Conformalization Weights

We discuss in more detail here the choice of scaling parameter $h$ for the function $\Phi(\tau;h)$ defined in Equation (23) (Section 4.1). This free parameter controls the location of $\tau_{h}=\operatorname*{arg\,max}_{\tau\in(0,1)}\Phi(\tau;h)$ . Since our Laplace approximation hinges on $\tau_{h}$ being not too close to the integration boundary, an intuitive and effective choice is to set $h$ so that $\tau_{h}=1/2$ ; e.g., see \citetfog2007wnchypg. We explain below how to achieve this using the Newton-Raphson algorithm.

Recall Lemma 4, which tells us that the function $\Phi(\tau;h)$ has a unique stationary point with respect to $\tau$ at some value $\tau_{h}\in(0,1)$ , and that this stationary point is a global maximum. Therefore, since the function $\Phi(\tau;h)$ is smooth, the Newton-Raphson algorithm can be applied as follows to find a value of $h>0$ such that $\tau_{h}=1/2$ . Define

\displaystyle z(h):=\frac{\phi^{\prime}(\frac{1}{2};h)}{2h}=\delta-\frac{1}{h}-\sum_{(r,c)\in D_{\mathrm{obs}}}\frac{w_{r,c}}{2^{hw_{r,c}}-1},

and note that this function is monotonically increasing in $h$ . Then it suffices to find $h$ such that $z(h)=0$ . Note that $z(h)$ is smooth, and $z^{\prime}(h)>0$ , $z^{\prime\prime}(h)<0$ for $h\in[1/\delta,\infty)$ . It is also clear that the solution of $z(h)=0$ must be greater than $1/\delta$ . Further, $z(h)$ has a unique root in the interval $[1/\delta,\infty)$ because $z(1/\delta)<0$ and $\lim_{h\rightarrow\infty}z(h)=\delta>0$ .

Thus, it follows from Theorem 2.2 in \citetatkinson1989numerical that the Newton-Raphson algorithm will converge quadratically to the root of $z(h)=0$ , for any starting point within the interval $[1/\delta,\infty)$ . In practice, one can choose $1/\delta$ as the starting point.

The time complexity of the Newton-Raphson iteration depends on the desired precision level. If the tolerable error is a predetermined small constant, the iteration terminates after a constant number of updates due to quadratic convergence. Evaluating $z(h)$ and $z^{\prime}(h)$ at any given $h$ requires $\mathcal{O}(n_{\mathrm{obs}})$ operations. Hence, solving $z(h)=0$ takes $\mathcal{O}(n_{\mathrm{obs}})$ time.
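The iteration described above can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: the function names, the stopping rule and its tolerance are assumptions, while `w` stands for the vector of sampling weights $w_{r,c}$ of the observed entries and `delta` for the constant $\delta$ appearing in $z(h)$ . The derivative is obtained by differentiating $z(h)$ term by term.

```python
import numpy as np

LOG2 = np.log(2.0)

def z(h, w, delta):
    """z(h) = delta - 1/h - sum_{(r,c)} w_{r,c} / (2^{h w_{r,c}} - 1)."""
    return delta - 1.0 / h - np.sum(w / np.expm1(h * w * LOG2))

def z_prime(h, w):
    """z'(h) = 1/h^2 + log(2) * sum w^2 2^{hw} / (2^{hw} - 1)^2, written with
    the identity 2^{hw} / (2^{hw} - 1)^2 = 1 / (4 sinh^2(h w log(2) / 2))."""
    a = h * w * LOG2
    return 1.0 / h**2 + LOG2 * np.sum(w**2 / (4.0 * np.sinh(a / 2.0) ** 2))

def solve_h(w, delta, tol=1e-12, max_iter=100):
    """Newton-Raphson iteration for z(h) = 0, started at h = 1 / delta."""
    w = np.asarray(w, dtype=float)
    h = 1.0 / delta  # starting point suggested in the text
    for _ in range(max_iter):
        step = z(h, w, delta) / z_prime(h, w)
        h -= step
        if abs(step) <= tol * max(1.0, abs(h)):  # assumed stopping rule
            break
    return h
```

Each iteration evaluates two sums over the observed entries, which is the $\mathcal{O}(n_{\mathrm{obs}})$ per-iteration cost discussed above. Because $z$ is increasing and concave on $[1/\delta,\infty)$ and $z(1/\delta)<0$ , the iterates started at $1/\delta$ increase monotonically towards the root.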

A2.3 Computational Shortcuts and Complexity Analysis

A2.3.1 Evaluation of the Conformalization Weights

Evaluating the simplified weights $\bar{p}_{i}$ in (29) only involves arithmetic operations and can be carried out for all $i\in[n+1]$ at a total computational cost roughly of order $\mathcal{O}(n_{r}n_{c}+nK)$ . To understand this, first note that computing $\eta_{i}(\tau_{h})$ defined in (25), for all $i\in[n+1]$ with any given $\tau_{h}$ and $h$ , has cost $\mathcal{O}(nK)$ ; and finding the correct value of $\tau_{h}$ and $h$ according to (26) has cost $\mathcal{O}(n_{\mathrm{obs}})$ , or equivalently no worse than $\mathcal{O}(n_{r}n_{c})$ , as explained in Section A2.2.

Next, evaluating $\widetilde{w}^{*}_{x_{i,1}}\cdot\ldots\cdot\widetilde{w}^{*}_{x_{i,K}}$ for all $i\in[n+1]$ has cost $\mathcal{O}(n_{r}n_{c}+nK)$ , because the constant $\sum_{(r,c)\in\bar{D}_{\mathrm{miss}}}w^{*}_{r,c}$ in the denominators of (20) and (21) can be pre-computed at cost $\mathcal{O}(n_{r}n_{c})$ , while the remaining terms in (20) and (21) can be evaluated at cost $\mathcal{O}(K)$ separately for each $i\in[n+1]$ .

The cost of evaluating the term within the square brackets in (29) for all $i\in[n+1]$ is $\mathcal{O}(n_{r}+nK)$ . This is achieved by pre-computing factorials up to $n_{r}$ since $n^{c}_{\mathrm{obs}}$ is upper-bounded by $n_{r}$ for any $c\in[n_{c}]$ . Then for each $i\in[n+1]$ , computing binomial coefficients, given the pre-computed factorials, takes constant time, and the remaining term in the brackets requires $\mathcal{O}(K)$ . Putting everything together, evaluating the conformalization weights in (29) for all $i\in[n+1]$ has cost $\mathcal{O}(n_{r}n_{c}+nK)$ .

A2.3.2 Cost Analysis of Algorithm 1

Analysis for a single test group. The cost of computing a confidence region for a single test group is $\mathcal{O}(T+n_{r}n_{c}+n(K+\log n))$ , as shown below.

•

Training the black-box matrix completion model has cost $\mathcal{O}(T)$ .
•

After the black-box model is trained, the cost of computing scores $S_{i}$ for all $i\in[n]$ is $\mathcal{O}(nK)$ .
•

The cost of computing $p_{i}$ for all $i\in[n+1]$ is $\mathcal{O}(n_{r}n_{c}+nK)$ , as explained in Section A2.3.1.
•

After the conformalization weights are computed, the cost of computing $\widehat{\tau}_{\alpha,K}$ is $\mathcal{O}(n\log n)$ . This is because sorting the scores $S_{i}$ for all $i\in[n]$ has a worst-case cost of $\mathcal{O}(n\log n)$ , while it takes $\mathcal{O}(n)$ to find the weighted quantile based on $p_{i}$ and the sorted scores.

Therefore, the overall cost is $\mathcal{O}(T+n_{r}n_{c}+n(K+\log n))$ .
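The last step in the list above (sorting followed by a linear scan) can be sketched as follows. The exact definition of $\widehat{\tau}_{\alpha,K}$ is given in (10) and is not reproduced here; the sketch assumes it is the level- $(1-\alpha)$ quantile of the weighted empirical distribution of the calibration scores, with the test weight $p_{n+1}$ placed at $+\infty$ as in Lemma 1. The function name and interface are hypothetical.

```python
import numpy as np

def weighted_quantile_with_inf(scores, p_cal, level):
    """Smallest calibration score whose cumulative weight reaches `level`.

    `p_cal` holds the weights p_1, ..., p_n of the calibration scores; the
    remaining mass 1 - sum(p_cal) is the test weight p_{n+1}, placed at
    +infinity. The result is therefore +infinity whenever the calibration
    mass alone never reaches `level`.
    """
    scores = np.asarray(scores, float)
    p_cal = np.asarray(p_cal, float)
    order = np.argsort(scores, kind="stable")             # O(n log n) sort
    cum = np.cumsum(p_cal[order])                         # O(n) scan
    idx = int(np.searchsorted(cum, level, side="left"))   # first cum >= level
    return float(scores[order][idx]) if idx < len(scores) else float("inf")
```

With $m$ test groups, the sort can be performed once and only the cumulative sums need to be recomputed for each new set of weights, which matches the $\mathcal{O}(n\log n+nm)$ cost reported below for that setting.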

Analysis for $m$ distinct test groups. The cost of computing confidence regions for $m$ distinct test groups is $\mathcal{O}(T+n_{r}n_{c}+n(\log n+mK))$ , as shown below.

•

Training the black-box matrix completion model has cost $\mathcal{O}(T)$ , since the model only needs to be trained once.
•

The cost of computing conformity scores $S_{i}$ for all $i\in[n]$ is $\mathcal{O}(nK)$ , since the calibration groups are the same for any new test group.
•

The cost of computing $p_{i}$ for all $i\in[n+m]$ is $\mathcal{O}(n_{r}n_{c}+mnK)$ , as explained in Section A2.3.1.
•

After the conformalization weights are computed, the cost of computing the confidence sets for all $m$ test groups is $\mathcal{O}(n\log n+nm)$ . Sorting the scores $S_{i}$ for all $i\in[n]$ has a worst-case cost of $\mathcal{O}(n\log n)$ , which only needs to be performed once. For any $j\in[m]$ , it takes $\mathcal{O}(n)$ to find the weighted quantile given weights $\left\{p_{1},\dots,p_{n},p_{n+j}\right\}$ and the sorted scores.

Therefore, the overall cost is $\mathcal{O}(T+n_{r}n_{c}+n(\log n+mK))$ .

A2.3.3 Cost Analysis of Algorithm 2

•

For each column, the cost of computing $m^{c}:=n^{c}_{\mathrm{obs}}-K\lfloor n^{c}_{\mathrm{obs}}/K\rfloor<K$ is $\mathcal{O}(n_{r})$ , and the cost of sampling $m^{c}$ indices uniformly at random is $\mathcal{O}(K)$ . Hence the cost of sampling the pruned indices for all columns is $\mathcal{O}(n_{c}(n_{r}+K))$ , which simplifies to $\mathcal{O}(n_{c}n_{r})$ by the fact that $K<n_{r}$ .
•

Initializing $\mathcal{D}_{\mathrm{avail}}$ given the pruned indices $\mathcal{D}_{\mathrm{prune}}$ has cost $\mathcal{O}(n_{c}n_{r})$ .
•

After $\mathcal{D}_{\mathrm{avail}}$ is initialized, the cost of sampling the $i$ th calibration group (and updating $\mathcal{D}_{\mathrm{cal}}$ and $\mathcal{D}_{\mathrm{avail}}$ ) is $\mathcal{O}(K)$ , for each $i\in[n]$ . Hence sampling all $n$ calibration groups takes $\mathcal{O}(nK)$ .

Therefore, Algorithm 2 has time complexity of $\mathcal{O}(n_{c}n_{r}+nK)$ , and it does not need to be repeatedly applied when dealing with distinct groups involving the same matrix.

A2.4 Estimation of the Sampling Weights

We describe here a method, inspired by \citetgui2023conformalized, to estimate empirically the sampling weights $\bm{w}$ for our sampling model in (1), leveraging the available matrix observations indexed by $\mathcal{D}_{\mathrm{obs}}$ . In general, this estimation problem is made feasible by introducing the assumption that $\bm{w}$ has some lower-dimensional structure that can be summarized for example by a parametric model. The approach suggested by \citetgui2023conformalized assumes that the weight matrix $\bm{w}\in\mathbb{R}^{n_{r}\times n_{c}}$ is low-rank. For simplicity, we follow the same approach here, although our framework could also accommodate alternative estimation techniques in situations where different modeling assumptions about $\bm{w}$ may be justified.

Suppose the sampling weights follow the parametric model

\displaystyle\log\left(\frac{w_{r,c}}{1-w_{r,c}}\right)=A_{r,c},

where $\bm{A}\in\mathbb{R}^{n_{r}\times n_{c}}$ is a matrix with rank $\rho$ and bounded infinity norm; i.e., $\lVert\bm{A}\rVert_{\infty}\leq\nu$ , for some pre-defined constant $\nu\in\mathbb{R}$ . If each matrix entry $(r,c)$ is independently observed (i.e., included in $\mathcal{D}_{\mathrm{obs}}$ ) with probability $w_{r,c}$ , that is,

\displaystyle\mathbb{I}\left[(r,c)\in\mathcal{D}_{\mathrm{obs}}\right]\overset{\text{ind.}}{\sim}\text{Bernoulli}(w_{r,c}),

(A49)

then the log-likelihood of $\bm{A}$ can be written as

\displaystyle\mathcal{L}_{\mathcal{D}_{\mathrm{obs}}}(\bm{A})

\displaystyle=\sum\limits_{(r,c)\in\mathcal{D}_{\mathrm{obs}}}\log(l(A_{r,c}))+\sum\limits_{(r,c)\in[n_{r}]\times[n_{c}]\setminus\mathcal{D}_{\mathrm{obs}}}\log(1-l(A_{r,c})),

(A50)

where $l(t)=\left(1+\exp(-t)\right)^{-1}$ . This suggests estimating $\bm{A}$ by solving

\displaystyle\widehat{\bm{A}}=\operatorname*{arg\,max}\limits_{\bm{A}\in\mathbb{R}^{n_{r}\times n_{c}}}\mathcal{L}_{\mathcal{D}_{\mathrm{obs}}}(\bm{A})\quad\text{subject to:}\quad\lVert\bm{A}\rVert_{*}\leq\nu\sqrt{\rho n_{r}n_{c}},\quad\lVert\bm{A}\rVert_{\infty}\leq\nu,

where $\lVert\cdot\rVert_{*}$ is the nuclear norm. Finally, having obtained $\widehat{\bm{A}}$ , the estimated sampling weights $\widehat{w}_{r,c}$ for each $(r,c)\in[n_{r}]\times[n_{c}]$ are given by

\displaystyle\widehat{w}_{r,c}=l(\widehat{A}_{r,c})=1/(1+\exp(-\widehat{A}_{r,c})).

(A51)

In practice, the numerical experiments described in this paper apply this estimation procedure using the default choices of the parameters $\rho$ and $\nu$ suggested by \citetgui2023conformalized.
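For concreteness, the two ingredients of this procedure that do not depend on the optimizer, namely the log-likelihood (A50) and the mapping from $\widehat{\bm{A}}$ to the estimated weights through the logistic function $l$ implied by the logit model above, can be sketched as follows. The constrained maximization itself is not shown, and the function names and the boolean-mask encoding of $\mathcal{D}_{\mathrm{obs}}$ are assumptions made for illustration.

```python
import numpy as np

def logistic(t):
    """l(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(t, float)))

def log_likelihood(A, observed):
    """Log-likelihood (A50); `observed` is a boolean mask, True on D_obs."""
    A = np.asarray(A, float)
    # log l(t) = -log(1 + e^{-t}) and log(1 - l(t)) = -log(1 + e^{t}),
    # both evaluated stably with logaddexp.
    log_p = -np.logaddexp(0.0, -A)
    log_q = -np.logaddexp(0.0, A)
    return float(np.sum(np.where(observed, log_p, log_q)))

def estimated_weights(A_hat):
    """Estimated sampling weights, obtained by inverting the logit link."""
    return logistic(A_hat)
```

Any solver for the nuclear-norm constrained problem above can be combined with `log_likelihood` as the objective; which solver is appropriate is left open here.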

It is worth remarking that the independent Bernoulli observation model (A49) underlying this maximum-likelihood estimation approach differs from the weighted sampling without replacement model (1) that we utilize to calibrate our simultaneous conformal inferences. This discrepancy, however, is both useful and unlikely to cause issues, as explained next. On the one hand, the sampling without replacement model is essential to capture the structured nature of our group-level test case $\bm{X}^{*}$ and of the calibration groups $\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}$ . On the other hand, sampling without replacement would make the likelihood function in (A50) intractable, unnecessarily hindering the estimation process. Fortunately, however, the interpretation of the sampling weights $w_{r,c}$ remains largely consistent across the models (1) and (A49), which justifies the use of the estimated weights $\widehat{w}_{r,c}$ in (A51) for the purpose of calibrating conformal inferences under the model defined in (1).

Appendix A3 Additional Empirical Results

A3.1 Additional Experiments with Synthetic Data

A3.1.1 Heterogeneous Test Sampling Weights

This section describes experiments in which the test group $\bm{X}^{*}$ is sampled according to a model (3) with heterogeneous weights $\bm{w}^{*}$ . As explained in Section 2, the heterogeneous nature of these weights makes it feasible to ensure valid coverage conditional on interesting features of $\bm{X}^{*}$ . Therefore, the following experiments demonstrate the ability of our method to smoothly interpolate between marginal and conditional coverage guarantees, giving practitioners flexibility to up-weight or down-weight different types of test cases, as needed.

The ground-truth matrix $\bm{M}\in\mathbb{R}^{400\times 400}$ is generated according to the random factorization model defined in Equation (33), with rank $l=4$ . We observe $n_{\mathrm{obs}}=48,000$ entries of this matrix, sampled based on the model in (1) with uniform weights $\bm{w}$ ; these are indexed by $\mathcal{D}_{\mathrm{obs}}$ , whose complement is $\mathcal{D}_{\mathrm{miss}}=[n_{r}]\times[n_{c}]\setminus\mathcal{D}_{\mathrm{obs}}$ . Algorithm 1 is then applied as in the previous experiments, using $n=\min\{2000,\lfloor\xi_{\mathrm{obs}}/2\rfloor\}$ calibration groups and allocating the remaining $n_{\mathrm{train}}=n_{\mathrm{obs}}-Kn$ observations for training. For the latter purpose, we rely on the usual alternating least squares approach, with hypothesized rank $4$ , and thus obtain a point estimate $\widehat{\bm{M}}$ and its corresponding factor matrices $\widehat{\bm{U}}\in\mathbb{R}^{n_{r}\times l}$ and $\widehat{\bm{V}}\in\mathbb{R}^{n_{c}\times l}$ , such that $\widehat{\bm{M}}=\widehat{\bm{U}}\widehat{\bm{V}}^{\top}$ .

The weights $\bm{w}^{*}$ for $\bm{X}^{*}$ in (3) are based on an oracle procedure that leverages perfect knowledge of $\bm{M}$ and $\widehat{\bm{M}}$ to construct a sampling process that over-represents portions of the matrix for which the point estimate is less accurate. This process is controlled by a parameter $\delta\in(0,1]$ , which determines the heterogeneity of $\bm{w}^{*}$ . In the special case of $\delta=1$ , the test weights become $w_{r,c}^{*}=1$ for all matrix entries, recovering the experimental setup considered earlier in Section 5.1.1. By contrast, smaller values of $\delta$ tend to increasingly over-sample portions of the matrix for which the point estimate $\widehat{\bm{M}}$ is less accurate. We refer to Appendix A3.1.2 for details about this construction of the test sampling weights, which gives rise to an interesting and particularly challenging experimental setting in which attaining high coverage is intrinsically difficult.

To highlight the importance of correctly accounting for the heterogeneous nature of the test sampling weights $\bm{w}^{*}$ , in these experiments we compare the performance of joint confidence regions obtained with two alternative approaches. The first approach consists of applying Algorithm 1 based on the correct values of the data-generating weights $\bm{w}$ and $\bm{w}^{*}$ . The second approach consists of applying Algorithm 1 based on the correct values of the data-generating weights $\bm{w}$ but incorrectly specified weights $w^{*}_{r,c}=1$ for all $r\in[n_{r}],c\in[n_{c}]$ . In both cases, the nominal significance level is $\alpha=10\%$ , and the methods are evaluated based on 100 random test groups sampled from $\mathcal{D}_{\mathrm{miss}}\setminus\mathcal{D}_{\mathrm{wse}}$ , according to the model in (3) with the weights $\bm{w}^{*}$ defined in Equation (A54) within Section A3.1.2. All results are averaged over 300 independent experiments.

Figure A6 compares the performances of the two aforementioned implementations of our method as a function of the group size $K$ , for different values of the parameter $\delta$ . The results show that our method applied with the correct weights $\bm{w}^{*}$ always achieves the desired 90% simultaneous coverage, as predicted by the theory. By contrast, using mis-specified uniform test sampling weights $w^{*}$ leads to lower coverage than expected, especially for lower values of the parameter $\delta$ . Figure A7 provides an alternative but qualitatively consistent view of these findings, varying the parameter $\delta$ separately for different values of the group size $K$ .

It is interesting to note from Figures A6 and A7 that our method is sometimes slightly over-conservative when applied with highly heterogeneous test sampling weights $\bm{w}^{*}$ (corresponding to small values of the parameter $\delta$ ). This phenomenon is due to the unavoidable challenge of constructing valid confidence regions in the presence of strong distribution shifts, and it can be understood more precisely as follows. Smaller values of $\delta$ result in a stronger distribution shift between the observed data in $\mathcal{D}_{\mathrm{obs}}$ and $\bm{X}^{*}$ , increasing the likelihood that the weighted empirical quantile $\widehat{\tau}_{\alpha,K}$ defined in (10) might become infinite, leading to trivially wide confidence regions. In those (relatively rare) cases in which $\widehat{\tau}_{\alpha,K}$ diverges, to avoid numerical issues we simply set $\widehat{\tau}_{\alpha,K}$ equal to $S_{(n)}$ , the highest calibration conformity score. Fortunately, as shown explicitly in Figure A8, this issue is not very common (it is observed in fewer than 2.5% of the cases), which explains why our method appears to be only slightly over-conservative in Figures A6 and A7.

A3.1.2 Additional Details for Section A3.1.1

The sampling weights $\bm{w}^{*}$ for $\bm{X}^{*}$ utilized in the experiments of Section A3.1.1 are defined based on the following oracle procedure, which leverages perfect knowledge of $\bm{M}$ and $\widehat{\bm{M}}$ to construct a sampling process that over-represents portions of the matrix for which the point estimate is less accurate. This gives rise to a particularly challenging experimental setting. For each entry $(r,c)\in[n_{r}]\times[n_{c}]$ , define the latent feature vector $\bm{y}_{r,c}\coloneqq(\widehat{\bm{U}}_{r\circ},\widehat{\bm{V}}_{c\circ})\in\mathbb{R}^{2l}$ , where $\widehat{\bm{U}}_{r\circ}$ and $\widehat{\bm{V}}_{c\circ}$ are the $r$ -th row of $\widehat{\bm{U}}$ and the $c$ -th row of $\widehat{\bm{V}}$ , respectively. Let also $\mathcal{D}_{\mathrm{wse}}\subset[n_{r}]\times[n_{c}]$ denote a subset containing $25\%$ of the matrix indices in $\mathcal{D}_{\mathrm{miss}}$ , chosen uniformly at random.

The values of $\bm{M}$ and $\widehat{\bm{M}}$ indexed by $\mathcal{D}_{\mathrm{wse}}$ are utilized by the oracle to construct $\bm{w}^{*}$ with an approach inspired by \citetcauchois2020knowing and \citetromano2020classification. For any fixed $\delta\in(0,1]$ , define the worst-slab estimation error,

\displaystyle\mathrm{WSE}(\widehat{\bm{M}};\delta,\mathcal{D}_{\mathrm{wse}})=\sup\limits_{\bm{v}\in\mathbb{R}^{2l},a<b\in\mathbb{R}}\left\{\frac{\sum_{(r,c)\in S_{\bm{v},a,b}}\lvert\widehat{M}_{r,c}-M_{r,c}\rvert}{\lvert S_{\bm{v},a,b}\rvert}\quad\mathrm{s.t.}\quad\frac{\lvert S_{\bm{v},a,b}\rvert}{\lvert\mathcal{D}_{\mathrm{wse}}\rvert}\geq\delta\right\},

(A52)

where, for any $\bm{v}\in\mathbb{R}^{2l}$ and $a<b\in\mathbb{R}$ , the subset $S_{\bm{v},a,b}\subset\mathcal{D}_{\mathrm{wse}}$ is defined as

\displaystyle S_{\bm{v},a,b}=\{(r,c)\in\mathcal{D}_{\mathrm{wse}}:a\leq\bm{v}^{\top}\bm{y}_{r,c}\leq b\}.

(A53)

Intuitively, $S_{\bm{v},a,b}$ is a subset (or slab) of the matrix entries in $\mathcal{D}_{\mathrm{wse}}$ characterized by a direction $\bm{v}$ in the latent feature space and two scalar thresholds $a<b$ . Accordingly, $\mathrm{WSE}(\widehat{\bm{M}};\delta,\mathcal{D}_{\mathrm{wse}})$ is the average absolute residual between $\bm{M}$ and $\widehat{\bm{M}}$ evaluated for the entries within $S_{\bm{v},a,b}$ , after selecting the worst-case subset $S_{\bm{v},a,b}$ containing at least a fraction $\delta$ of the observations within $\mathcal{D}_{\mathrm{wse}}$ .

In practice, the optimal (worst-case) choice of $\bm{v}$ in (A52) is approximated by fitting an ordinary least squares regression model to predict the absolute residuals $\{\lvert\widehat{M}_{r,c}-M_{r,c}\rvert\}_{(r,c)\in\mathcal{D}_{\mathrm{wse}}}$ as a linear function of the latent features $\{\bm{y}_{r,c}\}_{(r,c)\in\mathcal{D}_{\mathrm{wse}}}$ . Then, the corresponding optimal values of $a^{*},b^{*}$ in (A52) are approximated through a grid search, for a fixed value of the parameter $\delta$ .

Finally, the test sampling weights $\bm{w}^{*}=\left\{w^{*}_{r,c}\right\}_{(r,c)\in[n_{r}]\times[n_{c}]}$ are given by

\displaystyle w^{*}_{r,c}=\begin{cases}\cfrac{\mathrm{normpdf}(\bm{v}^{*\top}\bm{y}_{r,c},a^{*},\sigma^{2})}{\mathrm{normpdf}(a^{*},a^{*},\sigma^{2})},&\bm{v}^{*\top}\bm{y}_{r,c}<a^{*},\\ 1,&(r,c)\in S_{\bm{v}^{*},a^{*},b^{*}},\\ \cfrac{\mathrm{normpdf}(\bm{v}^{*\top}\bm{y}_{r,c},b^{*},\sigma^{2})}{\mathrm{normpdf}(b^{*},b^{*},\sigma^{2})},&\bm{v}^{*\top}\bm{y}_{r,c}>b^{*},\end{cases}

(A54)

where $\mathrm{normpdf}(\cdot,a,\sigma^{2})$ denotes the density function of the Gaussian distribution with mean $a$ and variance $\sigma^{2}$ . This density function is introduced for smoothing purposes, setting $\sigma=(b^{*}-a^{*})/5$ . These sampling weights enable us to select test groups from indices that predominantly fall within the worst-slab region for which $\widehat{\bm{M}}$ estimates $\bm{M}$ least accurately. Intuitively, attaining valid coverage for this portion of the matrix should be especially challenging.
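The ratios of Gaussian densities in (A54) simplify to $\exp\{-(t-m)^{2}/(2\sigma^{2})\}$ , where $t=\bm{v}^{*\top}\bm{y}_{r,c}$ and $m$ is the nearest slab boundary, so the weights can be evaluated directly from the projections. The sketch below illustrates this; the function name is hypothetical, the projections are assumed to be pre-computed, and every entry whose projection falls in $[a^{*},b^{*}]$ is assigned weight one, which is an assumption for entries outside $\mathcal{D}_{\mathrm{wse}}$ .

```python
import numpy as np

def slab_sampling_weights(proj, a_star, b_star):
    """Test sampling weights in the spirit of (A54).

    `proj` contains the projections v*^T y_{r,c}; entries inside [a*, b*]
    receive weight 1, and the weight decays like a Gaussian kernel with
    standard deviation sigma = (b* - a*) / 5 outside the slab.
    """
    proj = np.asarray(proj, float)
    sigma = (b_star - a_star) / 5.0
    # Distance to the slab: zero inside [a*, b*], positive outside.
    dist = np.where(
        proj < a_star, a_star - proj, np.where(proj > b_star, proj - b_star, 0.0)
    )
    # normpdf(t, m, s^2) / normpdf(m, m, s^2) = exp(-(t - m)^2 / (2 s^2))
    return np.exp(-0.5 * (dist / sigma) ** 2)
```

The resulting weights lie in $(0,1]$ and are continuous in the projection, which is the smoothing effect mentioned above.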

A3.2 Investigation of the Coverage Upper Bound

In this section, we investigate in more detail the upper coverage bound for our method established by Theorem 2, which is equal to $1-\alpha+\mathbb{E}[\max_{i\in[n+1]}p_{i}(\mathbf{X}^{\mathrm{cal}}_{1},\dots,\mathbf{X}^{\mathrm{cal}}_{n},\mathbf{X}^{*})]$ . Ideally, a small expected value in this expression would guarantee that our conformal inferences are not too conservative. However, given that it would be infeasible to evaluate this expected value analytically, we rely on a Monte Carlo numerical study.

We begin by focusing on groups of size $K=2$ and consider for simplicity matrices with an equal number of rows and columns; i.e., $n_{r}=n_{c}=200$ . We simulate the observation process by sampling $n_{\mathrm{obs}}=0.2\cdot n_{r}n_{c}$ matrix entries without replacement according to the model defined in (1), with

\displaystyle w_{r,c}=(n_{r}(c-1)+r)^{s},\qquad\forall r\in[n_{r}],\;c\in[n_{c}],

with a scaling parameter $s=2$ . Note that this is the same choice of sampling weights utilized in the experiments of Section 5.1.2. For simplicity, the test group $\bm{X}^{*}$ is sampled from the model defined in (3) using weights $\bm{w}^{*}$ exhibiting the same patterns as $\bm{w}$ . Then, the conformalization weights $p_{i}$ for all $i\in[n+1]$ are computed by applying Algorithm 1, varying the number $n$ of calibration groups as a control parameter. Finally, we estimate $\mathbb{E}[\max_{i\in[n+1]}p_{i}(\mathbf{X}^{\mathrm{cal}}_{1},\dots,\mathbf{X}^{\mathrm{cal}}_{n},\mathbf{X}^{*})]$ by taking the empirical average of $\max_{i\in[n+1]}p_{i}(\mathbf{X}^{\mathrm{cal}}_{1},\dots,\mathbf{X}^{\mathrm{cal}}_{n},\mathbf{X}^{*})$ over 10 independent experiments.

Figure A9 [left] reports on the results of these experiments as a function of $n$ . The results show that our coverage upper bound approaches $1-\alpha$ roughly at rate $1/n$ , as one would generally expect in the case of standard conformal inferences based on exchangeable data \citepvovk2005algorithmic. This is consistent with our empirical observations that Algorithm 1 is typically not too conservative in practice. Figure A9 [right] reports on the results of additional experiments in which the group size $K$ is varied, while keeping the number of calibration groups fixed to $n=400$ . The results show that the coverage upper bound tends to become more conservative as the group size increases, reflecting the intrinsic higher difficulty of producing valid simultaneous conformal inferences for larger groups.

A3.3 Additional Results for Section 5.1.1

A3.4 Additional Results for Section 5.1.2

A3.5 Additional Results for Section 5.2

Appendix A4 Mathematical Proofs

A4.1 A General Quantile Inflation Lemma

Proof of Lemma 1.

The proof follows the same strategy as that of \citettibshirani-covariate-shift-2019. Let $E_{z}$ denote the event that $\{Z_{1},\ldots,Z_{n+1}\}=\{z_{1},\ldots,z_{n+1}\}$ , for some possible realization $z=(z_{1},\ldots,z_{n+1})$ of $Z_{1},\ldots,Z_{n+1}$ , and let $v_{i}={\mathcal{S}}(z_{i},z_{-i})$ for all $i\in[n+1]$ . By the definition of conditional probability, for each $i\in[n+1]$ ,

\mathbb{P}\{V_{n+1}=v_{i}\mid E_{z}\}=\mathbb{P}\{Z_{n+1}=z_{i}\mid E_{z}\}=\frac{\sum_{\sigma:\sigma(n+1)=i}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}{\sum_{\sigma}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}=p^{f}_{i}(z_{1},\ldots,z_{n+1}),

where $\sigma$ is a permutation of $[n+1]$ . In other words,

V_{n+1}\mid E_{z}\sim\sum_{i=1}^{n+1}p^{f}_{i}(z_{1},\ldots,z_{n+1})\delta_{v_{i}},

where $\delta_{v_{i}}$ denotes a point mass at $v_{i}$ . This implies that

\mathbb{P}\bigg{\{}V_{n+1}\leq\mathrm{Quantile}\bigg{(}\beta;\,\sum_{i=1}^{n+1}p^{f}_{i}(z_{1},\ldots,z_{n+1})\delta_{v_{i}}\bigg{)}\mid E_{z}\bigg{\}}\geq\beta,

which is equivalent to

\mathbb{P}\bigg{\{}V_{n+1}\leq\mathrm{Quantile}\bigg{(}\beta;\,\sum_{i=1}^{n+1}p^{f}_{i}(Z_{1},\ldots,Z_{n+1})\delta_{V_{i}}\bigg{)}\mid E_{z}\bigg{\}}\geq\beta.

Finally, marginalizing over $E_{z}$ leads to

\mathbb{P}\bigg{\{}V_{n+1}\leq\mathrm{Quantile}\bigg{(}\beta;\,\sum_{i=1}^{n+1}p^{f}_{i}(Z_{1},\ldots,Z_{n+1})\delta_{V_{i}}\bigg{)}\bigg{\}}\geq\beta.

This is equivalent to the desired result because, by Lemma A5,

\displaystyle V_{n+1}\leq\mathrm{Quantile}\bigg{(}\beta;\,\sum_{i=1}^{n}p^{f}_{i}(Z_{1},\ldots,Z_{n+1})\delta_{V_{i}}+p^{f}_{n+1}(Z_{1},\ldots,Z_{n+1})\delta_{V_{n+1}}\bigg{)}

\displaystyle\qquad\Longleftrightarrow

\displaystyle V_{n+1}\leq\mathrm{Quantile}\bigg{(}\beta;\,\sum_{i=1}^{n}p^{f}_{i}(Z_{1},\ldots,Z_{n+1})\delta_{V_{i}}+p^{f}_{n+1}(Z_{1},\ldots,Z_{n+1})\delta_{\infty}\bigg{)}.

∎

Lemma A5 (also appearing implicitly in \citettibshirani-covariate-shift-2019).

Consider $n+1$ random variables $V_{1},\ldots,V_{n+1}$ and some weights $p_{1},\ldots,p_{n+1}$ such that $p_{i}>0$ and $\sum_{i=1}^{n+1}p_{i}=1$ . Then, for any $\beta\in(0,1)$ ,

\displaystyle V_{n+1}\leq Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}\iff V_{n+1}\leq Q\big{(}\beta;\sum_{i=1}^{n}p_{i}\delta_{V_{i}}+p_{n+1}\delta_{\infty}\big{)}.

Proof of Lemma A5.

This result was previously utilized by \citettibshirani-covariate-shift-2019 and a proof is included here for completeness. It is straightforward to establish one direction of the result, namely

\displaystyle V_{n+1}\leq Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}\Longrightarrow V_{n+1}\leq Q\big{(}\beta;\sum_{i=1}^{n}p_{i}\delta_{V_{i}}+p_{n+1}\delta_{\infty}\big{)},

because, almost surely, $V_{n+1}\leq\infty$ , and hence

\displaystyle Q\big{(}\beta;\sum_{i=1}^{n}p_{i}\delta_{V_{i}}+p_{n+1}\delta_{% \infty}\big{)}\geq Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}.

To prove the other direction, suppose $V_{n+1}>Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}$ . By definition of the quantile function, we can write without loss of generality that $Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}=V_{(j)}$ , where $j\in[n+1]$ is defined such that

\displaystyle p_{(1)}+\ldots+p_{(j)}\geq\beta,

\displaystyle p_{(1)}+\ldots+p_{(j-1)}<\beta,

where $V_{(1)}\leq\ldots\leq V_{(n+1)}$ are the order statistics of $V_{1},\ldots,V_{n+1}$ and $p_{(1)},\ldots,p_{(n+1)}$ are the corresponding weights. Therefore, $V_{n+1}>V_{(j)}$, and re-assigning $V_{n+1}\to\infty$ does not change $V_{(j)}$. This means that $Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}=Q\big{(}\beta;\sum_{i=1}^{n}p_{i}\delta_{V_{i}}+p_{n+1}\delta_{\infty}\big{)}$, leading to $V_{n+1}>Q\big{(}\beta;\sum_{i=1}^{n}p_{i}\delta_{V_{i}}+p_{n+1}\delta_{\infty}\big{)}$. Thus, we have shown that

\displaystyle V_{n+1}>Q\big{(}\beta;\sum_{i=1}^{n+1}p_{i}\delta_{V_{i}}\big{)}% \Longrightarrow V_{n+1}>Q\big{(}\beta;\sum_{i=1}^{n}p_{i}\delta_{V_{i}}+p_{n+1% }\delta_{\infty}\big{)}.

∎
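As a numerical illustration of Lemma A5, the following Python sketch compares the two events on randomly generated inputs. The helper names `weighted_quantile` and `lemma_a5_events` are hypothetical and introduced only for this example; the quantile is assumed to follow the usual definition $Q(\beta;\sum_{i}p_{i}\delta_{v_{i}})=\inf\{v:\sum_{i:v_{i}\leq v}p_{i}\geq\beta\}$. Exact rational weights are used so that floating-point round-off cannot affect the comparison.

```python
import math
import random
from fractions import Fraction

def weighted_quantile(beta, values, weights):
    """Smallest v such that the total weight on {values <= v} is at least beta."""
    cumulative = Fraction(0)
    pairs = sorted(zip(values, weights))
    for v, p in pairs:
        cumulative += p
        if cumulative >= beta:
            return v
    return pairs[-1][0]  # unreachable when the weights sum to one

def lemma_a5_events(beta, values, weights):
    """Evaluate both sides of Lemma A5; the last entry plays the role of V_{n+1}."""
    v_test = values[-1]
    q_with_test = weighted_quantile(beta, values, weights)
    q_with_inf = weighted_quantile(beta, values[:-1] + [math.inf], weights)
    return v_test <= q_with_test, v_test <= q_with_inf

random.seed(0)
agreements = []
for _ in range(2000):
    n = random.randint(1, 8)
    values = [random.random() for _ in range(n + 1)]
    raw = [random.randint(1, 20) for _ in range(n + 1)]
    weights = [Fraction(r, sum(raw)) for r in raw]  # positive, sum to one
    beta = Fraction(random.randint(1, 99), 100)
    left, right = lemma_a5_events(beta, values, weights)
    agreements.append(left == right)
```

In this simulation the two events coincide in every trial, as the lemma predicts; this is only a sanity check and does not replace the proof.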

Proof of Lemma 2.

Let $E_{z}$ denote the event that $\{Z_{1},\ldots,Z_{n+1}\}=\{z_{1},\ldots,z_{n+1}\}$ , for some possible realization $z=(z_{1},\ldots,z_{n+1})$ of $Z_{1},\ldots,Z_{n+1}$ , and let $v_{i}={\mathcal{S}}(z_{i},z_{-i})$ for all $i\in[n+1]$ . As in the proof of Lemma 1, for each $i\in[n+1]$ ,

\mathbb{P}\{V_{n+1}=v_{i}\mid E_{z}\}=\mathbb{P}\{Z_{n+1}=z_{i}\mid E_{z}\}=% \frac{\sum_{\sigma:\sigma(n+1)=i}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}{\sum% _{\sigma}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}.

Further, because $Z_{1},\ldots,Z_{n+1}$ are also leave-one-out exchangeable,

	$\displaystyle\frac{\sum_{\sigma:\sigma(n+1)=i}f(z_{\sigma(1)},\ldots,z_{\sigma% (n+1)})}{\sum_{\sigma}f(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}$	$\displaystyle=\frac{\sum_{\sigma:\sigma(n+1)=i}g(z_{\sigma(1)},\ldots,z_{% \sigma(n+1)})\cdot h(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}{\sum_{\sigma}g(z_{% \sigma(1)},\ldots,z_{\sigma(n+1)})\cdot h(z_{\sigma(1)},\ldots,z_{\sigma(n+1)})}$
		$\displaystyle=\frac{\sum_{\sigma:\sigma(n+1)=i}g(z_{1},\ldots,z_{n+1})\cdot% \bar{h}(z_{-i},z_{i})}{\sum_{\sigma}g(z_{1},\ldots,z_{n+1})\cdot\bar{h}(z_{-% \sigma(n+1)},z_{\sigma(n+1)})}$
		$\displaystyle=\frac{\sum_{\sigma:\sigma(n+1)=i}\bar{h}(z_{-i},z_{i})}{\sum_{j=% 1}^{n+1}\sum_{\sigma:\sigma(n+1)=j}\bar{h}(z_{-j},z_{j})}$
		$\displaystyle=\frac{n!\bar{h}(z_{-i},z_{i})}{n!\sum_{j=1}^{n+1}\bar{h}(z_{-j},% z_{j})}=p_{i}(z_{1},\ldots,z_{n+1}),$

which implies $V_{n+1}\mid E_{z}\sim\sum_{i=1}^{n+1}p_{i}(z_{1},\ldots,z_{n+1})\delta_{v_{i}}.$ The rest of the proof then follows with the same approach as the proof of Lemma 1. ∎
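The reduction from the sum over permutations to the normalized weights $p_{i}$ can be checked by brute force on a small example. The Python sketch below assumes a density of the form $f=g\cdot\bar{h}$, with $g$ permutation-invariant (hence a constant on $E_{z}$) and $\bar{h}$ depending only on which point occupies position $n+1$; the function names and the numerical values are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def conformal_weights(h_bar_values):
    """Closed form: p_i is h_bar(z_{-i}, z_i) divided by the sum over all i."""
    total = sum(h_bar_values)
    return [h / total for h in h_bar_values]

def brute_force_weights(g_value, h_bar_values):
    """Direct evaluation of the ratio of sums over permutations sigma."""
    m = len(h_bar_values)
    numerators = [Fraction(0)] * m
    for sigma in permutations(range(m)):
        last = sigma[-1]  # index of the point placed in position n+1
        numerators[last] += Fraction(g_value) * h_bar_values[last]
    denominator = sum(numerators)
    return [x / denominator for x in numerators]

# Hypothetical values of h_bar(z_{-i}, z_i) for n + 1 = 4 points.
h_bar = [Fraction(3), Fraction(1), Fraction(4), Fraction(2)]
p_closed = conformal_weights(h_bar)
p_brute = brute_force_weights(g_value=7, h_bar_values=h_bar)
```

Both computations give the same weights, and the value of $g$ cancels, consistent with the last two equalities in the display above.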

A4.2 Conformal Inference with Structured Calibration

Proof of Proposition 1.

This result is a direct consequence of Proposition A4, which characterizes the joint distribution of $(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*})$ conditional on $\mathcal{D}_{\mathrm{prune}}$ and $\mathcal{D}_{\mathrm{train}}$ . It is easy to see from (A56) that this distribution is invariant to permutations of $\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}$ . ∎

Proposition A4.

Consider the same setting as in Proposition 1. Let $D_{1}$ and $D_{0}\subseteq D_{1}$ denote arbitrary realizations of $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{prune}}$, respectively. Let $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{n},\bm{x}_{n+1}$ be any sequence of $n+1$ $K$-groups involving elements of $\mathcal{D}_{\mathrm{obs}}$ such that

\displaystyle\mathbb{P}\left[\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\bm{X}^{% \mathrm{cal}}_{2}=\bm{x}_{2},\ldots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\bm{X% }^{*}=\bm{x}_{n+1}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{% train}}=D_{1}\right]>0.

Define also $D_{2}=\cup_{i\in[n],k\in[K]}\{x^{k}_{i}\}$ , the unordered collection of matrix entries indexed by the groups $\bm{x}_{1},\ldots,\bm{x}_{n}$ , noting that $\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}$ . Further, define

\displaystyle\begin{split}\bar{\mathcal{D}}_{\mathrm{miss}}^{c}&\coloneqq\{(r^% {\prime},c^{\prime})\in\bar{\mathcal{D}}_{\mathrm{miss}}\mid c^{\prime}=c\},% \qquad\forall c\in[n_{c}],\\ \bar{n}_{\mathrm{obs}}&\coloneqq\left\lvert\mathcal{D}_{\mathrm{obs}}\setminus% \mathcal{D}_{\mathrm{prune}}\right\rvert,\\ \bar{n}^{c}_{\mathrm{obs}}&\coloneqq\lvert\{(r^{\prime},c^{\prime})\in\mathcal% {D}_{\mathrm{obs}}\setminus\mathcal{D}_{\mathrm{prune}}:c^{\prime}=c\}\rvert,% \qquad\forall c\in[n_{c}],\\ N_{n}^{c}&\coloneqq|\{i\in[n]:c=x_{i,1,2}=x_{i,2,2}=\ldots=x_{i,K,2}\}|,\qquad% \forall c\in[n_{c}].\end{split}

(A55)

Intuitively, $\bar{\mathcal{D}}_{\mathrm{miss}}^{c}$ represents the pruned missing indices in column $c$ , $\bar{n}^{c}_{\mathrm{obs}}$ is the number of indices in $\mathcal{D}_{\mathrm{obs}}\setminus\mathcal{D}_{\mathrm{prune}}$ corresponding to entries in column $c$ , while $N_{n}^{c}$ is the number of calibration groups in column $c$ . Note that all quantities in (A55) are uniquely determined by $D_{0}$ and $D_{2}$ . Then,

\displaystyle\begin{split}&\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1% },\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\bm{X}^{*}=\bm{x}_{n+1}\mid% \mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)\\ &=\left[\frac{w^{*}_{x_{n+1,1}}}{\sum_{(r,c)\in\bar{\mathcal{D}}_{\mathrm{miss% }}}w^{*}_{r,c}}\cdot\prod_{k=2}^{K}\frac{w^{*}_{x_{n+1,k}}}{\sum_{(r,c)\in\bar% {\mathcal{D}}_{\mathrm{miss}}}w^{*}_{r,c}\mathbb{I}\left[c=x_{n+1,1,2}\right]-% \sum_{k^{\prime}=1}^{k-1}w^{*}_{x_{n+1,k^{\prime}}}}\right]\\ &\quad\cdot\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)% \cdot\left[\frac{1}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},% \mathcal{D}_{\mathrm{train}}=D_{1}\right)}\cdot\prod_{c\in[n_{c}]}\frac{1}{% \mbinom{n^{c}_{\mathrm{obs}}}{\bar{n}^{c}_{\mathrm{obs}}}}\right]\\ &\quad\cdot\left(\prod_{i=1}^{n}\frac{1}{\bar{n}_{\mathrm{obs}}-K(i-1)}\right)% \cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^{c}}\prod_{k=1}^{K-1}\frac{1}{\bar{% n}^{c}_{\mathrm{obs}}-K(j-1)-k}.\end{split}

(A56)

Proof of Proposition A4.

First, note that

\displaystyle\begin{split}&\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1% },\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\bm{X}^{*}=\bm{x}_{n+1}\mid% \mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)\\ &=\mathbb{P}\left(\bm{X}^{*}=\bm{x}_{n+1}\mid\mathcal{D}_{\mathrm{prune}}=D_{0% },\mathcal{D}_{\mathrm{train}}=D_{1},\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},% \dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n}\right)\\ &\quad\cdot\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{% \mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}% _{\mathrm{train}}=D_{1}\right)\\ &=\mathbb{P}\left(\bm{X}^{*}=\bm{x}_{n+1}\mid\mathcal{D}_{\mathrm{obs}}=D_{1}% \cup D_{2}\right)\\ &\quad\cdot\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{% \mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}% _{\mathrm{train}}=D_{1}\right)\\ &=\left[\frac{w^{*}_{x_{n+1,1}}}{\sum_{(r,c)\in\bar{\mathcal{D}}_{\mathrm{miss% }}}w^{*}_{r,c}}\cdot\prod_{k=2}^{K}\frac{w^{*}_{x_{n+1,k}}}{\sum_{(r,c)\in\bar% {\mathcal{D}}_{\mathrm{miss}}}w^{*}_{r,c}\mathbb{I}\left[c=x_{n+1,1,2}\right]-% \sum_{k^{\prime}=1}^{k-1}w^{*}_{x_{n+1,k^{\prime}}}}\right]\\ &\quad\cdot\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{% \mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}% _{\mathrm{train}}=D_{1}\right),\end{split}

(A57)

where the first term on the right-hand-side above was written explicitly using the sequential sampling characterization of $\Psi^{\text{col}}$ in (4).

Next, we focus on the second term on the right-hand-side of (A57):

\displaystyle\begin{split}&\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)\\
&=\frac{\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}\\
&=\frac{\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}\\
&=\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\mathcal{D}_{\mathrm{train}}=D_{1}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\\
&\phantom{=}\cdot\frac{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}\\
&=\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\\
&\phantom{=}\cdot\mathbb{P}\left(\mathcal{D}_{\mathrm{train}}=D_{1}\mid\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n},\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\\
&\phantom{=}\cdot\frac{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}\\
&=\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\\
&\phantom{=}\cdot\frac{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)},\end{split}

(A58)

where the last equality above follows from the fact that $\mathcal{D}_{\mathrm{train}}$ is uniquely determined by $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n}$, $\mathcal{D}_{\mathrm{prune}}$ and $\mathcal{D}_{\mathrm{obs}}$, so the corresponding conditional probability equals one.

The first term on the right-hand-side of (A58) is given by Lemma A6:

\displaystyle\begin{split}&\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1% },\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}},% \mathcal{D}_{\mathrm{obs}}\right)\\ &\quad=\left(\prod_{i=1}^{n}\frac{1}{\bar{n}_{\mathrm{obs}}-K(i-1)}\right)% \cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^{c}}\prod_{k=1}^{K-1}\frac{1}{\bar{% n}^{c}_{\mathrm{obs}}-K(j-1)-k}.\end{split}

(A59)

Note that (A59) implies that, conditional on $\mathcal{D}_{\mathrm{obs}}$ and $\mathcal{D}_{\mathrm{prune}}$ , the distribution of $\bm{X}^{\mathrm{cal}}_{1},\ldots,\bm{X}^{\mathrm{cal}}_{n}$ does not depend on the order of these calibration groups.

Next, we focus on the second term on the right-hand-side of (A58), namely

\displaystyle\frac{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal% {D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)}{\mathbb{P}\left(\mathcal{D}_{% \mathrm{prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}.

(A60)

The numerator of (A60) is

\displaystyle\begin{split}&\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0},% \mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\\ &=\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\cdot% \mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{0}\mid\mathcal{D}_{\mathrm{obs% }}=D_{1}\cup D_{2}\right)\\ &=\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\cdot\prod_% {c\in[n_{c}]}\frac{1}{\mbinom{n^{c}_{\mathrm{obs}}}{m^{c}}}\\ &=\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_{2}\right)\cdot\prod_% {c\in[n_{c}]}\frac{1}{\mbinom{n^{c}_{\mathrm{obs}}}{\bar{n}^{c}_{\mathrm{obs}}% }}.\end{split}

(A61)

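where $m^{c}:=n^{c}_{\mathrm{obs}}\bmod K$ denotes the remainder of the integer division $n^{c}_{\mathrm{obs}}/K$, and $\bar{n}^{c}_{\mathrm{obs}}=K\lfloor n^{c}_{\mathrm{obs}}/K\rfloor=n^{c}_{\mathrm{obs}}-m^{c}$ is the number of observations remaining in column $c$ after pruning. The last equality in (A61) uses the symmetry $\mbinom{n^{c}_{\mathrm{obs}}}{m^{c}}=\mbinom{n^{c}_{\mathrm{obs}}}{n^{c}_{\mathrm{obs}}-m^{c}}$. The denominator of (A60) does not need to be simplified because it only depends on $D_{0}$ and $D_{1}$.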
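The product of inverse binomial coefficients in (A61) is simple to evaluate. The following Python sketch assumes, as in the display above, that $m^{c}$ observations are pruned uniformly at random and independently within each column; the function name `pruning_probability` and the example counts are hypothetical.

```python
from fractions import Fraction
from math import comb

def pruning_probability(n_obs_by_column, K):
    """Probability of one specific pruned set, given the observed counts per column."""
    prob = Fraction(1)
    for n_obs in n_obs_by_column:
        m = n_obs % K      # number of entries pruned from this column
        n_kept = n_obs - m  # number of entries kept in this column
        # Choosing which m entries to prune is the same as choosing which to keep.
        assert comb(n_obs, m) == comb(n_obs, n_kept)
        prob *= Fraction(1, comb(n_obs, m))
    return prob

# Three columns with 7, 4 and 9 observations, and groups of size K = 3.
example = pruning_probability([7, 4, 9], K=3)
```

Here the three columns contribute factors $1/7$, $1/4$ and $1$, respectively.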

Finally, combining (A57), (A58), (A59), (A60), and (A61), we arrive at:

	$\displaystyle\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}% ^{\mathrm{cal}}_{n}=\bm{x}_{n},\bm{X}^{*}=\bm{x}_{n+1}\mid\mathcal{D}_{\mathrm% {prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)$
	$\displaystyle=\left[\frac{w^{*}_{x_{n+1,1}}}{\sum_{(r,c)\in\bar{\mathcal{D}}_{\mathrm{miss}}}w^{*}_{r,c}}\cdot\prod_{k=2}^{K}\frac{w^{*}_{x_{n+1,k}}}{\sum_{(r,c)\in\bar{\mathcal{D}}_{\mathrm{miss}}}w^{*}_{r,c}\mathbb{I}\left[c=x_{n+1,1,2}\right]-\sum_{k^{\prime}=1}^{k-1}w^{*}_{x_{n+1,k^{\prime}}}}\right]$
	$\displaystyle\quad\cdot\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup D_% {2}\right)\cdot\left[\frac{1}{\mathbb{P}\left(\mathcal{D}_{\mathrm{prune}}=D_{% 0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)}\cdot\prod_{c\in[n_{c}]}\frac{1}{% \mbinom{n^{c}_{\mathrm{obs}}}{\bar{n}^{c}_{\mathrm{obs}}}}\right]$
	$\displaystyle\quad\cdot\left(\prod_{i=1}^{n}\frac{1}{\bar{n}_{\mathrm{obs}}-K(% i-1)}\right)\cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^{c}}\prod_{k=1}^{K-1}% \frac{1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}.$

∎

Lemma A6.

Under the same setup as in Proposition A4,

\displaystyle\begin{split}&\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1% },\dots,\bm{X}^{\mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}},% \mathcal{D}_{\mathrm{obs}}\right)\\ &\quad=\left(\prod_{i=1}^{n}\frac{1}{\bar{n}_{\mathrm{obs}}-K(i-1)}\right)% \cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^{c}}\prod_{k=1}^{K-1}\frac{1}{\bar{% n}^{c}_{\mathrm{obs}}-K(j-1)-k}.\end{split}

(A62)

Proof of Lemma A6.

We prove this result by induction on the number of calibration groups, $n$ . For ease of notation, we will denote the column of the $i$ -th calibration group as $c_{i}$ , for any $i\in[n]$ ; that is, $c_{i}=x_{i,k,2}$ for all $k\in[K]$ . In the base case where $n=2$ ,

	$\displaystyle\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\bm{X}^{% \mathrm{cal}}_{2}=\bm{x}_{2}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{D}% _{\mathrm{obs}}=D_{1}\right)$
	$\displaystyle\quad=\frac{1}{\bar{n}_{\mathrm{obs}}}\cdot\frac{1}{\bar{n}^{c_{1% }}_{\mathrm{obs}}-1}\cdot\ldots\cdot\frac{1}{\bar{n}^{c_{1}}_{\mathrm{obs}}-K+% 1}\cdot\frac{1}{\bar{n}_{\mathrm{obs}}-K}$
	$\displaystyle\quad\quad\cdot\left[\left(\frac{1}{\bar{n}^{c_{2}}_{\mathrm{obs}% }-1}\cdot\ldots\cdot\frac{1}{\bar{n}^{c_{2}}_{\mathrm{obs}}-K+1}\right)% \mathbbm{1}\left\{c_{1}\neq c_{2}\right\}+\left(\frac{1}{\bar{n}^{c_{2}}_{% \mathrm{obs}}-K-1}\cdot\ldots\cdot\frac{1}{\bar{n}^{c_{2}}_{\mathrm{obs}}-2K+1% }\right)\mathbbm{1}\left\{c_{1}=c_{2}\right\}\right]$
	$\displaystyle\quad=\left[\prod_{i=1}^{2}\frac{1}{\bar{n}_{\mathrm{obs}}-K(i-1)% }\right]\cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^{c}}\prod_{k=1}^{K-1}\frac{% 1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}.$

Now, for the induction step, suppose Equation (A62) holds for $n-1$ . Then,

	$\displaystyle\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}% ^{\mathrm{cal}}_{n}=\bm{x}_{n}\mid\mathcal{D}_{\mathrm{prune}}=D_{0},\mathcal{% D}_{\mathrm{obs}}=D_{1}\right)$
	$\displaystyle\quad=\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,% \bm{X}^{\mathrm{cal}}_{n-1}=\bm{x}_{n-1}\mid\mathcal{D}_{\mathrm{prune}}=D_{0}% ,\mathcal{D}_{\mathrm{obs}}=D_{1}\right)$
	$\displaystyle\quad\quad\cdot\frac{1}{\bar{n}_{\mathrm{obs}}-K(n-1)}\cdot\frac{1}{\bar{n}^{c_{n}}_{\mathrm{obs}}-K(N_{n}^{c_{n}}-1)-1}\cdot\ldots\cdot\frac{1}{\bar{n}^{c_{n}}_{\mathrm{obs}}-KN_{n}^{c_{n}}+1}$
	$\displaystyle\quad=\left[\prod_{i=1}^{n-1}\frac{1}{\bar{n}_{\mathrm{obs}}-K(i-% 1)}\right]\cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n-1}^{c}}\prod_{k=1}^{K-1}% \frac{1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}$
	$\displaystyle\quad\quad\cdot\frac{1}{\bar{n}_{\mathrm{obs}}-K(n-1)}\cdot\prod_% {k=1}^{K-1}\frac{1}{{\bar{n}^{c_{n}}_{\mathrm{obs}}-K(N_{n}^{c_{n}}-1)-k}}$
	$\displaystyle\quad=\left[\prod_{i=1}^{n}\frac{1}{\bar{n}_{\mathrm{obs}}-K(i-1)% }\right]\cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^{c}}\prod_{k=1}^{K-1}\frac{% 1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}.$

where the last equality above follows because $N_{n-1}^{c}=N_{n}^{c}$ for all $c\neq c_{n}$ , while $N_{n}^{c_{n}}=N_{n-1}^{c_{n}}+1$ . ∎
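The closed form in (A62) and its invariance to the ordering of the calibration groups can be illustrated numerically. The Python sketch below multiplies the sequential sampling probabilities group by group, following the induction step above, and compares the result with the right-hand side of (A62). It only checks the algebra, not the sampling model itself; all function names and counts are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def sequential_probability(columns, nbar_total, nbar_by_column, K):
    """Multiply the sampling probabilities of the groups in the given order."""
    prob = Fraction(1)
    used = Counter()  # number of groups already drawn from each column
    for i, c in enumerate(columns):
        # First entry: uniform over the pruned observations not yet used.
        prob *= Fraction(1, nbar_total - K * i)
        # Other K - 1 entries: uniform within column c, without replacement.
        for k in range(1, K):
            prob *= Fraction(1, nbar_by_column[c] - K * used[c] - k)
        used[c] += 1
    return prob

def closed_form_probability(columns, nbar_total, nbar_by_column, K):
    """Right-hand side of (A62)."""
    prob = Fraction(1)
    for i in range(1, len(columns) + 1):
        prob *= Fraction(1, nbar_total - K * (i - 1))
    for c, count in Counter(columns).items():  # count plays the role of N_n^c
        for j in range(1, count + 1):
            for k in range(1, K):
                prob *= Fraction(1, nbar_by_column[c] - K * (j - 1) - k)
    return prob

K = 2
nbar_by_column = {0: 6, 1: 4, 2: 2}  # pruned observation counts per column
nbar_total = sum(nbar_by_column.values())
columns = [0, 1, 0, 2]               # column of each of n = 4 calibration groups
reference = closed_form_probability(columns, nbar_total, nbar_by_column, K)
all_orders_agree = all(
    sequential_probability(list(order), nbar_total, nbar_by_column, K) == reference
    for order in set(permutations(columns))
)
```

Every ordering of the four groups yields the same probability, in agreement with the permutation invariance used in the proof of Proposition 1.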

A4.3 Characterization of the Conformalization Weights

Proof of Lemma 3.

Recall from Proposition A4 that

	$\displaystyle\mathbb{P}\left(\bm{X}^{\mathrm{cal}}_{1}=\bm{x}_{1},\dots,\bm{X}% ^{\mathrm{cal}}_{n}=\bm{x}_{n},\bm{X}^{*}=\bm{x}_{n+1}\mid\mathcal{D}_{\mathrm% {prune}}=D_{0},\mathcal{D}_{\mathrm{train}}=D_{1}\right)$
	$\displaystyle\quad=g(\{\bm{x}_{1},\ldots,\bm{x}_{n+1}\})\cdot\bar{h}(\{\bm{x}_% {1},\ldots,\bm{x}_{n}\},\bm{x}_{n+1}),$

for some permutation-invariant function $g$ and

\displaystyle\begin{split}\bar{h}(\{\bm{x}_{1},\ldots,\bm{x}_{n}\},\bm{x}_{n+1% })&=\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup{D}_{2}\right)\cdot% \left[\widetilde{w}^{*}_{x_{n+1,1}}\cdot\prod_{k=2}^{K}\widetilde{w}^{*}_{x_{n% +1,k}}\right]\\ &\quad\cdot\left[\prod_{c\in[n_{c}]}\frac{1}{\mbinom{n^{c}_{\mathrm{obs}}}{% \bar{n}^{c}_{\mathrm{obs}}}}\right]\cdot\prod_{c=1}^{n_{c}}\prod_{j=1}^{N_{n}^% {c}}\prod_{k=1}^{K-1}\frac{1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k},\end{split}

(A63)

with

\displaystyle\widetilde{w}^{*}_{x_{n+1,1}}=\cfrac{w^{*}_{x_{n+1,1}}}{\sum_{(r,% c)\in\bar{D}_{\mathrm{miss}}}w^{*}_{r,c}},

and, for all $k\in\{2,\dots,K\}$,

\displaystyle\widetilde{w}^{*}_{x_{n+1,k}}=\frac{w^{*}_{x_{n+1,k}}}{\sum_{(r,c% )\in\bar{D}_{\mathrm{miss}}}w^{*}_{r,c}\mathbb{I}\left[c=x_{n+1,1,2}\right]-% \sum_{k^{\prime}=1}^{k-1}w^{*}_{x_{n+1,k^{\prime}}}}.

Therefore, Lemma 2 can be applied, with weights proportional to

\displaystyle p_{i}(\bm{x}_{1},\ldots,\bm{x}_{n+1})

\displaystyle\propto\bar{h}(\{\bm{x}_{1},\ldots,\bm{x}_{n+1}\}\setminus\{\bm{x% }_{i}\},\bm{x}_{i}).

(A64)

In order to compute the right-hand-side of (A64), one must understand how (A63) changes when $\bm{x}_{n+1}$ is swapped with $\bm{x}_{i}$ , for any fixed $i\in[n]$ . This can be done easily, one piece at a time.

To begin, it is immediate to see that swapping $\bm{x}_{n+1}$ with $\bm{x}_{i}$ results in $\mathbb{P}\left(\mathcal{D}_{\mathrm{obs}}=D_{1}\cup{D}_{2}\right)$ being replaced by $\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{obs};i}\right)$. Similarly, $\widetilde{w}^{*}_{x_{n+1,1}},\ldots,\widetilde{w}^{*}_{x_{n+1,K}}$ are replaced by $\widetilde{w}^{*}_{x_{i,1}},\ldots,\widetilde{w}^{*}_{x_{i,K}}$, defined as

\displaystyle\widetilde{w}^{*}_{x_{i,1}}=\frac{w^{*}_{x_{i,1}}}{\sum_{(r,c)\in% \bar{D}_{\mathrm{miss};i}}w^{*}_{r,c}},

and, for all $k\in\{2,\ldots,K\}$ ,

	$\displaystyle\widetilde{w}^{*}_{x_{i,k}}$	$\displaystyle=\frac{w^{*}_{x_{i,k}}}{\sum_{(r,c)\in\bar{D}_{\mathrm{miss};i}}w^{*}_{r,c}\mathbb{I}\left[c=x_{i,1,2}\right]-\sum_{k^{\prime}=1}^{k-1}w^{*}_{x_{i,k^{\prime}}}}$
		$\displaystyle=\frac{w^{*}_{x_{i,k}}}{\sum_{(r,c)\in\bar{D}^{c_{i}}_{\mathrm{miss};i}}w^{*}_{r,c}-\sum^{k-1}_{k^{\prime}=1}w^{*}_{x_{i,k^{\prime}}}},$

where, for any $i\in[n+1]$, $c_{i}$ denotes the column to which $\bm{x}_{i}$ belongs; i.e., $c_{i}\coloneqq x_{i,k,2},\forall k\in[K]$, where $x_{i,k,2}$ is the column of the $k$th entry in $\bm{x}_{i}$.

To understand the notation in the equations above, recall that $\bar{D}_{\mathrm{miss}}=\{(r,c)\in D_{\mathrm{miss}}\mid n^{c}_{\mathrm{miss}}\geq K\}$ is a realization of the pruned missing set $\bar{\mathcal{D}}_{\mathrm{miss}}$, and $n^{c}_{\mathrm{miss}}=\lvert\{(r^{\prime},c^{\prime})\in\mathcal{D}_{\mathrm{miss}}\mid c^{\prime}=c\}\rvert$ is the number of missing entries in column $c$. In the parallel universe where $\bm{x}_{n+1}$ is swapped with $\bm{x}_{i}$, the realization of the missing indices is denoted as $D_{\mathrm{miss};i}$, and the realization of the pruned missing set is $\bar{D}_{\mathrm{miss};i}\coloneqq\{(r,c)\in D_{\mathrm{miss};i}\mid n^{c}_{\mathrm{miss};i}\geq K\}$, where $n^{c}_{\mathrm{miss};i}\coloneqq\lvert\{(r^{\prime},c^{\prime})\in D_{\mathrm{miss};i}:c^{\prime}=c\}\rvert$. Similarly, $\bar{D}^{c}_{\mathrm{miss};i}\coloneqq\{(r^{\prime},c^{\prime})\in\bar{D}_{\mathrm{miss};i}:c^{\prime}=c\}$ denotes entries belonging to column $c$ in the imaginary pruned missing set. Thus, $\widetilde{w}^{*}_{x_{i,1}}$ and $\widetilde{w}^{*}_{x_{i,k}}$ can be interpreted as normalized sampling weights for the imaginary test group $\bm{x}_{i}$.

Next, let $n_{\mathrm{obs}}^{c}$ and $n^{c}_{\mathrm{obs};i}$ denote the numbers of observations in column $c$ from the sets $D_{\mathrm{obs}}$ and $D_{\mathrm{obs};i}$, respectively. Define also $\bar{n}^{c}_{\mathrm{obs}}=K\lfloor n^{c}_{\mathrm{obs}}/K\rfloor$ and $\bar{n}^{c}_{\mathrm{obs};i}=K\lfloor n^{c}_{\mathrm{obs};i}/K\rfloor$, the corresponding numbers of observations remaining in column $c$ after the random pruning step of Algorithm 2. Let $N^{c}_{n}\coloneqq\lvert\left\{i\in[n]:c_{i}=c\right\}\rvert$ denote the number of calibration groups in column $c\in[n_{c}]$. Similarly, let $N^{c}_{n;i}$ denote the corresponding imaginary quantity obtained by swapping the calibration group $\bm{X}_{i}^{\mathrm{cal}}$ with the test group $\bm{X}^{*}$; i.e.,

\displaystyle N^{c}_{n;i}\coloneqq\lvert\left\{j\in[n+1]\setminus\{i\}:c_{j}=c\right\}\rvert.

Further, swapping $\bm{x}_{n+1}$ with $\bm{x}_{i}$ results in $n^{c}_{\mathrm{obs}}$, $\bar{n}^{c}_{\mathrm{obs}}$, and $N^{c}_{n}$ being replaced by $n^{c}_{\mathrm{obs};i}$, $\bar{n}^{c}_{\mathrm{obs};i}$, and $N^{c}_{n;i}$, respectively. Therefore,

\displaystyle\begin{split}&p_{i}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1})% \propto\bar{h}(\{\bm{x}_{1},\ldots,\bm{x}_{n+1}\}\setminus\{\bm{x}_{i}\},\bm{x% }_{i})\\ &\quad\propto\left(\widetilde{w}^{*}_{x_{i,1}}\prod\limits_{k=2}^{K}\widetilde% {w}^{*}_{x_{i,k}}\right)\left(\prod\limits_{c=1}^{n_{c}}\mbinom{n^{c}_{\mathrm% {obs};i}}{\bar{n}^{c}_{\mathrm{obs};i}}^{-1}\right)\left(\prod\limits_{c=1}^{n% _{c}}\prod\limits_{j=1}^{N^{c}_{n;i}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^% {c}_{\mathrm{obs};i}-K(j-1)-k}\right)\cdot\mathbb{P}_{\bm{w}}\left(\mathcal{D}% _{\mathrm{obs}}=D_{\mathrm{obs};i}\right).\end{split}

(A65)

Now, we will further simplify the expression in Equation (A65) to facilitate the practical computation of these weights.

Consider the first term on the right-hand-side of Equation (A65), namely

\left(\widetilde{w}^{*}_{x_{i,1}}\prod\limits_{k=2}^{K}\widetilde{w}^{*}_{x_{i% ,k}}\right).

This quantity depends on the pruned set of missing indices $\bar{D}_{\mathrm{miss};i}$ and, by definition,

\displaystyle\widetilde{w}^{*}_{x_{i,1}}

\displaystyle=\frac{w^{*}_{x_{i,1}}}{\sum_{(r,c)\in\bar{D}_{\mathrm{miss};i}}w% ^{*}_{r,c}}=\frac{w^{*}_{x_{i,1}}}{\sum_{(r,c)\in\bar{D}_{\mathrm{miss}}}w^{*}% _{r,c}-\sum\limits_{k=1}^{K}\left(w^{*}_{x_{n+1,k}}-w^{*}_{x_{i,k}}\right)+u^{% *}_{x_{i,1}}},

where

\displaystyle u^{*}_{x_{i,1}}=\mathbb{I}\left[c_{i}\neq c_{n+1}\right]\left(% \mathbb{I}\left[n^{c_{i}}_{\mathrm{miss}}<K\right]\left(\sum\limits_{(r,c)\in D% ^{c_{i}}_{\mathrm{miss}}}w^{*}_{r,c}\right)-\mathbb{I}\left[n^{c_{n+1}}_{% \mathrm{miss}}<2K\right]\left(\sum\limits_{(r,c)\in D^{c_{n+1}}_{\mathrm{miss}% }\setminus\bm{x}_{n+1}}w^{*}_{r,c}\right)\right),

while, for all $k\in\{2,\ldots,K\}$ ,

	$\displaystyle\widetilde{w}^{*}_{x_{i,k}}$	$\displaystyle=\frac{w^{*}_{x_{i,k}}}{\sum_{(r,c)\in\bar{D}^{c_{i}}_{\mathrm{miss};i}}w^{*}_{r,c}-\sum^{k-1}_{k^{\prime}=1}w^{*}_{x_{i,k^{\prime}}}}$
		$\displaystyle=\frac{w^{*}_{x_{i,k}}}{\sum\limits_{(r,c)\in\bar{D}^{c_{i}}_{\mathrm{miss}}}w^{*}_{r,c}+\sum\limits_{k^{\prime}=k}^{K}w^{*}_{x_{i,k^{\prime}}}-\mathbb{I}\left[c_{i}=c_{n+1}\right]\left(\sum\limits_{k^{\prime}=1}^{K}w^{*}_{x_{n+1,k^{\prime}}}\right)+\mathbb{I}\left[n^{c_{i}}_{\mathrm{miss}}<K\right]\left(\sum\limits_{(r,c)\in D^{c_{i}}_{\mathrm{miss}}}w^{*}_{r,c}\right)}$
		$\displaystyle=\frac{w^{*}_{x_{i,k}}}{\sum\limits_{(r,c)\in D^{c_{i}}_{\mathrm{miss}}}w^{*}_{r,c}+\sum\limits_{k^{\prime}=k}^{K}w^{*}_{x_{i,k^{\prime}}}-\mathbb{I}\left[c_{i}=c_{n+1}\right]\left(\sum\limits_{k^{\prime}=1}^{K}w^{*}_{x_{n+1,k^{\prime}}}\right)}.$

In the equations above, with a slight abuse of notation, we denoted the set of missing indices in column $c_{n+1}$ excluding those in the group $\bm{x}_{n+1}$ as $D^{c_{n+1}}_{\mathrm{miss}}\setminus\bm{x}_{n+1}\coloneqq D^{c_{n+1}}_{\mathrm% {miss}}\setminus\{x_{n+1,k}\}_{k=1}^{K}$ .

Next, let us consider the second term on the right-hand-side of Equation (A65), namely

\left(\prod\limits_{c=1}^{n_{c}}\mbinom{n^{c}_{\mathrm{obs};i}}{\bar{n}^{c}_{% \mathrm{obs};i}}^{-1}\right).

This evaluates the probability of observing a particular realization of ${\mathcal{D}}_{\mathrm{prune}}$ . Since the pruned indices are chosen uniformly at random, this quantity only depends on the number of observations within each column before and after pruning, namely, $n^{c}_{\mathrm{obs};i}$ and $\bar{n}^{c}_{\mathrm{obs};i}$ . By definition, we have

\displaystyle n_{\mathrm{obs};i}^{c}=\begin{cases}n_{\mathrm{obs}}^{c}-K% \mathbb{I}\left[c_{i}\neq c_{n+1}\right],&c=c_{i},\\ n_{\mathrm{obs}}^{c}+K\mathbb{I}\left[c_{i}\neq c_{n+1}\right],&c=c_{n+1},\\ n_{\mathrm{obs}}^{c},&\mathrm{otherwise}.\\ \end{cases}

(A66)

The above expression follows from the fact that swapping $\bm{x}_{i}$ with $\bm{x}_{n+1}$ only affects the observed indices in columns $c_{i}$ and $c_{n+1}$, while all other indices remain the same. In particular, upon swapping, column $c_{i}$ will contain $K$ fewer observations, because $\bm{x}_{i}$ is treated as the unobserved test group, and column $c_{n+1}$ will contain $K$ more observations, because $\bm{x}_{n+1}$ is treated as the calibration group. Similarly,

\displaystyle\bar{n}_{\mathrm{obs};i}^{c}=\begin{cases}\bar{n}_{\mathrm{obs}}^% {c}-K\mathbb{I}\left[c_{i}\neq c_{n+1}\right],&c=c_{i},\\ \bar{n}_{\mathrm{obs}}^{c}+K\mathbb{I}\left[c_{i}\neq c_{n+1}\right],&c=c_{n+1% },\\ \bar{n}_{\mathrm{obs}}^{c},&\mathrm{otherwise}.\\ \end{cases}

(A67)

Combining (A66) and (A67), we can rewrite the second term in (A65) as:

\displaystyle\begin{split}\prod\limits_{c=1}^{n_{c}}\mbinom{n^{c}_{\mathrm{obs% };i}}{\bar{n}^{c}_{\mathrm{obs};i}}^{-1}&=\left[\prod\limits_{c=1}^{n_{c}}% \mbinom{n^{c}_{\mathrm{obs}}}{\bar{n}^{c}_{\mathrm{obs}}}^{-1}\right]\cdot% \cfrac{\mbinom{n^{c_{i}}_{\mathrm{obs};i}}{\bar{n}^{c_{i}}_{\mathrm{obs};i}}^{% -1}}{\mbinom{n^{c_{i}}_{\mathrm{obs}}}{\bar{n}^{c_{i}}_{\mathrm{obs}}}^{-1}}% \cdot\cfrac{\mbinom{n^{c_{n+1}}_{\mathrm{obs};i}}{\bar{n}^{c_{n+1}}_{\mathrm{% obs};i}}^{-1}}{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}}{\bar{n}^{c_{n+1}}_{\mathrm{% obs}}}^{-1}}\\ &=\left[\prod\limits_{c=1}^{n_{c}}\mbinom{n^{c}_{\mathrm{obs}}}{\bar{n}^{c}_{% \mathrm{obs}}}^{-1}\right]\cdot\left\{\cfrac{\mbinom{n^{c_{i}}_{\mathrm{obs}}}% {\bar{n}^{c_{i}}_{\mathrm{obs}}}}{\mbinom{n^{c_{i}}_{\mathrm{obs}}-K}{\bar{n}^% {c_{i}}_{\mathrm{obs}}-K}}\cdot\cfrac{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}}{\bar% {n}^{c_{n+1}}_{\mathrm{obs}}}}{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}+K}{\bar{n}^{% c_{n+1}}_{\mathrm{obs}}+K}}\right\}^{\mathbb{I}\left[c_{i}\neq c_{n+1}\right]}% .\end{split}

(A68)

Then, the third term on the right-hand-side of (A65) is

\left(\prod\limits_{c=1}^{n_{c}}\prod\limits_{j=1}^{N^{c}_{n;i}}\prod\limits_{% k=1}^{K-1}\frac{1}{\bar{n}^{c}_{\mathrm{obs};i}-K(j-1)-k}\right).

This relates to the probability of choosing a specific realization of the calibration groups given the observed indices remaining after random pruning. To aid the simplification of this term, we point out the following relation between $N^{c}_{n;i}$ and $N^{c}_{n}$ , i.e., the number of calibration groups from each column in the imagined observed set and original observed set respectively:

\displaystyle N^{c}_{n;i}=\begin{cases}N^{c}_{n}-\mathbb{I}\left[c_{i}\neq c_{n+1}\right],&c=c_{i},\\ N^{c}_{n}+\mathbb{I}\left[c_{i}\neq c_{n+1}\right],&c=c_{n+1},\\ N^{c}_{n},&\mathrm{otherwise}.\\ \end{cases}

(A69)

Then, using (A67) and (A69), we can write:

\displaystyle\begin{split}&\phantom{=}\prod\limits_{c=1}^{n_{c}}\prod\limits_{j=1}^{N^{c}_{n;i}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c}_{\mathrm{obs};i}-K(j-1)-k}\\
&\qquad=\left[\prod\limits_{c=1}^{n_{c}}\prod\limits_{j=1}^{N^{c}_{n}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}\right]\\
&\qquad\phantom{=}\cdot\left(\cfrac{\prod\limits_{j=1}^{N^{c_{i}}_{n;i}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{i}}_{\mathrm{obs};i}-K(j-1)-k}}{\prod\limits_{j=1}^{N^{c_{i}}_{n}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{i}}_{\mathrm{obs}}-K(j-1)-k}}\cdot\cfrac{\prod\limits_{j=1}^{N^{c_{n+1}}_{n;i}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{n+1}}_{\mathrm{obs};i}-K(j-1)-k}}{\prod\limits_{j=1}^{N^{c_{n+1}}_{n}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}-K(j-1)-k}}\right)^{\mathbb{I}\left[c_{i}\neq c_{n+1}\right]}\\
&\qquad=\left[\prod\limits_{c=1}^{n_{c}}\prod\limits_{j=1}^{N^{c}_{n}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}\right]\cdot\left(\prod\limits_{k=1}^{K-1}\frac{\bar{n}^{c_{i}}_{\mathrm{obs}}-k}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K-k}\right)^{\mathbb{I}\left[c_{i}\neq c_{n+1}\right]}.\end{split}

(A70)

Above, the last equality follows from the following simplification based on (A69) and (A67), assuming that $c_{i}\neq c_{n+1}$ :

\displaystyle\begin{split}&\phantom{=}\cfrac{\prod\limits_{j=1}^{N^{c_{i}}_{n;i}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{i}}_{\mathrm{obs};i}-K(j-1)-k}}{\prod\limits_{j=1}^{N^{c_{i}}_{n}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{i}}_{\mathrm{obs}}-K(j-1)-k}}\cdot\cfrac{\prod\limits_{j=1}^{N^{c_{n+1}}_{n;i}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{n+1}}_{\mathrm{obs};i}-K(j-1)-k}}{\prod\limits_{j=1}^{N^{c_{n+1}}_{n}}\prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}-K(j-1)-k}}\\
&=\frac{\prod\limits_{j=1}^{N^{c_{i}}_{n}}\prod\limits_{k=1}^{K-1}\left[\bar{n}^{c_{i}}_{\mathrm{obs}}-K(j-1)-k\right]}{\prod\limits_{j=1}^{N^{c_{i}}_{n}-1}\prod\limits_{k=1}^{K-1}\left[\left(\bar{n}^{c_{i}}_{\mathrm{obs}}-K\right)-K(j-1)-k\right]}\cdot\frac{\prod\limits_{j=1}^{N^{c_{n+1}}_{n}}\prod\limits_{k=1}^{K-1}\left[\bar{n}^{c_{n+1}}_{\mathrm{obs}}-K(j-1)-k\right]}{\prod\limits_{j=1}^{N^{c_{n+1}}_{n}+1}\prod\limits_{k=1}^{K-1}\left[\left(\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K\right)-K(j-1)-k\right]}\\
&=\frac{\prod\limits_{j=1}^{N^{c_{i}}_{n}}\prod\limits_{k=1}^{K-1}\left[\bar{n}^{c_{i}}_{\mathrm{obs}}-K(j-1)-k\right]}{\prod\limits_{j=1}^{N^{c_{i}}_{n}-1}\prod\limits_{k=1}^{K-1}\left[\bar{n}^{c_{i}}_{\mathrm{obs}}-Kj-k\right]}\cdot\frac{\prod\limits_{j=1}^{N^{c_{n+1}}_{n}}\prod\limits_{k=1}^{K-1}\left[\bar{n}^{c_{n+1}}_{\mathrm{obs}}-K(j-1)-k\right]}{\prod\limits_{j=1}^{N^{c_{n+1}}_{n}+1}\prod\limits_{k=1}^{K-1}\left[\bar{n}^{c_{n+1}}_{\mathrm{obs}}-K(j-2)-k\right]}\\
&=\prod\limits_{k=1}^{K-1}\frac{\bar{n}^{c_{i}}_{\mathrm{obs}}-k}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K-k}.\end{split}
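The telescoping simplification above can be verified with exact arithmetic. The Python sketch below evaluates the ratio before and after simplification in the case $c_{i}\neq c_{n+1}$, using the swapped counts from (A67) and (A69); the function names and the example counts are hypothetical.

```python
from fractions import Fraction

def group_product(nbar, num_groups, K):
    """prod_{j=1}^{num_groups} prod_{k=1}^{K-1} 1 / (nbar - K(j-1) - k)."""
    prod = Fraction(1)
    for j in range(1, num_groups + 1):
        for k in range(1, K):
            prod *= Fraction(1, nbar - K * (j - 1) - k)
    return prod

def ratio_before_simplification(nbar_ci, N_ci, nbar_ct, N_ct, K):
    """Ratio of products with swapped counts: column c_i loses one group and K
    observations, column c_{n+1} (suffix 'ct') gains one group and K observations."""
    swapped_ci = group_product(nbar_ci - K, N_ci - 1, K)
    swapped_ct = group_product(nbar_ct + K, N_ct + 1, K)
    original_ci = group_product(nbar_ci, N_ci, K)
    original_ct = group_product(nbar_ct, N_ct, K)
    return (swapped_ci / original_ci) * (swapped_ct / original_ct)

def ratio_after_simplification(nbar_ci, nbar_ct, K):
    """prod_{k=1}^{K-1} (nbar_ci - k) / (nbar_ct + K - k)."""
    prod = Fraction(1)
    for k in range(1, K):
        prod *= Fraction(nbar_ci - k, nbar_ct + K - k)
    return prod

K = 3
lhs = ratio_before_simplification(nbar_ci=12, N_ci=3, nbar_ct=6, N_ct=1, K=K)
rhs = ratio_after_simplification(nbar_ci=12, nbar_ct=6, K=K)
```

With these counts both expressions equal $(11\cdot 10)/(8\cdot 7)=55/28$; the number of calibration groups in each column cancels, as the derivation indicates.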

Finally, combining (A65) with (A68) and (A70), we arrive at

\displaystyle\begin{split}&p_{i}(\bm{x}_{1},\dots,\bm{x}_{n},\bm{x}_{n+1})\\ &\quad\propto\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{% obs};i}\right)\cdot\left(\widetilde{w}^{*}_{x_{i,1}}\prod\limits_{k=2}^{K}% \widetilde{w}^{*}_{x_{i,k}}\right)\\ &\quad\quad\cdot\left[\prod\limits_{c=1}^{n_{c}}\mbinom{n^{c}_{\mathrm{obs}}}{% \bar{n}^{c}_{\mathrm{obs}}}^{-1}\right]\cdot\left[\cfrac{\mbinom{n^{c_{i}}_{% \mathrm{obs}}}{\bar{n}^{c_{i}}_{\mathrm{obs}}}}{\mbinom{n^{c_{i}}_{\mathrm{obs% }}-K}{\bar{n}^{c_{i}}_{\mathrm{obs}}-K}}\cdot\cfrac{\mbinom{n^{c_{n+1}}_{% \mathrm{obs}}}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}}}{\mbinom{n^{c_{n+1}}_{\mathrm% {obs}}+K}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K}}\right]^{\mathbb{I}\left[c_{i}% \neq c_{n+1}\right]}\\ &\quad\quad\cdot\left[\prod\limits_{c=1}^{n_{c}}\prod\limits_{j=1}^{N^{c}_{n}}% \prod\limits_{k=1}^{K-1}\frac{1}{\bar{n}^{c}_{\mathrm{obs}}-K(j-1)-k}\right]% \cdot\left(\prod\limits_{k=1}^{K-1}\frac{\bar{n}^{c_{i}}_{\mathrm{obs}}-k}{% \bar{n}^{c_{n+1}}_{\mathrm{obs}}+K-k}\right)^{\mathbb{I}\left[c_{i}\neq c_{n+1% }\right]}\\ &\quad\propto\mathbb{P}_{\bm{w}}\left(\mathcal{D}_{\mathrm{obs}}=D_{\mathrm{% obs};i}\right)\cdot\left(\widetilde{w}^{*}_{x_{i,1}}\prod\limits_{k=2}^{K}% \widetilde{w}^{*}_{x_{i,k}}\right)\\ &\quad\quad\cdot\left[\cfrac{\mbinom{n^{c_{i}}_{\mathrm{obs}}}{\bar{n}^{c_{i}}% _{\mathrm{obs}}}}{\mbinom{n^{c_{i}}_{\mathrm{obs}}-K}{\bar{n}^{c_{i}}_{\mathrm% {obs}}-K}}\cdot\cfrac{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}}{\bar{n}^{c_{n+1}}_{% \mathrm{obs}}}}{\mbinom{n^{c_{n+1}}_{\mathrm{obs}}+K}{\bar{n}^{c_{n+1}}_{% \mathrm{obs}}+K}}\cdot\prod\limits_{k=1}^{K-1}\frac{\bar{n}^{c_{i}}_{\mathrm{% obs}}-k}{\bar{n}^{c_{n+1}}_{\mathrm{obs}}+K-k}\right]^{\mathbb{I}\left[c_{i}% \neq c_{n+1}\right]}.\end{split}

∎

A4.4 Finite-Sample Coverage Bounds

Proof of Theorem 1.

Recall that, by construction,

\displaystyle\bm{M}_{\bm{X}^{*}}\in\mathcal{C}(\bm{X}^{*},\widehat{\tau}_{% \alpha,K},\widehat{\bm{M}})\Longleftrightarrow S^{*}\leq\widehat{\tau}_{\alpha% ,K}=Q\Big{(}1-\alpha;\sum_{i=1}^{n}p_{i}\delta_{S_{i}}+p_{n+1}\delta_{\infty}% \Big{)}.

Therefore, Theorem 1 follows directly by combining Proposition 1, Lemma 2, and the characterization of the conformalization weights given by Equation (19).

∎
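The threshold $\widehat{\tau}_{\alpha,K}$ appearing in the proof above is a quantile of a weighted discrete distribution with a point mass at $+\infty$. As a minimal sketch of how such a quantile could be evaluated, the Python function below walks through the sorted calibration scores and accumulates their normalized weights. The function name `weighted_conformal_quantile`, its arguments, and the small tolerance `tol` (added only to absorb floating-point rounding) are illustrative assumptions, not part of the paper.

```python
import math

def weighted_conformal_quantile(scores, weights, alpha, tol=1e-12):
    """Return Q(1 - alpha; sum_i p_i * delta_{S_i} + p_{n+1} * delta_inf).

    scores  : calibration scores S_1, ..., S_n
    weights : n + 1 nonnegative weights; the last one belongs to the test point
    """
    if len(weights) != len(scores) + 1:
        raise ValueError("need one weight per score plus one for the test point")
    total = float(sum(weights))
    cumulative = 0.0
    # Visit the calibration scores in increasing order, accumulating their mass.
    for score, weight in sorted(zip(scores, weights[:-1])):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha - tol:
            return score
    # All remaining mass sits on the point mass at +infinity.
    return math.inf
```

With equal weights this reduces to the usual split-conformal quantile; when the calibration mass is insufficient to reach level $1-\alpha$, the function returns `math.inf`, which corresponds to an uninformative confidence region.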

Proof of Theorem 2.

Recall that, by construction,

\displaystyle\bm{M}_{\bm{X}^{*}}\in\mathcal{C}(\bm{X}^{*},\widehat{\tau}_{% \alpha,K},\widehat{\bm{M}})\Longleftrightarrow S^{*}\leq\widehat{\tau}_{\alpha% ,K}=Q\Big{(}1-\alpha;\sum_{i=1}^{n}p_{i}\delta_{S_{i}}+p_{n+1}\delta_{\infty}% \Big{)}.

Therefore, applying Lemma A5, we see that it suffices to prove

\displaystyle\mathbb{P}\left[S^{*}\leq Q\Big{(}1-\alpha;\sum_{i=1}^{n}p_{i}% \delta_{S_{i}}+p_{n+1}\delta_{S^{*}}\Big{)}\right]\leq 1-\alpha+\mathbb{E}% \left[{\max_{i\in[n+1]}p_{i}(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{% cal}}_{n},\bm{X}^{*})}\right].

(A71)

Let $E_{x}$ denote the event that $\{\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*}\}=\{\bm{x}_{1},\dots,\bm{x}_{n+1}\}$, $\mathcal{D}_{\mathrm{drop}}=D_{0}$, and $\mathcal{D}_{\mathrm{train}}=D_{1}$, for some possible realizations $\bm{x}=(\bm{x}_{1},\ldots,\bm{x}_{n+1})$ of $\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*}$, $D_{0}$ of $\mathcal{D}_{\mathrm{drop}}$, and $D_{1}$ of $\mathcal{D}_{\mathrm{train}}$. Let also $\{v_{1},\ldots,v_{n+1}\}$ denote the realization of $\{S_{1},\ldots,S_{n},S^{*}\}$ corresponding to the event $E_{x}$. Applying the definition of conditional probability, as in the proof of Lemma 2, we can see that $S^{*}\mid E_{x}\sim\sum_{i=1}^{n+1}p_{i}(\bm{x}_{1},\dots,\bm{x}_{n+1})\delta_{v_{i}}$, with the weights $p_{i}$ given by (19). This implies that

\mathbb{P}\left[S^{*}\leq Q\Big{(}1-\alpha;\sum_{i=1}^{n+1}p_{i}\delta_{v_{i}}% \Big{)}\mid E_{x}\right]\leq 1-\alpha+\max_{i\in[n+1]}p_{i}(\bm{x}_{1},\dots,% \bm{x}_{n+1}),

and further, by taking an expectation with respect to the randomness in $E_{x}$ ,

\mathbb{P}\left[S^{*}\leq Q\Big{(}1-\alpha;\sum_{i=1}^{n}p_{i}\delta_{S_{i}}+p% _{n+1}\delta_{S^{*}}\Big{)}\right]\leq 1-\alpha+\mathbb{E}\left[\max_{i\in[n+1% ]}p_{i}(\bm{X}^{\mathrm{cal}}_{1},\dots,\bm{X}^{\mathrm{cal}}_{n},\bm{X}^{*})% \right].

∎

A4.5 Efficient Evaluation of the Conformalization Weights

Proof of Proposition 2.

We begin by focusing on the special case of $i=n+1$. In that case, Equation (24) becomes a special case of the results derived for the multivariate Wallenius’ noncentral hypergeometric distribution \citep{wallenius1963biased,chesson1976non,fog2007wnchypg}. While the original problem addresses biased sampling without replacement from an urn containing colored balls, our model in (1) can be equivalently interpreted as drawing samples without replacement from an urn comprising $n_{r}n_{c}$ balls. Each ball is uniquely labeled with a color represented by $(r,c)\in[n_{r}]\times[n_{c}]$, and it is drawn with a probability proportional to $w_{r,c}$; e.g., see Section 2. Therefore, from Equation (19) in \citet{fog2007wnchypg}:

\displaystyle\begin{split}\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{% \mathrm{obs}})&=\int_{0}^{1}\Phi(\tau;h)\,d\tau.\end{split}

(A72)
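As a sanity check of the integral representation in (A72), the following sketch compares it with a brute-force computation on a tiny urn. It assumes the form $\Phi(\tau;h)=h\delta\tau^{h\delta-1}\prod_{(r,c)\in D_{\mathrm{obs}}}(1-\tau^{hw_{r,c}})$ recalled in (A74), and that entries are drawn sequentially without replacement with probability proportional to their weights. The helper names (`exact_prob`, `simpson`, `integral_prob`) and the toy weights are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def exact_prob(weights, observed):
    """P(D_obs = observed), summing over all draw orders (tiny urns only)."""
    total = 0.0
    for order in permutations(observed):
        remaining = sum(weights)
        p = 1.0
        for idx in order:
            p *= weights[idx] / remaining   # biased draw without replacement
            remaining -= weights[idx]
        total += p
    return total

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    step = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(a + k * step)
    return s * step / 3

def integral_prob(weights, observed, h=1.0):
    """Integral of Phi(tau; h) over (0, 1); requires h * delta >= 1 here."""
    delta = sum(w for i, w in enumerate(weights) if i not in observed)
    def Phi(tau):
        val = h * delta * tau ** (h * delta - 1)
        for i in observed:
            val *= 1 - tau ** (h * weights[i])
        return val
    return simpson(Phi, 0.0, 1.0)
```

For weights $(1,2,3,4)$ with the first two entries observed, both routes give $17/360$, and the integral does not depend on the choice of the scaling parameter $h$.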

Next, we turn to proving Equation (24) for a general $i\in[n+1]$ .

For any $i\in[n+1]$, imagine an alternative world in which $\bm{x}_{n+1}$ is swapped with $\bm{x}_{i}$. Let $\delta_{i}:=\sum_{(r,c)\in D_{\mathrm{miss};i}}w_{r,c}$ indicate the cumulative weight of all missing entries in this imaginary world, analogous to $\delta$. It is easy to see that $\delta_{i}=\delta+d_{i}$, where $d_{i}:=\sum_{k=1}^{K}(w_{x_{i,k}}-w_{x_{n+1,k}})$. Therefore, we can express the probability in the imaginary world using Equation (A72):

\displaystyle\begin{split}\mathbb{P}_{\bm{w}}({\mathcal{D}}_{\mathrm{obs}}=D_{\mathrm{obs};i})&=\int_{0}^{1}h\delta_{i}\tau^{h\delta_{i}-1}\prod_{(r,c)\in D_{\mathrm{obs};i}}\left(1-\tau^{hw_{r,c}}\right)\,d\tau\\ &=\int_{0}^{1}h\left(\delta+d_{i}\right)\tau^{h\left(\delta+d_{i}\right)-1}\prod_{(r,c)\in D_{\mathrm{obs}}}\left(1-\tau^{hw_{r,c}}\right)\cdot\prod\limits_{k=1}^{K}\left(\frac{1-\tau^{hw_{n+1,k}}}{1-\tau^{hw_{i,k}}}\right)\,d\tau\\ &=\int_{0}^{1}h\delta\tau^{h\delta-1}\prod_{(r,c)\in D_{\mathrm{obs}}}\left(1-\tau^{hw_{r,c}}\right)\cdot\frac{h\left(\delta+d_{i}\right)\tau^{h\left(\delta+d_{i}\right)-1}}{h\delta\tau^{h\delta-1}}\cdot\left(\prod\limits_{k=1}^{K}\frac{1-\tau^{hw_{n+1,k}}}{1-\tau^{hw_{i,k}}}\right)\,d\tau\\ &=\int_{0}^{1}\Phi(\tau;h)\cdot\eta_{i}(\tau;h)\,d\tau,\end{split}

where, for any $\tau\in(0,1)$ ,

\displaystyle\eta_{i}(\tau;h):=\frac{\tau^{hd_{i}}(\delta+d_{i})}{\delta}\cdot% \left(\prod\limits_{k=1}^{K}\frac{1-\tau^{hw_{n+1,k}}}{1-\tau^{hw_{i,k}}}% \right).

(A73)

∎
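The swap identity established in the proof of Proposition 2 can also be checked numerically. The sketch below, under the same assumed form of $\Phi(\tau;h)$ as in (A74), integrates $\Phi(\tau;h)\cdot\eta_{i}(\tau;h)$ and compares the result with the direct integral for the swapped observation set. The toy configuration (observed weights $1,2,3$, missing weights $4,5$, groups of size $K=1$), the midpoint rule, and all function names are illustrative assumptions.

```python
def midpoint(f, n=20000):
    """Midpoint rule on (0, 1); it never evaluates f at the endpoints,
    where eta has removable singularities."""
    step = 1.0 / n
    return step * sum(f((k + 0.5) * step) for k in range(n))

def Phi(tau, h, delta, obs_weights):
    """Phi(tau; h) = h*delta*tau^(h*delta-1) * prod over observed of (1 - tau^(h*w))."""
    val = h * delta * tau ** (h * delta - 1)
    for w in obs_weights:
        val *= 1 - tau ** (h * w)
    return val

def eta(tau, h, delta, w_cal, w_test):
    """eta_i(tau; h) from (A73); w_cal are the weights of x_i, w_test those of x_{n+1}."""
    d = sum(w_cal) - sum(w_test)
    ratio = 1.0
    for wc, wt in zip(w_cal, w_test):
        ratio *= (1 - tau ** (h * wt)) / (1 - tau ** (h * wc))
    return tau ** (h * d) * (delta + d) / delta * ratio

def swap_check(h):
    obs, delta = [1.0, 2.0, 3.0], 9.0        # original world: missing weights 4 and 5
    w_cal, w_test = [1.0], [4.0]             # x_i is observed, x_{n+1} is missing
    obs_swapped, delta_swapped = [4.0, 2.0, 3.0], 6.0
    direct = midpoint(lambda t: Phi(t, h, delta_swapped, obs_swapped))
    via_eta = midpoint(lambda t: Phi(t, h, delta, obs) * eta(t, h, delta, w_cal, w_test))
    return direct, via_eta
```

Calling `swap_check(2.0)` returns two values that agree up to floating-point rounding, which is the practical appeal of the identity: the weights for all $i\in[n+1]$ can reuse the same function $\Phi$.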

Proof of Lemma 4.

Recall that the logarithm of $\Phi(\tau;h)$ takes the form

\displaystyle\phi(\tau;h):=\log\Phi(\tau;h)=\log(h\delta)+(h\delta-1)\log(\tau% )+\sum_{(r,c)\in D_{\mathrm{obs}}}\log(1-\tau^{hw_{r,c}}),

(A74)

while its first derivative with respect to $\tau$ is:

\displaystyle\phi^{\prime}(\tau;h)

\displaystyle=\frac{h\delta-1}{\tau}-\sum_{(r,c)\in D_{\mathrm{obs}}}\frac{hw_% {r,c}\tau^{hw_{r,c}-1}}{1-\tau^{hw_{r,c}}}.

Consider the function $\tau\phi^{\prime}(\tau;h)$ ,

\displaystyle\tau\phi^{\prime}(\tau;h)

\displaystyle=h\delta-1-\sum_{(r,c)\in D_{\mathrm{obs}}}\frac{hw_{r,c}\tau^{hw% _{r,c}}}{1-\tau^{hw_{r,c}}},

which is strictly decreasing in $\tau$ for all $\tau\in(0,1)$ , because $w_{r,c}>0$ for all $(r,c)$ . Further, if $h>1/\delta$ ,

\displaystyle\lim_{\tau\to 0^{+}}\tau\phi^{\prime}(\tau;h)=h\delta-1>0,

\displaystyle\lim_{\tau\to 1^{-}}\tau\phi^{\prime}(\tau;h)=-\infty.

Then, since $\tau\phi^{\prime}(\tau;h)$ is continuous and strictly decreasing on $(0,1)$, the intermediate value theorem implies that it has exactly one zero for $\tau\in(0,1)$, as long as $h>1/\delta$. In turn, because $\tau>0$, this implies that $\phi^{\prime}(\tau;h)$ has exactly one zero for $\tau\in(0,1)$, as long as $h>1/\delta$. Further, the unique zero of $\phi^{\prime}(\tau;h)$ on $\tau\in(0,1)$ must be the unique maximum of $\phi(\tau;h)$, because, under $h>1/\delta$,

\displaystyle\lim_{\tau\to 0^{+}}\phi(\tau;h)=-\infty,

\displaystyle\lim_{\tau\to 1^{-}}\phi(\tau;h)=-\infty.

∎
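Because $\tau\phi^{\prime}(\tau;h)$ is strictly decreasing and changes sign on $(0,1)$, the unique mode in Lemma 4 can be located by plain bisection. The sketch below is one possible implementation of this idea; the function names, the tolerance, and the toy inputs used in the check are assumptions made for illustration only.

```python
import math

def phi(tau, h, delta, obs_weights):
    """log Phi(tau; h), following (A74)."""
    return (math.log(h * delta) + (h * delta - 1) * math.log(tau)
            + sum(math.log(1 - tau ** (h * w)) for w in obs_weights))

def tau_phi_prime(tau, h, delta, obs_weights):
    """tau * phi'(tau; h), which is strictly decreasing in tau on (0, 1)."""
    return h * delta - 1 - sum(
        h * w * tau ** (h * w) / (1 - tau ** (h * w)) for w in obs_weights)

def find_mode(h, delta, obs_weights, tol=1e-12):
    """Bisection for the unique zero of tau * phi'(tau; h) on (0, 1)."""
    if h * delta <= 1:
        raise ValueError("Lemma 4 requires h > 1 / delta")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tau_phi_prime(mid, h, delta, obs_weights) > 0:
            lo = mid      # still on the increasing side of phi
        else:
            hi = mid      # already past the mode
    return 0.5 * (lo + hi)
```

For example, `find_mode(1.0, 7.0, [1.0, 2.0])` returns a point in $(0,1)$ at which $\phi$ is larger than at any point of a regular grid, in line with the uniqueness claim.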

A4.6 Consistency of the Generalized Laplace Approximation

We begin by stating a formal version of Theorem 3, the result providing the motivation to apply the Laplace method to Equation (27).

Theorem A4.

Let $\{w_{i}\}_{i=1}^{\infty}$ be a sequence of i.i.d. random variables drawn from a distribution $F$ with support on the open interval $(0,1)$ . Consider a sequence of mutually independent Bernoulli random variables $\{x_{i}\}_{i=1}^{\infty}$ , where each $x_{i}\overset{\text{ind.}}{\sim}\mathrm{Bernoulli}(w_{i})$ . Define $\delta_{n}=\sum_{i=1}^{n}(1-x_{i})w_{i}$ and

\displaystyle\Phi_{n}(\tau):=h_{n}\delta_{n}\tau^{h_{n}\delta_{n}-1}\prod_{i=1% }^{n}\left(1-\tau^{h_{n}w_{i}}\right)^{x_{i}}.

(A75)

Above, $h_{n}$ is the unique root of the function

\displaystyle z(h)\coloneqq\frac{\phi_{n}^{\prime}\left(\frac{1}{2}\right)}{2h% }=\delta_{n}-\frac{1}{h}-\sum_{i=1}^{n}\frac{x_{i}w_{i}}{2^{hw_{i}}-1}

(A76)

in the interval $(1/\delta_{n},\infty)$, where $\phi_{n}(\tau)$ is the logarithm of $\Phi_{n}(\tau)$, namely

\displaystyle\phi_{n}(\tau)\coloneqq\log\Phi_{n}(\tau)=\log(h_{n})+\log(\delta% _{n})+(h_{n}\delta_{n}-1)\log(\tau)+\sum_{i=1}^{n}x_{i}\log\left(1-\tau^{h_{n}% w_{i}}\right).

(A77)

Then, $\operatorname*{arg\,max}_{\tau\in[0,1]}\Phi_{n}(\tau)=\frac{1}{2}$ .

Further, consider a sequence of functions $\{f_{n}\}$ , where each $f_{n}\in C^{1}(0,1)$ and $f_{n}\left(\frac{1}{2}\right)>\epsilon_{0}$ , for some constant $\epsilon_{0}>0$ and all $n$ . Suppose there exists some $M>0$ such that $|f_{n}^{\prime}(x)|\leq M$ for all $x\in(0,1)$ and for all $n$ . Then, it holds that

\displaystyle\int_{0}^{1}f_{n}(\tau)\Phi_{n}(\tau)\,d\tau\sim f_{n}\left(\frac% {1}{2}\right)\cdot\Phi_{n}\left(\frac{1}{2}\right)\sqrt{\frac{-2\pi}{\phi_{n}^% {\prime\prime}\left(\frac{1}{2}\right)}}\text{ almost surely as }n\to\infty,

(A78)

or equivalently,

\displaystyle\lim_{n\to\infty}\frac{\int_{0}^{1}f_{n}(\tau)\Phi_{n}(\tau)\,d% \tau}{f_{n}\left(\frac{1}{2}\right)\cdot\Phi_{n}\left(\frac{1}{2}\right)\sqrt{% \frac{-2\pi}{\phi_{n}^{\prime\prime}\left(\frac{1}{2}\right)}}}=1\text{ almost% surely}.

(A79)
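To build intuition for Theorem A4, the following sketch simulates the setting with $F=\mathrm{Uniform}(0.1,0.9)$ and $f_{n}\equiv 1$, solves $z(h)=0$ from (A76) by bisection, and compares a numerical evaluation of $\int_{0}^{1}\Phi_{n}(\tau)\,d\tau$ with the Laplace approximation in (A78). The choice of $F$, the sample size, the midpoint quadrature, the bisection bracket (whose right end uses the upper bound on $h_{n}$ proved later in Lemma A7), and all function names are assumptions made for this illustration; the ratio is computed after dividing both sides by $\Phi_{n}(1/2)$ to avoid underflow.

```python
import math
import random

def simulate(n, seed=0):
    """Draw w_i ~ Uniform(0.1, 0.9) and x_i ~ Bernoulli(w_i)."""
    rng = random.Random(seed)
    w = [rng.uniform(0.1, 0.9) for _ in range(n)]
    x = [1 if rng.random() < wi else 0 for wi in w]
    return w, x

def solve_h(w, x):
    """Bisection for the root of z(h) in (A76); z is increasing in h."""
    n = len(w)
    delta = sum((1 - xi) * wi for wi, xi in zip(w, x))
    obs = [wi for wi, xi in zip(w, x) if xi == 1]
    z = lambda h: delta - 1 / h - sum(wi / (2 ** (h * wi) - 1) for wi in obs)
    lo, hi = 1 / delta, (1 + n / math.log(2)) / delta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if z(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi), delta, obs

def laplace_ratio(n=300, seed=0, grid=4000):
    """Ratio of the numerical integral of Phi_n to its Laplace approximation."""
    w, x = simulate(n, seed)
    h, delta, obs = solve_h(w, x)
    a = [h * wi for wi in obs]
    def phi(tau):
        return (math.log(h * delta) + (h * delta - 1) * math.log(tau)
                + sum(math.log(1 - tau ** ai) for ai in a))
    t = 0.5
    # Second derivative of phi_n at 1/2, using the closed form from the proof.
    phi2 = -(h * delta - 1) / t ** 2 - sum(
        ai * t ** (ai - 2) * (t ** ai + ai - 1) / (1 - t ** ai) ** 2 for ai in a)
    phi_half = phi(t)
    step = 1.0 / grid
    integral = step * sum(
        math.exp(phi((k + 0.5) * step) - phi_half) for k in range(grid))
    return integral / math.sqrt(2 * math.pi / -phi2)
```

Under these assumptions the returned ratio should be close to one for moderately large $n$, which is what (A79) predicts in the limit.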

Proof of Theorem A4.

The preliminary part of this result is proved in Appendix A2.2, where we show that selecting the scaling parameter $h_{n}$ as the unique root of the function in (A76) leads to $\tau_{n}^{*}\coloneqq\operatorname*{arg\,max}_{\tau\in[0,1]}\Phi_{n}(\tau)=% \frac{1}{2}$ .

Our main objective is to approximate the integral

\displaystyle\int_{0}^{1}f_{n}(\tau)\Phi_{n}(\tau)\,d\tau=\int_{0}^{1}f_{n}(% \tau)e^{\phi_{n}(\tau)}\,d\tau,

leveraging a suitable extension of the classical Laplace method reviewed in Appendix A1.4. To this end, we begin by applying a Taylor series expansion around $\tau_{n}^{*}$ , including Lagrange remainder terms; this leads to:

	$\displaystyle f_{n}(\tau)$	$\displaystyle=f_{n}\left(\frac{1}{2}\right)+f_{n}^{\prime}(\xi_{1})\left(\tau-% \frac{1}{2}\right),$
	$\displaystyle\phi_{n}(\tau)$	$\displaystyle=\phi_{n}\left(\frac{1}{2}\right)+\phi_{n}^{\prime}\left(\frac{1}% {2}\right)\left(\tau-\frac{1}{2}\right)+\frac{\phi_{n}^{\prime\prime}\left(% \frac{1}{2}\right)}{2}\left(\tau-\frac{1}{2}\right)^{2}+\frac{\phi_{n}^{\prime% \prime\prime}(\xi_{2})}{6}\left(\tau-\frac{1}{2}\right)^{3},$

for some real numbers $\xi_{1},\xi_{2}$ between $1/2$ and $\tau$, and

	$\displaystyle\phi_{n}(\tau)$	$\displaystyle=\log(h_{n})+\log(\delta_{n})+(h_{n}\delta_{n}-1)\log(\tau)+\sum_% {i=1}^{n}x_{i}\log\left(1-\tau^{h_{n}w_{i}}\right)$
	$\displaystyle\phi_{n}^{\prime}(\tau)$	$\displaystyle=\frac{h_{n}\delta_{n}-1}{\tau}-\sum_{i=1}^{n}x_{i}\frac{h_{n}w_{% i}\tau^{h_{n}w_{i}-1}}{\left(1-\tau^{h_{n}w_{i}}\right)}$
	$\displaystyle\phi_{n}^{\prime\prime}(\tau)$	$\displaystyle=-\frac{h_{n}\delta_{n}-1}{\tau^{2}}-\sum_{i=1}^{n}x_{i}\frac{h_{% n}w_{i}\tau^{h_{n}w_{i}-2}(\tau^{h_{n}w_{i}}+h_{n}w_{i}-1)}{\left(1-\tau^{h_{n% }w_{i}}\right)^{2}}$
	$\displaystyle\phi_{n}^{\prime\prime\prime}(\tau)$	$\displaystyle=2\frac{h_{n}\delta_{n}-1}{\tau^{3}}$
		$\displaystyle+\sum_{i=1}^{n}x_{i}\frac{(h_{n}w_{i})\tau^{(h_{n}w_{i}-3)}\left(% 3(h_{n}w_{i})(\tau^{h_{n}w_{i}}-1)+2(\tau^{h_{n}w_{i}}-1)^{2}+(h_{n}w_{i})^{2}% (\tau^{h_{n}w_{i}}+1)\right)}{(\tau^{h_{n}w_{i}}-1)^{3}}.$

By definition of $h_{n}$ , we know that $\phi_{n}^{\prime}\left(\frac{1}{2}\right)=0$ . Next, we need to establish a suitable bound for $\phi_{n}^{\prime\prime}$ . This task is complicated by the fact that we do not have an explicit expression for $h_{n}$ . Fortunately, however, it is possible to obtain sufficiently tight lower and upper bounds for $h_{n}$ .

Lemma A7.

In the setting of Theorem A4, for any $n>1$ ,

\displaystyle\left(\frac{\frac{2n}{\delta_{n}}}{2^{\frac{2n}{\delta_{n}}}-1}% \frac{s_{n}}{n}+\frac{1}{n}\right)\frac{1}{\delta_{n}}\leq h_{n}\leq\left(1+% \frac{n}{\log{2}}\right)\frac{1}{\delta_{n}}.

(A80)

Further, in the limit of $n\to\infty$ , it holds almost-surely that

\displaystyle\frac{J}{L_{2}}\leq h_{n}\leq\frac{1}{\log(2)L_{2}},

(A81)

where

\displaystyle J:=L_{1}\cdot\frac{2}{L_{2}}\cdot\frac{1}{2^{\frac{2}{L_{2}}}-1},

\displaystyle L_{1}:=\mathbb{E}[x],

\displaystyle L_{2}:=\mathbb{E}[(1-x)w].

The bounds on $h_{n}$ provided by Lemma A7 in turn allow us to bound $\phi_{n}^{\prime\prime}$ away from 0 almost surely for large $n$. This gives us the necessary ingredients to tackle the approximation of the integral. To this end, note that

\displaystyle\int_{0}^{1}f_{n}(\tau)e^{\phi_{n}(\tau)}\,d\tau

\displaystyle=\int_{0}^{1}\left\{f_{n}\left(\frac{1}{2}\right)+f_{n}^{\prime}(% \xi_{1}(\tau))\left(\tau-\frac{1}{2}\right)\right\}e^{\phi_{n}\left(\frac{1}{2% }\right)+\frac{\phi_{n}^{\prime\prime}\left(\frac{1}{2}\right)}{2}\left(\tau-% \frac{1}{2}\right)^{2}+\frac{\phi_{n}^{\prime\prime\prime}(\xi_{2}(\tau))}{6}% \left(\tau-\frac{1}{2}\right)^{3}}\,d\tau,

since $\phi_{n}^{\prime}(\frac{1}{2})=0$ . Applying the change of variables $u=\sqrt{-\phi_{n}^{\prime\prime}\left(\frac{1}{2}\right)}(\tau-\frac{1}{2})$ and defining $K_{n}=\sqrt{-\phi_{n}^{\prime\prime}(1/2)}$ , the integral becomes

		$\displaystyle\int_{0}^{1}f_{n}(\tau)e^{\phi_{n}(\tau)}\,d\tau$		(A82)
		$\displaystyle\quad=\frac{e^{\phi_{n}\left(\frac{1}{2}\right)}}{K_{n}}\int_{-K_% {n}/2}^{K_{n}/2}\left(f_{n}\left(\frac{1}{2}\right)+\frac{u}{K_{n}}f_{n}^{% \prime}(\xi_{1}(u))\right)e^{-\frac{u^{2}}{2}+\frac{\phi_{n}^{\prime\prime% \prime}(\xi_{2}(u))u^{3}}{6K_{n}^{3}}}\,du$
		$\displaystyle\quad=\frac{e^{\phi_{n}\left(\frac{1}{2}\right)}}{K_{n}}\left\{f_% {n}\left(\frac{1}{2}\right)\int_{-K_{n}/2}^{K_{n}/2}e^{-\frac{u^{2}}{2}}\,du+f% _{n}\left(\frac{1}{2}\right)\int_{-K_{n}/2}^{K_{n}/2}e^{-\frac{u^{2}}{2}}\left% (e^{\frac{\phi_{n}^{\prime\prime\prime}(\xi_{2}(u))u^{3}}{6K_{n}^{3}}}-1\right% )\,du\right.$
		$\displaystyle\quad\qquad\left.+\int_{-K_{n}/2}^{K_{n}/2}\frac{u}{K_{n}}f_{n}^{% \prime}(\xi_{1}(u))e^{-\frac{u^{2}}{2}}\,du+\int_{-K_{n}/2}^{K_{n}/2}\frac{u}{% K_{n}}f_{n}^{\prime}(\xi_{1}(u))e^{-\frac{u^{2}}{2}}\left(e^{\frac{\phi_{n}^{% \prime\prime\prime}(\xi_{2}(u))u^{3}}{6K_{n}^{3}}}-1\right)\,du\right\}.$		(A83)

Note that now $\xi_{1}$ and $\xi_{2}$ depend implicitly on $u$ due to the change of variables (they previously depended on $\tau$ ). We will now separately analyze each term in (A82).

The limit of the first integral on the right-hand-side of (A82) can be found by leveraging the following lower bound for $K_{n}$ :

\displaystyle K_{n}=\sqrt{-\phi_{n}^{\prime\prime}\left(\frac{1}{2}\right)}=% \sqrt{n}\sqrt{\frac{-\phi_{n}^{\prime\prime}\left(\frac{1}{2}\right)}{n}}\geq% \sqrt{n}\sqrt{4\frac{h_{n}\delta_{n}-1}{n}}\geq 2\sqrt{n}\sqrt{J},

which leads to

\displaystyle\lim_{n\to\infty}\int_{-K_{n}/2}^{K_{n}/2}e^{-\frac{u^{2}}{2}}\,% du=\sqrt{2\pi}.

(A84)

By contrast, the third integral on the right-hand-side of (A82) is asymptotically negligible. Recall that, by assumption, $f_{n}$ satisfies $|f_{n}^{\prime}(x)|\leq M$ for all $x\in(0,1)$ and $n$ ; therefore,

\displaystyle\left|\int_{-K_{n}/2}^{K_{n}/2}\frac{u}{K_{n}}f_{n}^{\prime}(\xi_% {1}(u))e^{-\frac{u^{2}}{2}}\,du\right|\leq\frac{M}{K_{n}}\int_{-K_{n}/2}^{K_{n% }/2}|u|e^{-\frac{u^{2}}{2}}\,du\to 0\quad\text{ as }n\to\infty.

(A85)

We continue with the analysis of the second and fourth terms on the right-hand-side of (A83), which are more involved. The remainder of the second-order Taylor expansion, previously denoted as $\phi_{n}^{\prime\prime\prime}(\xi_{2}(u))u^{3}/(6K_{n}^{3})$ , can be expanded back into an infinite power series, given the smoothness of $\phi_{n}$ within the interval $(0,1)$ . This expansion is expressed as:

\displaystyle y_{n}(u)\coloneqq\frac{\phi_{n}^{\prime\prime\prime}(\xi_{2}(u))% u^{3}}{6K_{n}^{3}}=\sum_{j=3}^{\infty}\frac{\phi_{n}^{(j)}\left(\frac{1}{2}% \right)}{j!K_{n}^{j}}u^{j},

(A86)

where $\phi_{n}^{(j)}\left(\frac{1}{2}\right)$ represents the $j$-th derivative of $\phi_{n}$ evaluated at $\frac{1}{2}$, and this series converges for all $n$ and all $u\in(-K_{n}/2,K_{n}/2)$.

Note that the $j$-th derivative of $\phi_{n}$ at $1/2$ can be written as:

\displaystyle\frac{\phi_{n}^{(j)}\left(\frac{1}{2}\right)}{n}

\displaystyle=\frac{h_{n}\delta_{n}-1}{n}\frac{d^{j}}{d\tau^{j}}(\log\tau)\Big% {|}_{\tau=\frac{1}{2}}+\frac{1}{n}\sum_{i=1}^{n}x_{i}\frac{d^{j}}{d\tau^{j}}% \log\left(1-\tau^{h_{n}w_{i}}\right)\Big{|}_{\tau=\frac{1}{2}}.

(A87)

To control $y_{n}(u)$ , we will show that (A87) is bounded for large $n$ . Lemma A7 tells us that the first term in (A87) is bounded by constants almost surely. For the second term, define:

\displaystyle g(h,x,w)\coloneqq x\frac{d^{j}}{d\tau^{j}}\log\left(1-\tau^{hw}% \right)\Big{|}_{\tau=\frac{1}{2}},

(A88)

which is continuous for $h>0$ and $w>0$. Given that the interval $C\coloneqq[J/L_{2},1/(\log(2)L_{2})]$ from (A81) is compact, we can use the maximum value in $h$ to bound the function $g$. Thus, we obtain:

\displaystyle\frac{1}{n}\sum_{i=1}^{n}g(h_{n},x_{i},w_{i})\leq\frac{1}{n}\sum_% {i=1}^{n}\max_{h\in C}g(h,x_{i},w_{i})\quad\text{almost surely as }n\to\infty.

(A89)

Then, by the strong law of large numbers,

\displaystyle\frac{1}{n}\sum_{i=1}^{n}\max_{h\in C}g(h,x_{i},w_{i})% \xrightarrow{\mathrm{a.s.}}\mathbb{E}\left[\max_{h\in C}g(h,x,w)\right],

(A90)

leading to:

\displaystyle\frac{1}{n}\sum_{i=1}^{n}g(h_{n},x_{i},w_{i})\leq\mathbb{E}\left[% \max_{h\in C}g(h,x,w)\right]\quad\text{almost surely.}

(A91)

Similarly, we can also show:

\displaystyle\frac{1}{n}\sum_{i=1}^{n}g(h_{n},x_{i},w_{i})\geq\mathbb{E}\left[% \min_{h\in C}g(h,x,w)\right]\quad\text{almost surely.}

(A92)

Therefore, the second term in (A87) is also bounded by constants almost surely. As a result, the whole expression in (A87) is bounded by constants almost surely for large $n$ .

Combining the above results, we conclude that, for all $j\geq 3$ ,

\displaystyle\left|\frac{\phi_{n}^{(j)}\left(\frac{1}{2}\right)}{j!K_{n}^{j}}% \right|\leq\left|\frac{n\frac{\phi_{n}^{(j)}\left(\frac{1}{2}\right)}{n}}{j!K_% {n}^{j}}\right|\leq\left|\frac{n\frac{\phi_{n}^{(j)}\left(\frac{1}{2}\right)}{% n}}{j!(2\sqrt{J})^{j}n^{\frac{j}{2}}}\right|=\mathcal{O}\left(n^{-\left(\frac{% j}{2}-1\right)}\right)\text{ almost surely.}

(A93)

We can now proceed to analyze the integral involving the exponential of the remainder term $y_{n}(u)$ , which appears in the second term on the right-hand-side of (A82). Specifically,

\displaystyle\int_{-\infty}^{\infty}e^{-\frac{u^{2}}{2}}\left(e^{y_{n}(u)}-1% \right)\,du

\displaystyle=\sqrt{2\pi}\mathbb{E}\left[e^{y_{n}(Z)}-1\right]=\sqrt{2\pi}\sum% _{k=1}^{\infty}\frac{\mathbb{E}\left[(y_{n}(Z))^{k}\right]}{k!},

(A94)

where $Z\sim\mathcal{N}(0,1)$ . Below, we show that all terms on the right-hand-side of (A94) are finite and converge to 0 as $n$ increases. To this end, let us start from $k=1$ , noting that

\displaystyle\mathbb{E}[y_{n}(Z)]=\mathbb{E}\left[\sum_{j=3}^{\infty}\frac{% \phi_{n}^{(j)}\left(\frac{1}{2}\right)}{j!K_{n}^{j}}Z^{j}\right]=\sum_{\begin{% subarray}{c}j\geq 4,\\ j\text{ is even}\end{subarray}}\frac{\phi_{n}^{(j)}\left(\frac{1}{2}\right)}{j% !K_{n}^{j}}\frac{j!}{2^{\frac{j}{2}}(\frac{j}{2})!}=\sum_{\begin{subarray}{c}j% \geq 4,\\ j\text{ is even}\end{subarray}}\frac{\phi_{n}^{(j)}\left(\frac{1}{2}\right)}{K% _{n}^{j}2^{\frac{j}{2}}(\frac{j}{2})!},

where the simplification arises because all odd moments of a standard normal are zero, and the even moments follow from the moment generating function. Given (A93), it follows that $\mathbb{E}[y_{n}(Z)]\to 0$ as $n\to\infty$. Similarly, it can also be shown that all higher moments of $y_{n}(Z)$ are finite and converge to 0 as $n$ increases. Consequently, we conclude that

\displaystyle\mathbb{E}\left[e^{y_{n}(Z)}-1\right]\to 0\quad\text{as }n\to\infty.
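The Gaussian moment identity used above, $\mathbb{E}[Z^{j}]=j!/(2^{j/2}(j/2)!)$ for even $j$ and $0$ for odd $j$, is a standard fact. As a small illustration, the sketch below checks that it agrees with the equivalent double-factorial form $(j-1)!!$; the function names are hypothetical.

```python
import math

def normal_moment(j):
    """E[Z^j] for Z ~ N(0, 1): j! / (2^(j/2) * (j/2)!) if j is even, else 0."""
    if j % 2 == 1:
        return 0
    return math.factorial(j) // (2 ** (j // 2) * math.factorial(j // 2))

def double_factorial_odd(j):
    """(j - 1)!! = (j - 1)(j - 3)...3*1 for even j >= 2."""
    out = 1
    for m in range(j - 1, 0, -2):
        out *= m
    return out
```

For instance, `normal_moment(4)` equals 3 and `normal_moment(6)` equals 15, matching the fourth and sixth moments of a standard normal.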

Therefore, the limit of the second term in the Taylor error expansion (A83) is

\displaystyle\lim_{n\to\infty}\int_{-\frac{K_{n}}{2}}^{\frac{K_{n}}{2}}e^{-% \frac{u^{2}}{2}}\left(e^{\frac{\phi_{n}^{\prime\prime\prime}(\xi_{2}(u))u^{3}}% {6K_{n}^{3}}}-1\right)\,du

\displaystyle=0\text{ almost surely.}

(A95)

The fourth term in the Taylor error expansion (A83) vanishes similarly because $f_{n}^{\prime}$ is bounded; i.e.,

\displaystyle\lim_{n\to\infty}\int_{-\frac{K_{n}}{2}}^{\frac{K_{n}}{2}}\frac{u% }{K_{n}}f_{n}^{\prime}(\xi_{1}(u))e^{-\frac{u^{2}}{2}}\left(e^{\frac{\phi_{n}^% {\prime\prime\prime}(\xi_{2}(u))u^{3}}{6K_{n}^{3}}}-1\right)\,du

\displaystyle=0\text{ almost surely.}

(A96)

Combining (A82) with (A84), (A85), (A95), and (A96), we arrive at the desired result.

∎

Proof of Lemma A7.

It is immediate from (A76) that $h_{n}>1/\delta_{n}$ . To obtain an upper bound, recall that $z(h_{n})=0$ , for the function $z$ in (A76). This implies that

	$\displaystyle\delta_{n}$	$\displaystyle=\frac{1}{h_{n}}+\sum_{i=1}^{n}\frac{x_{i}w_{i}}{2^{h_{n}w_{i}}-1}$
		$\displaystyle\leq\frac{1}{h_{n}}+\sum_{i=1}^{n}x_{i}\frac{w_{i}}{\log(2)h_{n}w% _{i}}\quad\text{(since }2^{x}-1>\log(2)x\quad\text{ for all }x>0)$
		$\displaystyle=\left(1+\frac{s_{n}}{\log(2)}\right)\frac{1}{h_{n}}\leq\left(1+% \frac{n}{\log(2)}\right)\frac{1}{h_{n}},$

where $s_{n}=\sum_{i=1}^{n}x_{i}$ denotes the number of successful Bernoulli trials. Therefore,

\displaystyle h_{n}\leq\left(1+\frac{n}{\log(2)}\right)\frac{1}{\delta_{n}}.

(A97)

The upper bound in (A97) also allows us to find a tighter lower bound. Note that $\frac{x}{2^{x}-1}$ is a decreasing function of $x>0$ , and $w_{i}<1$ for all $i$ . Therefore, $z(h_{n})=0$ implies

\displaystyle h_{n}\delta_{n}-1=\sum_{i=1}^{n}\frac{x_{i}h_{n}w_{i}}{2^{h_{n}w% _{i}}-1}\geq\sum_{i=1}^{n}\frac{x_{i}h_{n}}{2^{h_{n}}-1}=\frac{h_{n}}{2^{h_{n}% }-1}s_{n}.

Combining this result with (A97), we obtain the following lower bound:

\displaystyle\frac{h_{n}\delta_{n}-1}{n}\geq\frac{h_{n}}{2^{h_{n}}-1}\frac{s_{% n}}{n}\geq\frac{\left(1+\frac{n}{\log(2)}\right)\frac{1}{\delta_{n}}}{2^{\left% (1+\frac{n}{\log(2)}\right)\frac{1}{\delta_{n}}}-1}\frac{s_{n}}{n}\geq\frac{% \frac{2n}{\delta_{n}}}{2^{\frac{2n}{\delta_{n}}}-1}\frac{s_{n}}{n}.

(A98)

This completes the proof of (A80).

To prove the second part, we apply the strong law of large numbers, by which $\frac{s_{n}}{n}=\frac{\sum_{i=1}^{n}x_{i}}{n}\xrightarrow{\mathrm{a.s.}}\mathbb{E}[x]$ as $n\to\infty$, and $\frac{\delta_{n}}{n}=\frac{\sum_{i=1}^{n}(1-x_{i})w_{i}}{n}\xrightarrow{\mathrm{a.s.}}\mathbb{E}[(1-x)w]$ as $n\to\infty$. Therefore, by the continuous mapping theorem,

\displaystyle\frac{h_{n}\delta_{n}-1}{n}\geq\frac{h_{n}}{2^{h_{n}}-1}\frac{s_{% n}}{n}\geq L_{1}\cdot\frac{2}{L_{2}}\cdot\frac{1}{2^{\frac{2}{L_{2}}}-1}:=J% \quad\text{ almost surely as }n\to\infty,

where the expected values $L_{1}=\mathbb{E}[x]$ and $L_{2}=\mathbb{E}[(1-x)w]$ do not depend on $n$ . Therefore, in the limit of $n\to\infty$ , it holds almost-surely that

\displaystyle(nJ+1)\frac{1}{\delta_{n}}\leq h_{n}\leq\left(1+\frac{n}{\log{2}}\right)\frac{1}{\delta_{n}}.

Finally, recall that $\delta_{n}/n\xrightarrow{\mathrm{a.s.}}L_{2}$ as $n\to\infty$ . Consequently, we have, almost surely, that: $J/L_{2}\leq h_{n}\leq 1/[\log(2)L_{2}]$ as $n\to\infty$ . ∎
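The finite-sample bounds in (A80) can be checked on simulated data. The sketch below draws $w_{i}\sim\mathrm{Uniform}(0.1,0.9)$ and $x_{i}\sim\mathrm{Bernoulli}(w_{i})$, solves $z(h)=0$ from (A76) using a bracket that does not rely on the lemma itself, and evaluates both sides of (A80). The distribution of $w_{i}$, the sample size, the seed, and the function names are assumptions made for illustration.

```python
import math
import random

def root_h(w, x):
    """Root of z(h) from (A76), bracketed by doubling and refined by bisection."""
    delta = sum((1 - xi) * wi for wi, xi in zip(w, x))
    obs = [wi for wi, xi in zip(w, x) if xi == 1]
    z = lambda h: delta - 1 / h - sum(wi / (2 ** (h * wi) - 1) for wi in obs)
    lo = hi = 1 / delta          # z(1 / delta) < 0 whenever some x_i = 1
    while z(hi) < 0:             # expand the bracket until the sign changes
        hi *= 2
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if z(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi), delta, len(obs)

def a80_bounds(n, delta, s):
    """Lower and upper bounds on h_n as stated in (A80)."""
    u = 2 * n / delta
    lower = (u / (2 ** u - 1) * s / n + 1 / n) / delta
    upper = (1 + n / math.log(2)) / delta
    return lower, upper

def check(n=200, seed=1):
    rng = random.Random(seed)
    w = [rng.uniform(0.1, 0.9) for _ in range(n)]
    x = [1 if rng.random() < wi else 0 for wi in w]
    h, delta, s = root_h(w, x)
    lower, upper = a80_bounds(n, delta, s)
    return lower, h, upper
```

Calling `check()` returns a triple `(lower, h, upper)` whose middle entry should lie between the other two.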

	$\displaystyle\widetilde{w}^{*}_{x_{i,k}}$	$\displaystyle=\frac{w^{}_{x_{i,k}}}{\sum_{(r,c)\in\bar{D}^{c_{i}}_{\mathrm{% miss};i}}w^{}_{r,c}-\sum^{k-1}_{k^{\prime}=1}w^{*}_{x_{i;k^{\prime}}}}$
		$\displaystyle=\frac{w^{}_{x_{i,k}}}{\sum\limits_{(r,c)\in\bar{D}^{c_{i}}_{% \mathrm{miss}}}w^{}_{r,c}+\sum\limits_{k^{\prime}=k}^{K}w^{}_{x_{i,k^{\prime% }}}-\mathbb{I}\left[c_{i}=c_{n+1}\right]\left(\sum\limits_{k^{\prime}=1}^{K}w^% {}_{x_{n+1,k^{\prime}}}\right)+\mathbb{I}\left[n^{c_{i}}_{\mathrm{miss}}<K% \right]\left(\sum\limits_{(r,c)\in D^{c_{i}}_{\mathrm{miss}}}w^{*}_{r,c}\right)}$
		$\displaystyle=\frac{w^{}_{x_{i,k}}}{\sum\limits_{(r,c)\in D^{c_{i}}_{\mathrm{% miss}}}w^{}_{r,c}+\sum\limits_{k^{\prime}=k}^{K}w^{}_{x_{i,k^{\prime}}}-% \mathbb{I}\left[c_{i}=c_{n+1}\right]\left(\sum\limits_{k^{\prime}=1}^{K}w^{}_% {x_{n+1,k^{\prime}}}\right)}.$