Structured Conformal Inference for Matrix Completion with Applications to Group Recommender Systems
Abstract
We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which make inferences for one individual at a time, our method achieves stronger group-level guarantees by carefully assembling a structured calibration data set mimicking the patterns expected among the test group of interest. We propose a generalized weighted conformalization framework to deal with the lack of exchangeability arising from such structured calibration, and in this process we introduce several innovations to overcome computational challenges. The practicality and effectiveness of our method are demonstrated through extensive numerical experiments and an analysis of the MovieLens 100K data set.
Keywords: Collaborative filtering; confidence regions; conformal inference; exchangeability; Laplace’s method; simultaneous inference.
1 Introduction
1.1 Background and Motivation
Many data-driven decision problems require a simultaneous understanding of several related unknowns, motivating the use of joint confidence (or prediction) regions. This explains why the concept of joint (or simultaneous) inference has a rich history, with roots tracing back to seminal works by \citetscheffe1953method and others \citeproy1953simultaneous,goodman1965simultaneous. Despite its established roots in classical statistics, this topic remains relatively under-explored in the context of distribution-free predictive inference, or conformal inference \citepvovk2005algorithmic,lei2018distribution. In fact, conformal methods have so far primarily focused on predicting individual outcomes one by one, typically under a data exchangeability assumption. As we shall see, this traditional approach can make it hard to aggregate individual inferences efficiently. This is especially a challenge in situations where the goal is to simultaneously predict several quantities exhibiting potentially complicated dependencies, and a straightforward Bonferroni correction \citepvovk2015transductive applied to individual-level predictions would be too conservative. We begin to fill this gap in the literature by developing a novel simultaneous conformal inference \citepvovk2005algorithmic method for matrix completion, practically motivated by collaborative filtering for group recommendations.
Group recommender algorithms provide suggestions for collective decisions in diverse contexts, such as movies \citepquijano2011happy, music \citepmccarthy1998musicfx, travel \citepherzog2018tourrec, restaurants \citepmccarthy2002pocket, and hiring \citepbaskin2009preference. For example, think of a group of friends faced with the daunting task of selecting a movie that caters to everyone’s preferences. By utilizing data on each individual’s past streaming history and previously indicated movie preferences, a group recommender system can streamline the decision-making process, enhancing the likelihood of an inclusive and gratifying movie-watching experience for all involved. This explains why the topic is gaining increased traction \citepjameson2007recommendation,felfernig2018group,dara2020survey, fueled by increasing usage of mobile devices and social networks that connect users and gather relevant data.
While previous research on recommender systems concentrated on algorithmic aspects, uncertainty estimation is an essential ingredient in principled decision-making \citepzukerman2001predictive,adomavicius2007towards,zhang2016prediction,himabindu2018conformal,coscrato2023estimating. Its significance is underscored by the inherent limitations of inferences based on sparse data, subjective ratings, diverse user behaviors, and potential inaccuracies due to noise \citeplam2006you. In fact, uncertainty estimation can enhance transparency \citepherlocker2000explaining and plays a crucial role in identifying flawed predictions and reinforcing user trust \citepmcnee2003confidence. Moreover, it can provide valuable insights for developing innovative algorithmic strategies \citepprice2005optimal, potentially entailing the exclusion of less confident recommendations. It is also beneficial for internal company processes, such as comparing different algorithms \citepgunawardana2009survey and offering guidance on future data collection requirements \citepchen2021exploration. In the context of group recommendations, uncertainty estimation becomes even more important \citepde2009managing,sacharidis2019modeling,ismailoglu2022aggregating because it can play a critical role in the aggregation of potentially conflicting individual preferences \citepdara2020survey,xiao2017fairness,polylens-grs-2001.
To illustrate this idea, consider a recommendation algorithm that needs to suggest a movie for a group of friends, Alice, Ben, and Chris. All three users are predicted to give 5/5 ratings to “2001: A Space Odyssey”. However, the algorithm is only confident about Alice’s and Ben’s ratings, while Chris’s predicted rating is close to a random guess, due for example to limited data related to his enjoyment of science fiction. Moreover, the algorithm predicts that all three users would give a 4/5 rating to “The Godfather”, and these predictions are all highly confident. Understanding the uncertainty in both cases would facilitate the recommendation process. For instance, it would lead to the selection of “The Godfather” if the system follows the “least misery” principle (assuming that the group’s overall satisfaction is influenced by the least satisfied member \citeppolylens-grs-2001) and to “2001: A Space Odyssey” under the opposite “most pleasure” principle.
With this motivation, we consider the problem of jointly predicting (or estimating) multiple ratings corresponding to different users. We focus on collaborative filtering via matrix completion—the task of approximating the missing entries in a sparsely observed ratings matrix \citepramlatchan2018survey, where rows represent users and columns depict products. If the matrix exhibits certain patterns, such as a low-rank structure, it becomes feasible to impute the missing entries \citeprennie2005fast, candes_exact_2008, candes_noise_2010. The underlying concept is quite intuitive. Individuals often seek recommendations from their peers, placing more trust in suggestions from people whose preferences align closely with their own. Thus, the structure of the preference matrix captures valuable similarities between diverse users and products.
While conformal inference has already been recognized as providing a useful uncertainty estimation framework in the context of collaborative filtering and matrix completion \citephimabindu2018conformal,gui2023conformalized,shao2023distribution, we will explain below that the task of simultaneously estimating uncertainty in a group setting is both novel and particularly challenging, necessitating a creative approach.
1.2 Preview of Our Contributions
We study the problem of constructing a confidence region for missing entries within the same column of a partially observed matrix, for different values of . Focusing on groups of entries within the same column is helpful for concreteness, but the main elements of our solution could be adapted to also address different related questions.
Figure 1 previews the performance of our method applied to the MovieLens 100K data \citepmovielens100k, analyzing a ratings matrix with 800 rows (users) and 1000 columns (movies). Approximately 94% of the entries are missing, and the goal is to construct a joint 90% confidence region (in the shape of a hyper-cube) for the unobserved ratings assigned to a random movie by a group of users. We refer to Section 5.2 for more details on this data analysis. In this context, an unadjusted application of the conformalized matrix completion method of \citetgui2023conformalized would produce individual confidence intervals for one user/movie pair at a time but would not achieve group-level coverage. Therefore, we compare our method to a benchmark that seeks simultaneous group validity by applying a straightforward Bonferroni adjustment to the individual confidence intervals. Our method outperforms this Bonferroni benchmark, producing narrower and hence more informative confidence regions thanks to its ability to automatically adapt to the complex dependency structures arising from these data. We refer to Appendix A1 for a review of the method of \citetgui2023conformalized and further details on the unadjusted and Bonferroni baseline approaches.
Our method can be intuitively explained as follows. Conformal inference generally aims to transform the output of a machine learning model into a confidence set with a tunable parameter that controls the margin of error. This parameter is calibrated, using hold-out data assumed to be exchangeable with the test point, to minimize the size of the confidence set while guaranteeing a suitable notion of average coverage. To address our joint estimation problem, we extend the standard conformal strategy by constructing and leveraging a structured calibration sample consisting of groups of user/product observations that mimic the patterns expected at test time.
Yet, it is not easy to translate this high-level idea into a concrete method. The main obstacle is that our structured calibration data are inconsistent with exchangeability, preventing the use of standard conformal techniques. Our solution draws inspiration from the weighted exchangeability framework of \citettibshirani-covariate-shift-2019, which was developed to address a different covariate shift problem. However, this approach gives rise to significant computational challenges in our setting, as it involves summing over exponentially many permutations to account for the lack of exchangeability. We overcome these challenges by fusing the Gumbel-max trick \citepgumbel1954statistical with a new extension of the classical Laplace method \citeplaplace1774memoire. The Gumbel-max trick converts the intractable sum into a more manageable, although still analytically intractable, integral, and the Laplace method provides an accurate approximation of this integral. Although our integration technique is not exact, we demonstrate its reliability by proving that it is consistent in the large-sample limit, and we verify the empirical validity of our results numerically.
1.3 Related Works
While matrix completion algorithms are well-studied \citeprennie2005fast, candes_exact_2008, candes_noise_2010, pmf_2007, bpmf_2008, the problem of quantifying their uncertainty has received attention only more recently. \citetchen_inference_2019,xia_statistical_2021,farias_uncertainty_2022 obtained asymptotic confidence intervals for missing entries estimated using convex optimization algorithms, under some modeling assumptions for the underlying matrix. \citetgui2023conformalized and \citetshao2023distribution proposed alternative approaches based on conformal inference \citepvovk1999machine,vovk2005algorithmic, requiring fewer assumptions and providing finite-sample inferences. Our method builds upon the conformal inference framework, but is not in competition with the parametric techniques of \citetchen_inference_2019,xia_statistical_2021,farias_uncertainty_2022; on the contrary, in the future these opposite points of view could be combined, possibly drawing strength from the convex optimization analysis to obtain even more informative conformal inferences.
\citetgui2023conformalized and \citetshao2023distribution both seek to construct individual confidence intervals for the unobserved entries in a partially observed matrix, one at a time, although each tackles that problem from a different angle. The view of \citetgui2023conformalized is more similar to ours, as they assume the matrix is fixed while the missingness is random. In \citetshao2023distribution, the matrix is random and the missingness is fixed. Our work departs from that of \citetgui2023conformalized as we seek simultaneous confidence regions for structured groups of missing entries; this is a more challenging problem that requires different modeling choices and involves several methodological innovations due to lack of exchangeability.
While this paper focuses on matrix completion, many components of our method are likely to have broader relevance, potentially enabling structured conformal inferences in different contexts. Many prior works focused on regression \citeplei2014distribution,lei2018distribution,romano2019conformalized,sesia2021conformal, classification \citeplei2013distribution,romano2020classification,liang2024icp, or outlier detection \citepsmith2015conformal,bates2021testing,marandon2022machine, but the applicability of conformal inference extends to numerous other tasks, including functional data analysis \citeplei2015conformal, causal inference \citeplei2021conformal, data sketching \citepsesia2023conformal, and forecasting \citepgibbs2021adaptive,zhou2024conformalized; see also \citetangelopoulos2021gentle for a beginner-level review. In general, conformal inference is easier when dealing with exchangeable data, but there have been many efforts to deal with more difficult situations \citeptibshirani-covariate-shift-2019,podkopaev2021distribution,barber_beyond_2022,einbinder2022conformal,sesia2023adaptive. Building on the weighted exchangeability framework introduced by \citettibshirani-covariate-shift-2019, we focus on the challenges arising from structured calibration.
1.4 Outline of the Paper
Section 2 states our assumptions and goals. Section 3 presents our method. Section 4 delves into essential implementation aspects. The empirical performance of our method is investigated in Section 5. Section 6 concludes with a discussion and some ideas for future research. Additional content is in the appendices. Appendix A1 summarizes technical background information and details the baseline approaches. Appendix A2 explains additional implementation details, Appendix A3 summarizes further empirical results, and Appendix A4 contains all mathematical proofs.
2 Setup and Problem Statement
Let be a fixed matrix with rows and columns. One may consider that the rows of correspond to users and its columns to products, while each entry is the rating assigned by user to product . Although it is common to assume to have a particular (e.g., low-rank) structure, possibly including some independent additive random noise, we follow a different approach. Inspired by \citetgui2023conformalized, we allow to be deterministic and arbitrary, and we model instead the randomness in the data observation (or missingness) process.
Consider a fixed number of observations, , and let denote the observed portion of , indexed by with each . We assume that is randomly sampled without replacement from . For increased flexibility, we allow this sampling process to be weighted according to parameters , which we assume to be known for now. See Appendix A2.4 for details on how these weights may be estimated in practice. In a compact notation, we write our sampling model as:
(1) |
In general, denotes weighted random sampling without replacement of distinct elements (“balls”) from a finite dictionary (“urn”) denoted as , with . More precisely, corresponds to saying that, for any ,
Therefore, the weights do not need to be normalized. This model is a special case of the multivariate Wallenius’ noncentral hypergeometric distribution \citepwallenius1963biased,chesson1976non, and it reduces to random sampling without replacement if all are the same.
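To make the urn interpretation concrete, the following Python sketch draws distinct matrix indices one at a time, each with probability proportional to its weight among the indices still in the urn. This is the usual sequential reading of the Wallenius-type model, and it is offered as an illustration under that assumption. The function name `sample_weighted_without_replacement` and its arguments are hypothetical.

```python
import numpy as np

def sample_weighted_without_replacement(items, weights, m, rng):
    """Draw m distinct items sequentially from an urn.

    At each step, one of the remaining items is picked with probability
    proportional to its (unnormalized) weight. Assumes positive weights.
    """
    items = list(items)
    w = np.asarray(weights, dtype=float)
    if m > len(items):
        raise ValueError("cannot draw more items than the urn contains")
    remaining = np.arange(len(items))  # positions still in the urn
    drawn = []
    for _ in range(m):
        p = w[remaining] / w[remaining].sum()  # renormalize over the urn
        pick = rng.choice(len(remaining), p=p)
        drawn.append(items[remaining[pick]])
        remaining = np.delete(remaining, pick)
    return drawn

# Example: observe 5 entries of a 3 x 4 matrix with heterogeneous weights.
indices = [(r, c) for r in range(3) for c in range(4)]
example_weights = np.arange(1, 13, dtype=float)
observed = sample_weighted_without_replacement(
    indices, example_weights, 5, np.random.default_rng(0)
)
```

Because the weights are renormalized over the remaining items at every step, rescaling all of them by a common factor leaves the sampling distribution unchanged, and equal weights give plain random sampling without replacement.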
We denote as the unordered collection of indices corresponding to the observed matrix entries, and call its complement ; i.e.,
Our goal is to construct a joint confidence region for a group of unobserved matrix entries, indexed by , where each and the group size is fixed. The indices represented by are assumed to be sampled without replacement from , subject to the constraint that they must be in the same matrix column. Moreover, the sampling of from is guided by a distinct set of weights , which are also assumed to be known for now. We will discuss at the end of this section the implications of the choice of .
To ensure that the sampling model for is fully well-defined, we must account for the possibility that some matrix columns may have fewer than missing entries. Therefore, we consider a pruned set of missing indices , defined as:
(2) |
where is the number of missing entries in column . Then, we assume that is sampled according to
(3) |
where is a constrained version of that samples a group of indices belonging to the same column. This distribution is equivalent to the following sequential sampling procedure:
(4) |
where the weights are given by and is the column of .
Note that the sequential sampling procedure in (4) requires the existence of one column with at least unobserved entries. This is always the case as long as , and this is a reasonable assumption in applications where is only sparsely observed.
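The sequential procedure can be sketched in Python as follows, based on the verbal description above. The sketch assumes that the first index is drawn from the pruned set with probability proportional to its test weight, and that the remaining indices are drawn without replacement from the missing entries of the same column, again proportionally to their test weights. The names `sample_test_group`, `missing`, and `test_weights` are hypothetical.

```python
import numpy as np

def sample_test_group(missing, test_weights, k, rng):
    """Sample k missing indices from a single column.

    missing      : list of (row, col) pairs that are unobserved
    test_weights : dict mapping each missing index to a positive weight
    k            : group size
    """
    # Group the missing entries by column.
    by_col = {}
    for (r, c) in missing:
        by_col.setdefault(c, []).append((r, c))
    # Pruning: keep only columns with at least k missing entries.
    pruned = [x for x in missing if len(by_col[x[1]]) >= k]
    if not pruned:
        raise ValueError("no column has at least k missing entries")
    # First index: weighted draw from the pruned set.
    w = np.array([test_weights[x] for x in pruned], dtype=float)
    first = pruned[rng.choice(len(pruned), p=w / w.sum())]
    group = [first]
    # Remaining indices: weighted draws restricted to the same column.
    pool = [x for x in by_col[first[1]] if x != first]
    for _ in range(k - 1):
        w = np.array([test_weights[x] for x in pool], dtype=float)
        pick = rng.choice(len(pool), p=w / w.sum())
        group.append(pool.pop(pick))
    return group
```

Pruning before the first draw guarantees that the chosen column always contains enough missing entries to complete the group.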
This premise allows us to state our goal formally. For a given coverage level , we seek a joint confidence region, denoted as , for the missing matrix entries indexed by . Crucially, this confidence region should be informative (i.e., not too wide) and guarantee finite-sample simultaneous coverage, in the sense that
(5) |
Note that the probability in (5) is taken with respect to both , which is sampled according to (1), and , which is sampled according to (4), while is fixed.
We conclude this section by discussing the important distinction between the sampling weights in (1) and the test weights in (3), which have different interpretations and purposes. Intuitively, the role of is to model situations in which the matrix is not observed uniformly at random. For example, in a collaborative filtering context, some types of users may be more engaged and certain movies tend to receive more ratings. Such patterns can be captured by the model in (1) using heterogeneous weights. Further, non-uniform missingness patterns are often apparent from the observed data, making it possible to estimate empirically \citepgui2023conformalized, as explained in Appendix A2.4.
By contrast, may be independently fixed by the practitioner, and its role is to control the interpretation of the coverage guarantee in (5). To understand this, consider the following two examples. If all , Equation (5) offers coverage only in a marginal sense, for drawn uniformly at random from the missing portion of the matrix \citepgui2023conformalized. If if and only if , for the subset corresponding to “action movies”, Equation (5) can be interpreted as ensuring coverage specifically for action movies. The latter is a stronger type of conditional guarantee \citepromano2020malice, which may be appealing if one indeed cares especially about action movies. Thus, generally allows one to place more or less emphasis on certain unobserved portions of the matrix, interpolating between marginal and conditional views. Of course, there is a trade-off. As we will see empirically, the price of stronger theoretical guarantees obtained with more concentrated weight vectors tends to take the form of wider (less informative) confidence regions \citepfoygel2021limits, which is why some flexibility in the choice of is desirable.
3 Methods
This section describes the key components of our method, which we call Structured Conformalized Matrix Completion (SCMC). Section 3.1 gives a high-level overview of SCMC and outlines it in Algorithm 1. Section 3.2 details the construction of the calibration set utilized by SCMC. Section 3.3 presents a generalized quantile inflation lemma that provides the main theoretical building block for our simultaneous coverage results. Section 3.4 characterizes precisely the conformalization weights needed to apply our quantile inflation lemma in the context of SCMC. Section 3.5 establishes our lower and upper simultaneous coverage bounds. Important computational shortcuts pertaining to the evaluation of our conformalization weights are postponed to Section 4.
3.1 Method Outline
Having observed the matrix entries indexed by , SCMC partitions into two disjoint subsets: a training set and a calibration set , so that . However, departing from the standard approach in (split) conformal inference, we do not partition the data completely at random. On the contrary, since we want the calibration set to exhibit a structure similar to that of the target group , we form and using a more sophisticated approach, the details of which are explained later in Section 3.2.
After appropriately partitioning the observations into and , SCMC trains a matrix completion algorithm using only the data in , producing a point estimate of the full matrix . Any matrix completion algorithm can be applied for this purpose. For example, if is suspected to have an underlying low-rank structure, it may be reasonable to follow a classical convex nuclear norm minimization approach \citepcandes_exact_2008, computing
where denotes the nuclear norm and is the orthogonal projection of onto the subspace of matrices that vanish outside the index set .
Beyond convex optimization, our method can be combined with any matrix completion algorithm, including those based on non-convex factorization \citepsun2016guaranteed or deep learning \citepsedhain2015autorec,fan2017deep. While SCMC tends to produce more informative confidence regions if estimates more accurately, its coverage guarantee will require no assumptions on how is derived from .
Our method translates any black-box estimate into confidence regions for the missing entries as follows. Let be a pre-specified set-valued function, termed prediction rule, that takes as input , a list of target indices , and a parameter , and outputs . (We will often make the dependence of on implicit.) Our method is flexible in the choice of the prediction rule, but we generally require that this function be monotone increasing in , in the sense that
(6) |
and satisfies the following boundary conditions almost-surely:
(7) |
Intuitively, corresponds to placing absolute confidence in the accuracy of , while approaching suggests that the point estimate carries no information about .
For example, a simple prediction rule that satisfies the aforementioned requirements is
(8) |
which produces regions in the shape of a hyper-cube. This approach will be utilized in our numerical experiments due to its ease of interpretation, but it is of course not unique. See Appendix A2.1 for further details and additional examples of alternative prediction rules.
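As an illustration, the Python sketch below implements one hyper-cube rule that satisfies the monotonicity and boundary requirements stated above. The parametrization is an assumption made for this example: the parameter `tau` lies in [0, 1] and the cube has half-width `tau / (1 - tau)` around the point estimates, so that `tau = 0` returns the point estimate itself and `tau = 1` returns the whole space. The function names are hypothetical. The second function returns the smallest parameter at which the cube covers a group of known entries.

```python
import numpy as np

def hypercube_region(m_hat, group, tau):
    """Lower and upper bounds of a hyper-cube centred at the point estimates.

    Assumed parametrization: half-width tau / (1 - tau), with tau in [0, 1].
    """
    center = np.array([m_hat[r, c] for (r, c) in group], dtype=float)
    half_width = np.inf if tau >= 1.0 else tau / (1.0 - tau)
    return center - half_width, center + half_width

def smallest_covering_parameter(m, m_hat, group):
    """Smallest tau for which the cube covers every entry of the group."""
    worst = max(abs(m[r, c] - m_hat[r, c]) for (r, c) in group)
    # Invert worst = tau / (1 - tau).
    return worst / (1.0 + worst)
```

Under this parametrization the region grows monotonically with `tau`, and the smallest covering parameter depends only on the largest absolute residual within the group.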
The purpose of the observations indexed by , which were not used to train , is to find the smallest possible needed to achieve simultaneous coverage (5). As detailed in Section 3.2, SCMC carefully constructs so that it gives us a set of calibration groups , where each consists of observed matrix entries within the same column; i.e., . As explained in the next section, can be fixed arbitrarily, although it should be small compared to the total number of observed matrix entries and typically greater than 100 to avoid excessively high variance in the results \citepvovk2012conditional,sesia2020comparison. Intuitively, these calibration groups are constructed in such a way as to (approximately) simulate the structure of .
For each calibration group, we compute a conformity score , defined as the smallest value of for which the candidate confidence region covers all entries of :
(9) |
Then, the calibrated value of is obtained by evaluating the following weighted quantile \citeptibshirani-covariate-shift-2019 of the empirical distribution of the calibration scores:
(10) |
Above, denotes the quantile of a distribution on the augmented real line ; that is, for , . The distribution in (10) places a point mass on each observed value of and an additional point mass at . The expression of the weights and will be given in Section 3.4. These weights generally depend on and on all , although this dependence is kept implicit here for simplicity.
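The weighted quantile can be computed by sorting the calibration scores and accumulating their weights, as in the following Python sketch. The location of the additional point mass is passed as a parameter `top`, which defaults to positive infinity on the augmented real line. This default, the small numerical tolerance, and the function name `weighted_quantile` are assumptions made for illustration.

```python
import numpy as np

def weighted_quantile(scores, weights, weight_top, level, top=np.inf):
    """Quantile of a discrete distribution with one extra atom.

    The distribution places mass weights[i] on scores[i] and mass
    weight_top on `top`. Returns the smallest support point whose
    cumulative mass reaches `level`.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum() + weight_top  # normalize in case the masses do not sum to one
    order = np.argsort(scores)
    cdf = np.cumsum(weights[order]) / total
    # Small tolerance guards against floating-point round-off in the cumsum.
    reached = np.nonzero(cdf >= level - 1e-12)[0]
    if reached.size == 0:
        return top  # only the extra atom reaches the level
    return scores[order][reached[0]]

# With equal weights this reduces to the usual split-conformal quantile.
tau_hat = weighted_quantile([0.2, 0.5, 0.9], [0.25, 0.25, 0.25], 0.25, 0.75)
```

When the calibration weights alone cannot reach the requested level, the function returns `top`, which corresponds to an uninformative confidence region.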
Finally, the calibrated parameter is utilized to construct a joint confidence region
(11) |
This will be proved in Section 3.5 to have valid simultaneous coverage (5), as long as is sampled from (1) and from (4). The overall procedure is summarized by Algorithm 1, while all missing details will be carefully explained in the subsequent sections.
3.2 Assembling the Structured Calibration Set
This section explains how to partition into a training set and a collection of calibration groups that approximately mimic the structure of . To begin, we note that the number of calibration groups cannot exceed and, further,
(12) |
where is the number of observed entries in column , which is a function of in (1). To satisfy these constraints, as a practical rule-of-thumb one may set . In the following, we will assume that is a fixed parameter (e.g., ) guaranteed to satisfy the upper bound in (12). This simplification streamlines the analysis of SCMC without much loss of generality. In principle, it would also be possible to set in a data-independent way so that (12) holds with high probability, as long as is not too large compared to and .
For any given satisfying (12), we partition into a training set and a collection of calibration groups as detailed in Algorithm 2. After initializing an empty , we iterate over each column and assign to a random subset of observations from that column, where is the total number of observations in column . This preliminary step ensures that the remaining number of observations in column is a multiple of (possibly zero). Then, for each , is obtained by sampling observations uniformly without replacement from a randomly chosen matrix column. Finally, all remaining observations are assigned to .
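The partitioning steps described above can be sketched in Python as follows. One detail is an assumption of this sketch: the text says only that each group comes from a randomly chosen column, and here the column is chosen with probability proportional to its number of remaining observations. The function name `assemble_calibration_groups` and its arguments are hypothetical.

```python
import numpy as np

def assemble_calibration_groups(observed, k, n_groups, rng):
    """Split observed indices into a training set and calibration groups.

    observed : list of (row, col) pairs
    k        : group size
    n_groups : number of calibration groups to form
    """
    by_col = {}
    for (r, c) in observed:
        by_col.setdefault(c, []).append((r, c))
    train = []
    # Pruning step: shuffle each column, then move (count mod k) random
    # observations to the training set so the remainder is a multiple of k.
    for c, entries in by_col.items():
        perm = rng.permutation(len(entries))
        shuffled = [entries[j] for j in perm]
        n_prune = len(shuffled) % k
        train.extend(shuffled[:n_prune])
        by_col[c] = shuffled[n_prune:]
    if n_groups > sum(len(v) // k for v in by_col.values()):
        raise ValueError("n_groups exceeds the number of available groups")
    groups = []
    for _ in range(n_groups):
        # Assumed rule: pick a column proportionally to its remaining count.
        cols = [c for c, v in by_col.items() if v]
        sizes = np.array([len(by_col[c]) for c in cols], dtype=float)
        c = cols[rng.choice(len(cols), p=sizes / sizes.sum())]
        # Entries are already shuffled, so popping k of them is a uniform
        # draw without replacement from that column.
        groups.append([by_col[c].pop() for _ in range(k)])
    # Everything left over goes to the training set.
    for v in by_col.values():
        train.extend(v)
    return train, groups
```

Since the remaining count in every column stays a multiple of the group size throughout, any non-empty column can always supply a complete group.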
Algorithm 2 intuitively mimics the sampling model for defined in (3), with the key difference that it samples the calibration groups from instead of . This unavoidable discrepancy, however, is delicate, as it implies that are neither exchangeable nor weighted exchangeable \citeptibshirani-covariate-shift-2019 with the test group . Therefore, an innovative approach is needed to translate these calibration groups into valid simultaneous confidence regions, as explained in the next section.
3.3 A General Quantile Inflation Lemma
Consider a conformity score , defined similarly to the scores in (9),
(13) |
In words, is the smallest for which covers all entries of . Although this score cannot be observed because the matrix entries indexed by are latent, it is a well-defined and useful quantity. It allows us to write the probability that the confidence region output by Algorithm 1 simultaneously covers all elements of as:
(14) |
To establish that Algorithm 1 achieves simultaneous coverage (5), the right-hand-side of (14) must be bounded from below by , for a suitable (and practical) choice of the weights and used to compute in (19). This is not straightforward because the scores are neither exchangeable nor weighted exchangeable, as they respectively depend on and . A solution is provided by the following lemma due to \citettibshirani-covariate-shift-2019.
Lemma 1 (from \citettibshirani-covariate-shift-2019).
Let be random variables with joint law . For any fixed function and , define , where . Assume that are distinct almost surely. Define also
(15) |
where is the set of all permutations of . Then, for any ,
Translating Lemma 1 into a practical method requires evaluating the weights defined in (15), which generally involves a computationally unfeasible sum over an exponential number of permutations. If the distribution satisfies a symmetry condition called “weighted exchangeability”, it was shown by \citettibshirani-covariate-shift-2019 that the expression in (15) simplifies greatly, but this is not helpful in our case because do not enjoy such a property. Further, it is unclear how Algorithm 2 may be modified to achieve weighted exchangeability.
Fortunately, our groups satisfy a “leave-one-out exchangeability” property that still enables an efficient computation of the conformalization weights in (15). Intuitively, the joint distribution of is invariant to the reordering of the first variables.
Proposition 1.
Let and be subsets of observed and missing matrix entries, respectively, sampled according to (1). Let be a test group sampled according to (3) conditional on . Suppose are the calibration groups output by Algorithm 2, while and are the corresponding training and pruned observation sets. Then, for any permutation of ,
The proof of Proposition 1 is in Appendix A4.2. Intuitively, this is established by deriving the joint distribution of conditional on and . The usefulness of this result becomes clear in the light of the following specialized version of Lemma 1.
Lemma 2.
Let be leave-one-out exchangeable random variables, so that there exists a permutation-invariant function such that their joint law can be factorized as
(16) |
for some function taking as first input an unordered set of elements. For any fixed function and any , define , where . Assume that are almost-surely distinct. Then, ,
(17) |
where
(18) |
3.4 Characterization of the Conformalization Weights
We now characterize explicitly the conformalization weights needed to apply Lemma 2 to our problem. The following notation is useful for this purpose.
Denote by an augmented version of that also includes the unordered set of indices corresponding to the test group ; i.e.,
Similarly, let and denote possible realizations of and , respectively. Then, for any , denote by the imaginary set of observations obtained by replacing the indices corresponding to the calibration group with those corresponding to the test group . Further, let denote the original observation set. In summary,
where is a realization of and is a realization of .
Next, let denote the numbers of observations in column from the sets . Define also , the corresponding numbers of observations remaining in column after the random pruning step of Algorithm 2. For any , let denote the column to which belongs; i.e., , where is the column of the -th entry in . Further, let denote a realization of in (2). With slight abuse of notation, we denote the set of missing indices in column excluding those in the group as . We are now ready to state how Lemma 2 applies in our setting, with an explicit expression for the conformalization weights in (18).
Lemma 3.
Under the setting of Proposition 1, let denote , and represent the corresponding scores given by (9) and (13), respectively, based on a matrix estimate computed from the observations in . Then, Equation (17) from Lemma 2 applies conditional on and , with weights
(19) |
Above, and have explicit expressions that depend on the weights in (3); i.e.,
(20) |
with
and, for all ,
(21) |
The main challenge in the computation of (19) arises from the term , which is the probability of observing the matrix entries in and depends on the sampling weights in (1). Although this probability cannot be evaluated analytically, it can be approximated with an efficient algorithm, which makes it possible to compute the conformalization weights in (19) at cost , as explained in Section 4.
3.5 Finite-Sample Coverage Bounds
The following theorem states formally that Algorithm 1 produces joint confidence regions with simultaneous coverage for random groups sampled according to the model defined in (3). This result follows by integrating Proposition 1, Lemma 2, and Equation (19).
Theorem 1.
Note that the probability in Theorem 1 is taken over the randomness in and , while and can be considered fixed. Therefore, this result implies the simultaneous coverage property stated earlier in (5). Further, it is also possible to bound our simultaneous coverage from above.
Theorem 2.
Theorem 2 is proved in Appendix A4.4. A numerical investigation of the expected value on the right-hand-side of (22), conducted in Appendix A3.2, demonstrates that in practice the upper bound in (22) converges to as increases. This is consistent with our empirical observations that Algorithm 1 is not too conservative, as previewed in Figure 1.
4 Computational Shortcuts and Cost Analysis
4.1 Efficient Evaluation of the Conformalization Weights
We now explain how to efficiently approximate the conformalization weights in (19), for all . The main challenge is to evaluate according to the missingness model defined in (1). In truth, it suffices to relate this probability, which depends on the index , to , which is constant and can thus be ignored when computing (19). In this section, we demonstrate that their ratio can be expressed in a much more tractable form, one whose computational complexity does not increase with the matrix dimensions.
We begin by expressing and , for any , as closed-form integrals. Let denote the cumulative weight of all missing indices and, for any positive scaling parameter , define of as
(23) |
Further, define also for all .
Proposition 2.
For any fixed , scaling parameter , and ,
(24) |
where, for any ,
(25) |
Note that for all if , and in that case Proposition 2 recovers a classical result by \citetwallenius1963biased. See the proof of Proposition 2 in Appendix A4.5 for further details. Furthermore, the function in (25) is a product of only simple functions of , and therefore it is straightforward to evaluate even for large matrices.
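To make the classical special case concrete, the following minimal sketch checks a Wallenius-type integral representation against brute-force enumeration on a toy example. It assumes successive weighted sampling without replacement and the standard form of the integral; the function names, the toy weights, and the simple rule used to pick the scaling parameter `h` are all illustrative assumptions, not the choices made in this paper (which tunes the scaling parameter to center the peak of the integrand).

```python
import itertools
import numpy as np


def prob_bruteforce(weights, observed):
    """Probability that the first len(observed) successive weighted draws
    (without replacement) select exactly the set `observed`, obtained by
    summing over all draw orders. Only feasible for very small sets."""
    total_weight = float(np.sum(weights))
    prob = 0.0
    for order in itertools.permutations(observed):
        remaining = total_weight
        p = 1.0
        for i in order:
            p *= weights[i] / remaining
            remaining -= weights[i]
        prob += p
    return prob


def prob_integral(weights, observed, n_nodes=200):
    """Same probability via a Wallenius-type integral, after the change of
    variables t = tau**(h * delta) with a scaling parameter h > 0."""
    weights = np.asarray(weights, dtype=float)
    w_obs = weights[list(observed)]
    delta = weights.sum() - w_obs.sum()  # cumulative weight of unobserved items
    # Hypothetical choice of h: make every exponent at least 1, so that the
    # integrand has no singularity at zero (not the tuning used in the paper).
    h = max(1.0 / delta, 1.0 / w_obs.min())
    nodes, quad_w = np.polynomial.legendre.leggauss(n_nodes)
    tau = 0.5 * (nodes + 1.0)  # map Gauss-Legendre nodes from [-1, 1] to [0, 1]
    quad_w = 0.5 * quad_w
    product = np.prod(1.0 - tau[:, None] ** (h * w_obs[None, :]), axis=1)
    integrand = h * delta * tau ** (h * delta - 1.0) * product
    return float(np.sum(quad_w * integrand))


weights = np.array([1.0, 2.0, 0.5, 1.5, 3.0])
observed = (0, 2, 3)
print(prob_bruteforce(weights, observed), prob_integral(weights, observed))
```

The enumeration grows factorially with the number of observed items, whereas the integrand is a product of simple terms; this is why an integral representation is the natural starting point for larger problems.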
Proposition 2 provides the foundation for evaluating the conformalization weights in (19). The remaining difficulty is that (24) has no analytical solution. Fortunately, the function satisfies some properties that make it feasible to approximate this integral accurately.
Lemma 4.
If , the function defined in (23) has a unique stationary point with respect to at some value . Further, is a global maximum.
See Figure 2 for a visualization of and , in two examples where the sampling weights in (1) are independent and uniformly distributed on . These results show that becomes increasingly concentrated around its unique maximum for larger sample sizes, while remains relatively smooth (or flat) at that point. Therefore, it makes sense to approximate this integral through a careful extension of Laplace’s method \citeplaplace1774memoire. This is explained below.
The first step in approximating the integral in (24) with a generalized Laplace method (justified later in Section 4.2) is to modify the integrand in such a way as to move the peak away from the integration boundary. To this end, define as
(26) |
and recall that is a parameter that we are free to choose. Therefore, we will tune in such a way as to center the peak within the integration domain; that is, we pick a value such that . Fortunately, Lemma 4 tells us that the function has a unique global maximum at when , and a suitable value of such that can be found by applying the Newton-Raphson iterative algorithm; see Appendix A2.2 for further details.
Having fixed such that , a Laplace approximation can be obtained as follows. The key intuition is that, as the number of observations grows, the peak of the function increasingly dominates the integral. In particular, a second-order Taylor expansion shows that the integral is primarily determined by the value of at and by the curvature of at the peak, namely . This leads to the following approximation,
(27) |
As explained below, this approximation becomes very accurate in the large-sample limit, and it is useful because it allows us to approximate the ratio with a quantity, , that is straightforward to calculate. For example, if the sampling weights in (1) are uniformly constant, for all and any .
By combining (27) with (19), it follows that, for each , the conformalization weight can be approximately rewritten in the large-sample limit as
(28) |
with the un-normalized weight given by:
(29) |
This finally makes Algorithm 1 practical because evaluating in (29) only involves simple arithmetic operations and can be carried out very efficiently, as explained in Section 4.3.
4.2 Consistency of the Generalized Laplace Approximation
It is important to emphasize that our approximation in (27) is not obtained from a standard application of the Laplace method, since the latter is typically restricted to handling integrals of simpler functions; see Appendix A1.4. Yet, the Taylor approximation ideas underlying the Laplace method are versatile enough to be extended to our setting, as demonstrated by the following theorem. This novel result provides a rigorous justification for the generalized Laplace approximation in (27). For simplicity, but without much loss of generality, this theorem relies on some additional technical assumptions, which will be justified in our context towards the end of this section. This result is presented informally here for simplicity, but a formal statement can be found in Appendix A4.6, along with its proof.
Theorem 3 (Informal statement of Theorem A4).
Let denote a sequence of i.i.d. random variables from some distribution supported on , and a sequence of independent Bernoulli random variables, with . Define and let
(30) |
where is such that . Define also . Then, for a sequence of functions bounded away from 0 and satisfying certain smoothness conditions,
(31) |
To relate this result to the Laplace approximations described in Section 4.1, let us compare the function in (23) with the function in (30). Given a mapping from the sequence to the matrix entries , we can express as:
(32) |
Therefore, the discrepancy between and can be traced to the different sampling models describing the distributions of our matrix observations and the variables in Theorem 3. In Theorem 3, the observations follow independent Bernoulli distributions, whereas the matrix entries in our model (3) are sampled without replacement. These views can be reconciled as follows. Sampling without replacement is a natural modelling choice for the simultaneous inference problem studied in this paper, but it would make the proof of Theorem A4 too complicated. Nevertheless, these two models are qualitatively consistent. Suppose the sampling weights in (1) are constant; then, in that special case our model corresponds to that of Theorem 3 with , for some constant , after conditioning on the observed number of entries .
4.3 Computational Complexity
The SCMC method described in this paper can be implemented efficiently and is able to handle completion tasks involving large matrices. Its practicality is demonstrated in this section, which summarizes the results of an analysis of the computational complexity of different components of Algorithm 1. We refer to Appendix A2.3 for the details behind this analysis and an explanation of the underlying computational shortcuts with which all redundant operations are streamlined.
In summary, the cost of producing a joint confidence region for a test group of size using Algorithm 1 is , where denotes the fixed cost of training the black-box matrix completion model based on and is the number of calibration groups. Further, it is possible to recycle redundant calculations when constructing simultaneous confidence regions for distinct test groups , as explained in Appendix A2.3. Therefore, the overall cost of obtaining distinct confidence regions for different groups is only . See Table 1 for a summary of these results.
5 Empirical Demonstrations
We apply SCMC to simulated and real data, comparing its performance to those of the unadjusted and Bonferroni baselines. This section is organized as follows. Section 5.1 describes experiments based on simulated data, with Section 5.1.1 focusing on (known) uniform sampling weights, and Section 5.1.2 allowing the sampling weights for the observed data to be heterogeneous (although still known exactly). Section 5.2 describes more realistic experiments involving the MovieLens data, considering estimated sampling weights. The results of additional experiments are presented in the Appendices. Appendix A3.1 describes experiments with synthetic data involving heterogeneous test weights. Appendix A3.2 investigates the tightness of the theoretical coverage upper bounds derived in Section 3.5.
5.1 Numerical Experiments with Synthetic Data
5.1.1 Uniform Sampling Weights
We begin with a simple scenario in which the observation pattern in (3) is completely random and the test weights in (4) are uniform: for all . A matrix with rows and columns is generated based on a “signal plus noise” model that exhibits both a low-rank structure and column-wise dependencies. (For example, in the Netflix data set, users may tend to agree on the quality of certain movies, leading to positive dependency among the columns of the rating matrix.) This design is motivated by the intuition that column-wise dependencies make our simultaneous inference task especially challenging, helping us better understand the settings under which our method brings larger practical advantages relative to the baselines.
The ground truth matrix is obtained as , where is low-rank while is a noise matrix exhibiting column-wise dependencies whose strength can be tuned as a control parameter, as detailed below.
1. is given by a random factorization model with rank ; i.e., , where and are such that
(33) |
2. , where is a vector of ones, has i.i.d. standard normal components, and is such that, for all ,
(34) |
for suitable parameters and . Thus, has constant columns, and larger values of result in stronger column-wise dependencies compared to the background i.i.d. noise described by the matrix . In the following, the value of is varied as a control parameter, while we fix .
For a given ground truth matrix generated as described above, we observe entries, randomly sampled according to the model defined in (1) with for all . Let denote the unordered collection of these observed indices. Then, 100 test groups of size , where is a control parameter, are sampled without replacement from , according to the model defined in (3) with for all .
The simultaneous confidence region for a test group is constructed by applying Algorithm 1 with calibration groups, where , defined in (12), denotes the maximum possible number of such groups. Note that the matrix algorithm leveraged by our method can thus be trained using observed entries of , indexed by .
While SCMC can leverage any matrix completion algorithm producing point predictions, here we employ the alternating least squares approach of \citethu_cf_2008, which is designed to recover low-rank signals. For simplicity, we apply this algorithm with a hypothesized rank of 5, which matches the true rank of . It is worth repeating, however, that the validity of the SCMC confidence regions is independent of both the true and the matrix completion model.
Our method is compared to the two baselines introduced in Section 1.2. Recall that the first one is a naive unadjusted heuristic that ignores the multiple testing aspect of our simultaneous inference problem and essentially applies Algorithm 1 with repeatedly for every individual entry in . This ensures valid coverage for each entry in separately, but does not guarantee simultaneous coverage for groups with . By contrast, the second Bonferroni baseline relies on a crude and overly conservative multiple testing adjustment to achieve simultaneous coverage, essentially applying Algorithm 1 with at level instead of . Both baseline approaches are applied using the same matrix completion model leveraged by our method, and their predictions are calibrated using a calibration set containing observed matrix entries.
Figure 3 summarizes the results of these experiments as a function of and for different values of the noise parameter . Each method is assessed in terms of the average width of the output confidence regions, at level , and of the empirical simultaneous coverage for the 100 test groups. All results are averaged over 300 independent experiments. Our method always achieves the desired 90% simultaneous coverage, as predicted by the theory, while the unadjusted baseline becomes increasingly anti-conservative for larger values of . Further, our method leads to more informative confidence regions compared to the Bonferroni baseline, which becomes increasingly conservative with larger values of and . See Figure A10 in Appendix A3 for a different view of these results, highlighting the behavior of all methods as a function of , for different values of .
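The two evaluation metrics used here can be sketched as follows. This is a minimal illustration assuming hyper-rectangular regions stored as arrays of lower and upper limits, with one row per test group and one column per group member; the function names and array layout are hypothetical and are not taken from the accompanying software package.

```python
import numpy as np


def simultaneous_coverage(lower, upper, truth):
    """Fraction of test groups whose true entries ALL fall inside the region."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    truth = np.asarray(truth, dtype=float)
    covered = np.all((truth >= lower) & (truth <= upper), axis=1)
    return float(np.mean(covered))


def average_width(lower, upper):
    """Average interval width over all groups and all group members."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean(upper - lower))


# Toy example with 2 groups of size 2 (hypothetical numbers).
lower = [[0.0, 0.0], [0.0, 0.0]]
upper = [[1.0, 1.0], [1.0, 1.0]]
truth = [[0.5, 0.5], [0.5, 2.0]]
print(simultaneous_coverage(lower, upper, truth), average_width(lower, upper))
```

A group counts as covered only if every one of its entries is covered, which is what distinguishes simultaneous coverage from entry-wise coverage.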
5.1.2 Heterogeneous Sampling Weights
Moving beyond the setting of data missing completely at random, we now consider similar experiments in which the sampling weights of the observation model (3) are heterogeneous, while the matrix has a simple low-rank structure. In particular, is generated according to the random factorization model defined in (33), so that with rank and . The sampling weights are chosen so as to introduce an interesting spatial missingness pattern, with some rows and columns being more densely observed than others. Precisely, we set
(35) |
where controls the degree of heterogeneity. If , the missingness is uniform, whereas larger values of result in columns with higher indices being more densely observed.
Based on this model, we randomly sample without replacement matrix entries (from a total of 160,000) and then apply Algorithm 1 similarly to the previous section, using calibration groups and allocating the remaining observations to train the matrix completion model. For the latter, we rely on the same alternating least squares algorithm \citephu_cf_2008 as in the previous section, with hypothesized rank 8. The two baseline approaches are also applied similarly, following an approach analogous to that described in Section 5.1.1.
All methods are evaluated on a test set of 100 test groups sampled without replacement according to the model defined in (3), with uniform weights for all . The level is . All results are averaged over 300 independent experiments.
Figure 4 reports on the results of these experiments as a function of the parameter in (35), for different values of . As predicted by the theory, our method always achieves valid simultaneous coverage, unlike the unadjusted baseline. Further, our method produces relatively informative confidence regions compared to the Bonferroni approach, as the latter becomes more conservative for larger values of . This can be understood as follows. As the matrix completion model naturally finds it easier to recover more accurately the missing entries belonging to more densely observed columns, the heterogeneous sampling model tends to introduce spatial dependencies in the residual matrix . These dependencies, which intuitively become stronger for larger values of , make our simultaneous inference task intrinsically more challenging, resulting in wider confidence regions for all methods, but have a disproportionate adverse effect on the Bonferroni approach (which implicitly but incorrectly assumes the miscoverage events corresponding to different entries to be mutually independent). See Figure A11 in Appendix A3 for a different view of these results, highlighting the behavior of all methods as a function of , for different values of .
5.2 Numerical Experiments with MovieLens Data
We now apply our method to the MovieLens 100K \citepmovielens100k data and compare its performance to those of the unadjusted and Bonferroni baselines. This data set contains 100,000 ratings (on a scale from 1 to 5) provided by 943 users for 1682 movies. Therefore, approximately 94% of all possible ratings are missing. To reduce the memory requirements of the matrix completion algorithm utilized to compute , we reduce the matrix size by half, focusing on a smaller rating matrix , corresponding to a random subset of 800 users and 1000 movies.
As usual, we denote the set of indices for the observed matrix entries as and its complement as . Since the true sampling weights are unknown in this application, we compute estimated weights with a data-driven approach inspired by \citetgui2023conformalized, as described in Appendix A2.4. Algorithm 1 is then applied with instead of , to construct simultaneous confidence regions for the unobserved ratings of 100 random test groups . We utilize calibration groups and vary the group size as a control parameter. The test groups are randomly sampled without replacement from according to the model defined in (3), with uniform weights . The matrix completion algorithm is trained as described in the previous sections, applying the alternating least squares approach of \citethu_cf_2008 based on observations. The hypothesized rank of utilized by this model to obtain is varied as an additional control parameter. As before, the baseline approaches are also applied based on the same matrix completion model, to facilitate the comparison with our method.
Figure 1, previewed earlier in Section 1.2, reports on the results of these experiments as a function of the group size and of the hypothesized rank utilized by the matrix completion model. The confidence regions are assessed based on their average width alone, since it is impossible to measure the empirical coverage given that the ground truth is unknown. The results show that SCMC produces more informative (narrower) confidence regions compared to the Bonferroni approach, consistent with the results of our previous experiments based on synthetic data. Figure 1 displays only the performance of the Bonferroni baseline because the unadjusted baseline is not intended to provide valid simultaneous coverage, making it less suitable for comparisons lacking a verifiable ground truth. Nevertheless, Figure A12 in Appendix A3.5 includes a comparison with both baselines, demonstrating that our simultaneous confidence regions are not much wider than those produced by the unadjusted baseline. Further, our method’s higher reliability compared to the unadjusted baseline is supported by the following additional experiments, conducted using the same data but under a more artificial setting in which the ground truth is known.
To evaluate the coverage on the MovieLens data, we carry out similar but more closely controlled experiments in which the test groups are drawn not from (for which the ground truth is unknown) but from a hold-out subset containing 20% of the observed matrix indices in . Algorithm 1 is then applied to construct confidence regions for the unobserved ratings of 100 random test groups sampled from , proceeding as described before but utilizing only the observed data in instead of .
Since the estimation of acknowledges the existence of an unobserved set of entries , in this setting our method is essentially aiming to achieve simultaneous coverage for test groups sampled from instead of . Of course, we can only evaluate the empirical coverage for test groups sampled from , and this is why these experiments are useful to understand the robustness of our inferences to possible distribution shifts between and .
Figure 5 compares the performance of each method under this hybrid setting, focusing on test groups sampled from the hold-out data in . These results are reported as a function of , for different values of the hypothesized rank in the matrix completion model. Consistent with the previous results, our method leads to more informative inferences compared to the Bonferroni approach, and it nearly achieves the desired 90% simultaneous coverage for the test groups sampled from , even though in theory one would only expect it to have valid coverage on average over all test groups sampled from . The nearly valid coverage also demonstrates the robustness of our method towards possible misspecification of the sampling weights.
6 Discussion
This paper introduces a principled and effective method for simultaneous conformal inference in matrix completion. Although primarily motivated by the challenges of uncertainty estimation for group recommender systems, our approach is sufficiently modular and flexible to be potentially relevant beyond our initial focus. In particular, the core idea of leveraging a structured calibration set to approximately replicate the patterns expected at test time could be adapted to obtain joint inferences beyond the task of predicting multiple user ratings for the same product. Moreover, our newly introduced notion of leave-one-out exchangeability and the related conformalization techniques extend the existing framework for conformal inference under covariate shift proposed by \citettibshirani-covariate-shift-2019, and these advances may be useful in other applications of conformal inference.
A related direction for future research may involve extending our method to accommodate the jackknife+ framework of \citetbarber_cv+_2021. The data-splitting approach adopted in this paper may not be fully satisfactory in situations where the observations are very limited. In fact, a scarcity of training data may result in less accurate point estimates, thereby reducing the informativeness of our inferences, and a scarcity of calibration data generally leads to more unstable outputs. In contrast, cross-validation can make a more efficient use of the limited data, although at the price of increased theoretical challenges and more expensive computations.
Software Availability
A software package implementing the methods and numerical experiments described in this paper is available at https://github.com/ZiyiLiang/simultaneous-matrix-completion.
Acknowledgements
M. S. was partly supported by NSF grant DMS 2210637.
Appendix A1 Additional Technical Background
A1.1 Review of Individual-Level Conformalized Matrix Completion
This section reviews the conformalized matrix completion method proposed by \citetgui2023conformalized, which is designed to produce confidence intervals for one missing entry at a time.
The setup of \citetgui2023conformalized is similar to ours as they also treat as fixed and assume the randomness in the matrix completion problem comes from the observation process or, equivalently, the missingness mechanism. However, their modeling choices do not match exactly with ours. Specifically, they assume that each matrix entry in row and column is independently observed with some (known) probability , which roughly corresponds to our sampling weights in (1); i.e., , where
(A36) |
Therefore, the total number of observed entries is a random variable in \citetgui2023conformalized, whereas we can allow to be fixed within the sampling without replacement model defined in (1). As shown in this paper, our modeling choice is natural when aiming to construct group-level simultaneous inferences. The model assumed by \citetgui2023conformalized also differs from ours in its requirement that all sampling weights must be strictly positive; for all in (A36). Further, the approach of \citetgui2023conformalized differs from ours in that they assume the missing matrix index of interest, namely , to be sampled uniformly at random from , that is, , where . By contrast, our sampling model for the test groups, defined in (3), can accommodate heterogeneous weights .
The method proposed by \citetgui2023conformalized constructs conformal confidence intervals for individual missing entries as follows. First, is partitioned into a training set and a disjoint calibration set by randomly sampling independently for all , for some fixed parameter , and then defining
(A37) |
Similar to us, \citetgui2023conformalized utilize to compute , leveraging any black-box algorithm, and then evaluate conformity scores on the calibration data as explained below.
Let denote a pre-specified prediction rule for a single matrix entry, which should be monotonically increasing in the tuning parameter as explained in Section 3.1; for example, this could correspond to the prediction rule defined in (8) in the special case of . For any , let denote the conformity score corresponding to , as in (9). Imagining that the calibration set contains the indices of matrix entries, the method of \citetgui2023conformalized evaluates for all and then calibrates the tuning parameter by computing
(A38) |
where the conformalization weights are given by
(A39) |
for all , with the convention that . Finally, the confidence interval for the latent value of at index is given by:
(A40) |
The following result establishes that the confidence intervals defined in (A40) have guaranteed marginal coverage at level .
Proposition A3 (from \citetgui2023conformalized).
Proof.
This result follows directly from the proof of Theorem 3.2 in \citetgui2023conformalized. Alternatively, the following proof can be obtained by applying our Lemma 2. Conditioning on , such that , note that the joint distribution of trivially satisfies the leave-one-out exchangeability condition defined in Lemma 2. Specifically, let be a realization of , so that is a realization of sampled according to (A37). Then,
where the second equality follows from Lemma 3.1 in \citetgui2023conformalized, for a suitable function that is invariant to any permutation of its input. Further, it follows that
with defined as in (A39). This proves that are leave-one-out exchangeable random variables by the definition in (16), with . Therefore, the coverage guarantee of Proposition A3 follows directly from Lemma 2. ∎
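The weighted calibration step reviewed in this section can be sketched generically as a weighted empirical quantile with an extra point mass at infinity for the test point, in the spirit of weighted conformal prediction. The sketch below takes the un-normalized calibration weights and the test weight as given inputs; it does not reproduce the specific weights in (A39), and the function and argument names are illustrative assumptions.

```python
import numpy as np


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Smallest score t such that the normalized weight of calibration scores
    <= t is at least 1 - alpha, with the test point's mass placed at +inf."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum() + test_weight
    order = np.argsort(scores)
    cum_mass = np.cumsum(weights[order]) / total
    idx = int(np.searchsorted(cum_mass, 1.0 - alpha, side="left"))
    if idx >= scores.size:
        # Not enough calibration mass: the quantile is the point mass at +inf.
        return float("inf")
    return float(scores[order][idx])


# With equal weights this reduces to the usual split-conformal quantile.
print(weighted_conformal_quantile([3.0, 1.0, 4.0, 2.0], np.ones(4), 1.0, alpha=0.5))
```

An infinite output corresponds to an uninformative interval, which is the expected behavior when the calibration set is too small for the requested level.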
A1.2 Limitations of the Unadjusted and Bonferroni Baselines
It is not easy to construct informative simultaneous confidence regions satisfying (5) and, to the best of our knowledge, there are no satisfactory alternatives to the method proposed in this paper. In fact, standard conformal methods are designed to deal with one test point at a time, and directly aggregating separate prediction intervals into a joint confidence region is neither precise nor efficient in our context, as explained in more detail below.
Recall that the conformalized matrix completion method of \citetgui2023conformalized, reviewed in Appendix A1.1, is designed to construct a confidence interval for one missing entry at a time, denoted as , such that
(A41) |
under a suitable sampling model for and . The model for and considered by \citetgui2023conformalized is different from ours, as they treat as random, rely on independent Bernoulli observations instead of sampling without replacement, and do not consider the possibility that the sampling weights in (3) may be non-uniform. However, a similar idea can be adapted to construct confidence intervals for a single matrix entry under our sampling model (1)–(3), as explained in Appendix A1.3. In any case, regardless of these modeling details, the limitations of the baseline approaches within our simultaneous inference context can already be understood as follows.
If the goal is to make joint predictions for a group of matrix entries, concatenating individual-level predictions clearly does not guarantee simultaneous coverage in the sense of (5), as the errors across different coordinates tend to accumulate. This may be seen as an instance of the prototypical multiple testing problem. The unadjusted baseline approach essentially computes:
(A42) |
As demonstrated by Figure 5 and other synthetic experiments in Section 5.1, this approach often leads to low simultaneous coverage.
Figure 1 previewed the performance of a second baseline approach that relies on a simple but inefficient Bonferroni correction to approximately ensure simultaneous coverage. Intuitively, this tries to (conservatively) account for the multiplicity of the problem by applying (A42) at level instead of , computing
(A43) |
Although a Bonferroni correction may seem reasonable at first sight, it is still unsatisfactory for at least two reasons. Firstly, it is not rigorous because we know the missing entries indexed by must belong to the same column, but this constraint cannot be easily taken into account by individual-level predictions. Secondly, and even more crucially, the Bonferroni correction tends to be overly conservative in practice because the coverage events for different values of are mutually dependent, since they are all affected by the same observations . These dependencies, however, are potentially very complex.
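The behavior of the two baselines can be illustrated with a small Monte Carlo sketch. It assumes a purely hypothetical model in which the standardized prediction errors within a group are equicorrelated Gaussians with correlation `rho`, and each entry receives an exact two-sided interval at the stated level; none of these modeling choices come from the paper, and the numbers only illustrate the qualitative effect.

```python
from statistics import NormalDist

import numpy as np


def joint_coverage(K, level, rho, n_sim=200_000, seed=0):
    """Monte Carlo estimate of the probability that K equicorrelated
    standard normal errors all fall inside their marginal intervals."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((n_sim, 1))
    individual = rng.standard_normal((n_sim, K))
    z = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * individual
    half_width = NormalDist().inv_cdf(1.0 - level / 2.0)
    return float(np.mean(np.all(np.abs(z) <= half_width, axis=1)))


alpha, K = 0.1, 5
for rho in (0.0, 0.9):
    unadjusted = joint_coverage(K, alpha, rho)      # each entry at level alpha
    bonferroni = joint_coverage(K, alpha / K, rho)  # each entry at level alpha / K
    print(f"rho={rho}: unadjusted={unadjusted:.3f}, Bonferroni={bonferroni:.3f}")
```

Under independence, the unadjusted joint coverage is close to (1 - alpha)^K, well below the nominal level, while under strong dependence the Bonferroni intervals cover noticeably more often than required and are therefore wider than necessary.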
A1.3 Implementation Details for the Baselines
To facilitate the empirical comparison with our method, which relies on the sampling model for and defined in (1)–(3), in this paper we apply the unadjusted and Bonferroni baseline approaches described in Appendix A1.2 based on individual-level conformal prediction intervals obtained as follows. Instead of directly applying the conformalized matrix completion method of \citetgui2023conformalized, we repeatedly apply our own method separately for each element in , imagining each time that we are dealing with a trivial group of size 1. This provides us with individual-level prediction intervals that are similar in spirit to those of \citetgui2023conformalized but whose construction more faithfully mirrors the sampling model assumed in this paper (although they still ignore the constraint that all elements of must belong to the same column). In summary, the implementation of the unadjusted and Bonferroni baseline approaches applied in this paper is outlined by Algorithms A3 and A4, respectively.
A1.4 Review of the Classical Laplace Method
This section provides a concise review of the classical version of Laplace’s method, as detailed for example in \citetbutler2007saddlepoint. This method is a powerful tool for approximating analytically intractable integrals of the form , where the function is sufficiently well-behaved and smooth, with a unique global maximum at an interior point , the function is positive and does not vary significantly near , and is a relatively large constant. The method hinges on the principle that this integral’s value is predominantly determined by a small region around the point where achieves its maximum. This idea is explained in more detail and motivated precisely below.
Let be a twice continuously differentiable function on an interval , and assume there exists a unique global maximum at an interior point , such that and . Suppose is a function that varies slowly around and is such that for all . Then, Laplace’s approximation involves replacing the integral with
(A44) |
A standard mathematical justification for this approximation starts by proving that, under suitable technical assumptions on and in the spirit of the intuitive conditions outlined above,
(A45) |
The classical proof of (A45) consists of three high-level steps:
1. Local second-order approximation: Approximate near using a second-order Taylor expansion: .
2. Integral transformation: Standardize the quadratic term in the integral to apply results from Gaussian integral analysis.
3. Asymptotic evaluation: Assess the integral in the standardized coordinates to achieve the asymptotic equivalence in (A45).
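As a numerical illustration of the classical method, the sketch below applies the textbook Laplace formula, h(x0) exp(n g(x0)) sqrt(2 pi / (n |g''(x0)|)), to an integral over (0, 1) whose exact value is available in closed form. The symbols `g`, `h`, `n`, and `x0`, as well as the specific choices g(x) = log(x (1 - x)) and h(x) = 1 + x, are labels and examples introduced here for illustration only.

```python
import math


def exact_integral(n):
    """Exact value of the integral of (1 + x) * (x * (1 - x))**n over (0, 1).
    By symmetry this equals 1.5 * Beta(n + 1, n + 1)."""
    log_beta = 2.0 * math.lgamma(n + 1) - math.lgamma(2 * n + 2)
    return 1.5 * math.exp(log_beta)


def laplace_approx(n):
    """Classical Laplace approximation around the interior maximum x0 = 1/2."""
    x0 = 0.5
    g0 = math.log(x0 * (1.0 - x0))              # g(x0) = log(1/4)
    g2 = -1.0 / x0**2 - 1.0 / (1.0 - x0) ** 2   # g''(x0) = -8
    h0 = 1.0 + x0                               # h(x0)
    return h0 * math.exp(n * g0) * math.sqrt(2.0 * math.pi / (n * abs(g2)))


for n in (10, 50, 200):
    print(n, exact_integral(n) / laplace_approx(n))
```

The printed ratio approaches 1 as n grows, consistent with the asymptotic equivalence that the three steps above are meant to establish.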
Appendix A2 Additional Methodological Details
A2.1 Practical Computation of the Conformity Scores
As detailed in Section 3.1, our method allows flexibility in the choice of the prediction rule , which uniquely determines the conformity scores. In this section, we explore three practical options for the prediction rules and their respective conformity scores.
A2.1.1 Hyper-Cubic Confidence Regions
An intuitive prediction rule, introduced in Section 3.1, is:
with the parameter taking value in . Note that this rule leads to hyper-cubic confidence regions, with constant widths for all users in a group.
The conformity scores corresponding to this rule can be written explicitly, for any , as:
Remark. The function is a strictly increasing function on . Therefore, we can equivalently define the prediction set as follows. Let
(A46) |
and define the alternate confidence set as
with taking value in . The expression in (A46) is more closely related to the typical notation in the conformal inference literature; e.g., see \citetlei2018distribution.
A2.1.2 Hyper-Rectangular Confidence Regions
An alternative type of prediction rule, yielding intervals of varying lengths for different users, involves scaling the hyper-cube defined in Section A2.1.1. This modification may be particularly useful in applications involving count data with wide ranges, where the variance may be expected to increase in proportion to the observed values. We define this linearly-scaled prediction rule as
(A47) |
which leads to confidence regions in the shape of a hyper-rectangle. The corresponding scores are:
A2.1.3 Hyper-Spherical Confidence Regions
The prediction rules described above all result in confidence regions with a hyper-rectangular shape. Alternatively, one can construct a confidence region with a hyper-spherical shape using the following prediction rule, where represents the Euclidean norm:
(A48) |
The corresponding conformity scores are
Note that replacing the Euclidean norm with the max norm in (A48) recovers the hyper-cubic prediction rule.
The concept of a hyper-spherical confidence region is quite rare in the conformal inference literature, where the majority of existing methods focus on constructing confidence sets for a single test point individually. However, when aiming to provide simultaneous coverage for multiple entries, it becomes possible to develop confidence regions of varying geometric shapes.
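To make the three geometries concrete, the sketch below computes one group-level score per shape from a vector of true entries and the corresponding point predictions. It is written in the "alternate" residual-norm form discussed in the remark of Section A2.1.1, so any strictly increasing reparametrization of the scores is deliberately omitted. The function names are hypothetical, and the scaling of each residual by the absolute value of its point prediction in the hyper-rectangular case is an assumption made for illustration, since the exact scaling is given by (A47).

```python
import math


def residuals(true_values, predictions):
    # Entry-wise differences between true values and point predictions.
    return [y - m for y, m in zip(true_values, predictions)]


def score_hyper_cubic(true_values, predictions):
    # Max norm of the residual vector: the same half-width for every user.
    return max(abs(r) for r in residuals(true_values, predictions))


def score_hyper_rectangular(true_values, predictions):
    # Assumed scaling: each residual is divided by |prediction|, so that the
    # half-width grows linearly with the predicted value (predictions != 0).
    return max(abs(y - m) / abs(m) for y, m in zip(true_values, predictions))


def score_hyper_spherical(true_values, predictions):
    # Euclidean norm of the residual vector.
    return math.sqrt(sum(r * r for r in residuals(true_values, predictions)))


if __name__ == "__main__":
    truth, estimate = [3.0, 1.0], [0.0, 5.0]
    print(score_hyper_cubic(truth, estimate))      # residuals are 3 and -4
    print(score_hyper_spherical(truth, estimate))
```

Swapping the norm is the only difference between the cubic and spherical variants, which mirrors the observation that the max norm recovers the hyper-cubic rule.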
A2.2 Efficient Evaluation of the Conformalization Weights
We discuss in more detail here the choice of scaling parameter for the function defined in Equation (23) (Section 4.1). This free parameter controls the location of . Since our Laplace approximation hinges on being not too close to the integration boundary, an intuitive and effective choice is to set so that ; e.g., see \citetfog2007wnchypg. We explain below how to achieve this using the Newton-Raphson algorithm.
Recall Lemma 4, which tells us that the function has a unique stationary point with respect to at some value , and that this stationary point is a global maximum. Therefore, since the function is smooth, the Newton-Raphson algorithm can be applied as follows to find a value of such that . Define
and note that this function is monotonically increasing in . Then it suffices to find such that . Note that is smooth, and , for . It is also clear that the solution of must be greater than . Further, has a unique root in the interval because and .
Thus, it follows from Theorem 2.2 in \citetatkinson1989numerical that the Newton-Raphson algorithm will converge to the root quadratically, for any starting point within the interval . In practice, one can choose as the starting point.
The time complexity of the Newton-Raphson iteration depends on the desired precision level. If the tolerable error is a predetermined small constant, the iteration terminates after a constant number of updates due to quadratic convergence. Evaluating and at any given requires . Hence, solving takes .
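The root-finding step described above can be sketched as follows. This is a generic Newton-Raphson routine applied to a stand-in function, not the function defined in this section: the stand-in `f`, its derivative `f_prime`, the constants `WEIGHTS` and `TARGET`, and the starting point are all hypothetical, chosen only because the stand-in is smooth and monotonically increasing with a single root, as in the setting discussed here.

```python
import math


def newton_raphson(f, f_prime, start, tol=1e-12, max_iter=100):
    """Find a root of f by Newton-Raphson, starting from `start`."""
    x = start
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")


# Hypothetical stand-in problem: a smooth, strictly increasing function.
WEIGHTS = [0.5, 1.0, 2.0]
TARGET = 1.5


def f(d):
    return sum(1.0 - math.exp(-d * w) for w in WEIGHTS) - TARGET


def f_prime(d):
    return sum(w * math.exp(-d * w) for w in WEIGHTS)


root = newton_raphson(f, f_prime, start=0.0)

if __name__ == "__main__":
    print(root, f(root))
```

Each iteration evaluates the function and its derivative once, so the per-iteration cost is that of a single pass over the weights, in line with the complexity argument above.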
A2.3 Computational Shortcuts and Complexity Analysis
A2.3.1 Evaluation of the Conformalization Weights
Evaluating the simplified weights in (29) only involves arithmetic operations and can be carried out for all at a total computational cost roughly of order . To understand this, first note that computing defined in (25), for all with any given and , has cost ; and finding the correct value of and according to (26) has cost , or equivalently no worse than , as explained in Section A2.2.
Next, evaluating for all has cost , because the constant in the denominators of (20) and (21) can be pre-computed at cost , while the remaining terms in (20) and (21) can be evaluated at cost separately for each .
The cost of evaluating the term within the square brackets in (29) for all is . This is achieved by pre-computing factorials up to since is upper-bounded by for any . Then for each , computing binomial coefficients, given the pre-computed factorials, takes constant time, and the remaining term in the brackets requires . Putting everything together, evaluating the conformalization weights in (29) has cost for all .
A2.3.2 Cost Analysis of Algorithm 1
Analysis for a single test group. The cost of computing a confidence region for a single test group is , as shown below.
• Training the black-box matrix completion model has cost .
• After the black-box model is trained, the cost of computing scores for all is .
• The cost of computing for all is , as explained in Section A2.3.1.
• After the conformalization weights are computed, the cost of computing is . This is because sorting the scores for all has a worst-case cost of , while it takes to find the weighted quantile based on and the sorted scores.
Therefore, the overall cost is .
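The last step in the list above (sorting the scores once, then scanning them to find the weighted quantile) can be sketched as follows. This assumes the usual weighted conformal construction, in which the normalized weight of the test point is attached to a point mass at infinity, so that the quantile is infinite whenever the calibration weights alone do not reach the level 1 - alpha. The function name and argument layout are hypothetical.

```python
import math


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Smallest calibration score whose cumulative normalized weight is at
    least 1 - alpha; returns infinity if that level is never reached."""
    total = sum(weights) + test_weight
    # Sort once: O(n log n). The scan below is O(n).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    cumulative = 0.0
    for i in order:
        cumulative += weights[i] / total
        if cumulative >= 1.0 - alpha:
            return scores[i]
    # The remaining mass sits on the test point, i.e., at +infinity.
    return math.inf


if __name__ == "__main__":
    scores = [3.0, 1.0, 4.0, 2.0]
    weights = [1.0, 1.0, 1.0, 1.0]
    print(weighted_conformal_quantile(scores, weights, 1.0, alpha=0.5))
    print(weighted_conformal_quantile(scores, weights, 1.0, alpha=0.1))
```

When several test groups share the same calibration scores, the sorted order can be computed once and reused, so only the linear scan needs to be repeated for each new set of weights.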
Analysis for distinct test groups. The cost of computing confidence regions for distinct test groups is , as shown below.
• Training the black-box matrix completion model has cost , since the model only needs to be trained once.
• The cost of computing conformity scores for all is , since the calibration groups are the same for any new test group.
• The cost of computing for all is , as explained in Section A2.3.1.
• After the conformalization weights are computed, the cost of computing the confidence sets for all test groups is . Sorting the scores for all has a worst-case cost of , which only needs to be performed once. For any , it takes to find the weighted quantile given weights and the sorted scores.
Therefore, the overall cost is .
A2.3.3 Cost Analysis of Algorithm 2
• For each column, the cost of computing is , and the cost of sampling indices uniformly at random is . Hence the cost of sampling the pruned indices for all columns is , which simplifies to by the fact that .
• Initializing given the pruned indices has a cost of .
• After is initialized, the cost of sampling the th calibration group (and updating and ) is , for each . Hence sampling all calibration groups takes .
Therefore, Algorithm 2 has time complexity of , and it does not need to be repeatedly applied when dealing with distinct groups involving the same matrix.
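The steps analyzed above can be sketched in a few lines. The sketch rests on assumptions that are not spelled out in this section: pruning is taken to mean discarding, within each column, a uniformly random subset of observed entries so that the number kept is a multiple of the group size, and each calibration group is formed by drawing a first entry uniformly from all remaining entries and then completing the group from the same column. The function name, the data layout (a list of distinct `(row, col)` pairs), and the `seed` argument are hypothetical. For readability, the pool of remaining entries is rebuilt at every step, so this version does not attain the time complexity stated above.

```python
import random


def sample_calibration_groups(observed, group_size, num_groups, seed=0):
    """Sample disjoint, same-column groups of observed (row, col) entries."""
    rng = random.Random(seed)

    # Index the observed entries by column.
    by_column = {}
    for row, col in observed:
        by_column.setdefault(col, []).append((row, col))

    # Pruning (assumed form): keep a random multiple of group_size per column.
    available = {}
    for col, entries in by_column.items():
        keep = (len(entries) // group_size) * group_size
        if keep > 0:
            available[col] = rng.sample(entries, keep)

    groups = []
    for _ in range(num_groups):
        pool = [entry for entries in available.values() for entry in entries]
        if not pool:
            break  # no entries left to form another group
        first = rng.choice(pool)  # implicitly selects the column
        col = first[1]
        others = [entry for entry in available[col] if entry != first]
        group = [first] + rng.sample(others, group_size - 1)
        groups.append(group)
        chosen = set(group)
        available[col] = [e for e in available[col] if e not in chosen]
    return groups


if __name__ == "__main__":
    obs = [(r, 0) for r in range(5)] + [(r, 1) for r in range(3)] + [(0, 2)]
    print(sample_calibration_groups(obs, group_size=2, num_groups=3))
```

Because every column retains a multiple of the group size after pruning, a column that still has entries always has enough of them to complete a group.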
A2.4 Estimation of the Sampling Weights
We describe here a method, inspired by \citetgui2023conformalized, to estimate empirically the sampling weights for our sampling model in (1), leveraging the available matrix observations indexed by . In general, this estimation problem is made feasible by introducing the assumption that has some lower-dimensional structure that can be summarized for example by a parametric model. The approach suggested by \citetgui2023conformalized assumes that the weight matrix is low-rank. For simplicity, we follow the same approach here, although our framework could also accommodate alternative estimation techniques in situations where different modeling assumptions about may be justified.
Suppose the sampling weights follow the parametric model
where is a matrix with rank and bounded infinity norm; i.e., , for some pre-defined constant . Then, if each matrix entry is independently observed (i.e., included in ) with probability , i.e.,
(A49) |
then the log-likelihood of can be written as
(A50) |
where . This suggests estimating by solving
subject to: | |||
where is the nuclear norm. Finally, having obtained , the estimated sampling weights for each are given by
(A51) |
In practice, the numerical experiments described in this paper apply this estimation procedure using the default choices of the parameters and suggested by \citetgui2023conformalized.
It is worth remarking that the independent Bernoulli observation model (A49) underlying this maximum-likelihood estimation approach differs from the weighted sampling without replacement model (1) that we utilize to calibrate our simultaneous conformal inferences. This discrepancy, however, is both useful and unlikely to cause issues, as explained next. On the one hand, the sampling without replacement model is essential to capture the structured nature of our group-level test case and of the calibration groups . On the other hand, sampling without replacement would make the likelihood function in (A50) intractable, unnecessarily hindering the estimation process. Fortunately, however, the interpretation of the sampling weights remains largely consistent across the models (1) and (A49), which justifies the use of the estimated weights in (A51) for the purpose of calibrating conformal inferences under the model defined in (1).
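For concreteness, the tractable objective under the independent Bernoulli model can be sketched as below. The sketch assumes a logistic link between the low-rank matrix and the observation probabilities, which is an assumption made here for illustration; the exact parametric form is the one given in the model above. The function names are hypothetical, and the nuclear-norm and infinity-norm constraints of the optimization problem are not implemented.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bernoulli_log_likelihood(logits, observed_mask):
    """Log-likelihood of independent Bernoulli observations.

    logits[r][c] is the (assumed) logit of the probability that entry (r, c)
    is observed, and observed_mask[r][c] is True if it was observed.
    """
    total = 0.0
    for logit_row, mask_row in zip(logits, observed_mask):
        for u, is_observed in zip(logit_row, mask_row):
            # log sigmoid(u) = -softplus(-u); log(1 - sigmoid(u)) = -softplus(u)
            total += -softplus(-u) if is_observed else -softplus(u)
    return total


if __name__ == "__main__":
    mask = [[True, False], [False, True]]
    print(bernoulli_log_likelihood([[0.0, 0.0], [0.0, 0.0]], mask))
    print(bernoulli_log_likelihood([[2.0, -2.0], [-2.0, 2.0]], mask))
```

The objective decomposes into a sum over matrix entries precisely because the entries are treated as independent, which is what sampling without replacement would break.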
Appendix A3 Additional Empirical Results
A3.1 Additional Experiments with Synthetic Data
A3.1.1 Heterogeneous Test Sampling Weights
This section describes experiments in which the test group is sampled according to a model (3) with heterogeneous weights . As explained in Section 2, the heterogeneous nature of these weights makes it feasible to ensure valid coverage conditional on interesting features of . Therefore, the following experiments demonstrate the ability of our method to smoothly interpolate between marginal and conditional coverage guarantees, giving practitioners flexibility to up-weight or down-weight different types of test cases, as needed.
The ground-truth matrix is generated according to the random factorization model defined in Equation (33), with rank . We observe entries of this matrix, sampled based on the model in (1) with uniform weights ; these are indexed by , whose complement is . Algorithm 1 is then applied as in the previous experiments, using calibration groups and allocating the remaining observations for training. For the latter purpose, we rely on the usual alternating least squares approach, with hypothesized rank , and thus obtain a point estimate and its corresponding factor matrices and , such that .
The weights for in (3) are based on an oracle procedure that leverages perfect knowledge of and to construct a sampling process that over-represents portions of the matrix for which the point estimate is less accurate. This process is controlled by a parameter , which determines the heterogeneity of . In the special case of , the test weights become for all matrix entries, recovering the experimental setup considered earlier in Section 5.1.1. By contrast, smaller values of tend to increasingly over-sample portions of the matrix for which the point estimate is less accurate. We refer to Appendix A3.1.2 for details about this construction of the test sampling weights, which gives rise to an interesting and particularly challenging experimental setting in which attaining high coverage is intrinsically difficult.
To highlight the importance of correctly accounting for the heterogeneous nature of the test sampling weights , in these experiments we compare the performance of joint confidence regions obtained with two alternative approaches. The first approach consists of applying Algorithm 1 based on the correct values of the data-generating weights and . The second approach consists of applying Algorithm 1 based on the correct values of the data-generating weights but incorrectly specified weights for all . In both cases, the nominal significance level is , and the methods are evaluated based on 100 random test groups sampled from , according to the model in (3) with the weights defined in Equation (A54) within Section A3.1.2. All results are averaged over 300 independent experiments.
Figure A6 compares the performances of the two aforementioned implementations of our method as a function of the group size , for different values of the parameter . The results show that our method applied with the correct weights always achieves the desired 90% simultaneous coverage, as predicted by the theory. By contrast, using mis-specified uniform test sampling weights leads to lower coverage than expected, especially for lower values of the parameter . Figure A7 provides an alternative but qualitatively consistent view of these findings, varying the parameter separately for different values of the group size .
It is interesting to note from Figures A6 and A7 that our method is sometimes slightly over-conservative when applied with highly heterogeneous test sampling weights (corresponding to small values of the parameter ). This phenomenon is due to the unavoidable challenge of constructing valid confidence regions in the presence of strong distribution shifts, and it can be understood more precisely as follows. Smaller values of result in a stronger distribution shift between the observed data in and , increasing the likelihood that the weighted empirical quantile defined in (10) might become infinite, leading to trivially wide confidence regions. In those (relatively rare) cases in which diverges, to avoid numerical issues we simply set equal to , the highest calibration conformity score. Fortunately, as shown explicitly in Figure A8, this issue is not very common (it is observed in fewer than 2.5% of the cases), which explains why our method appears to be only slightly over-conservative in Figures A6 and A7.
A3.1.2 Additional Details for Section A3.1.1
The sampling weights for utilized in the experiments of Section A3.1.1 are defined based on the following oracle procedure, which leverages perfect knowledge of and to construct a sampling process that over-represents portions of the matrix for which the point estimate is less accurate. This gives rise to a particularly challenging experimental setting. For each entry , define the latent feature vector , where and are the -th row of and the -th row of , respectively. Let also denote a subset containing of the matrix indices in , chosen uniformly at random.
The values of and indexed by are utilized by the oracle to construct with an approach inspired by \citetcauchois2020knowing and \citetromano2020classification. For any fixed , define the worst-slab estimation error,
(A52) |
where, for any and , the subset is defined as
(A53) |
Intuitively, is a subset (or slab) of the matrix entries in characterized by a direction in the latent feature space and two scalar thresholds . Accordingly, is the average absolute residual between and evaluated for the entries within , after selecting the worst-case subset containing at least a fraction of the observations within .
In practice, the optimal (worst-case) choice of in (A52) is approximated by fitting an ordinary least squares regression model to predict the absolute residuals as a linear function of the latent features . Then, the corresponding optimal values of in (A52) are approximated through a grid search, for a fixed value of the parameter .
Finally, the test sampling weights are given by
(A54) |
where denotes the density function of the Gaussian distribution with mean and variance . This density function is introduced for smoothing purposes, setting . These sampling weights enable us to select test groups from indices that predominantly fall within the worst-slab region for which estimates least accurately. Intuitively, attaining valid coverage for this portion of the matrix should be especially challenging.
A3.2 Investigation of the Coverage Upper Bound
In this section, we investigate in more detail the upper coverage bound for our method established by Theorem 2, which is equal to . Ideally, a small expected value in this equation would guarantee that our conformal inferences are not too conservative. However, given that it would be unfeasible to evaluate this expected value analytically, we rely on a Monte Carlo numerical study.
We begin by focusing on groups of size and consider for simplicity matrices with an equal number of rows and columns; i.e., . We simulate the observation process by sampling matrix entries without replacement according to the model defined in (1), with
with a scaling parameter . Note that this is the same choice of sampling weights utilized in the experiments of Section 5.1.2. For simplicity, the test group is sampled from the model defined in (3) using weights exhibiting the same patterns as . Then, the conformalization weights for all are computed by applying Algorithm 1, and varying the number of calibration groups as a control parameter. Finally, we estimate by taking the empirical average of over 10 independent experiments.
Figure A9 [left] reports on the results of these experiments as a function of . The results show that our coverage upper bound approaches roughly at rate , as one would generally expect in the case of standard conformal inferences based on exchangeable data \citepvovk2005algorithmic. This is consistent with our empirical observations that Algorithm 1 is typically not too conservative in practice. Figure A9 [right] reports on the results of additional experiments in which the group size is varied, while keeping the number of calibration groups fixed to . The results show that the coverage upper bound tends to become more conservative as the group size increases, reflecting the intrinsic higher difficulty of producing valid simultaneous conformal inferences for larger groups.
A3.3 Additional Results for Section 5.1.1
A3.4 Additional Results for Section 5.1.2
A3.5 Additional Results for Section 5.2
Appendix A4 Mathematical Proofs
A4.1 A General Quantile Inflation Lemma
Proof of Lemma 1.
The proof follows the same strategy as that of \citettibshirani-covariate-shift-2019. Let denote the event that , for some possible realization of , and let for all . By the definition of conditional probability, for each ,
where is a permutation of . In other words,
where denotes a point mass at . This implies that
which is equivalent to
Finally, marginalizing over leads to
This is equivalent to the desired result because, by Lemma A5,
∎
Lemma A5 (also appearing implicitly in \citettibshirani-covariate-shift-2019).
Consider random variables and some weights such that and . Then, for any ,
Proof of Lemma A5.
This result was previously utilized by \citettibshirani-covariate-shift-2019 and a proof is included here for completeness. It is straightforward to establish one direction of the result, namely
because, almost surely, , and hence
To prove the other direction, suppose . By definition of the quantile function, we can write without loss of generality that , where is defined such that
where are the order statistics of . Therefore, , and re-assigning does not change . This means that , leading to . Thus, we have shown that
∎
A4.2 Conformal Inference with Structured Calibration
Proof of Proposition 1.
This result is a direct consequence of Proposition A4, which characterizes the joint distribution of conditional on and . It is easy to see from (A56) that this distribution is invariant to permutations of . ∎
Proposition A4.
Consider the same setting of Proposition 1. Let and denote arbitrary realizations of and , respectively. Let be any sequence of -groups involving elements of such that
Define also , the unordered collection of matrix entries indexed by the groups , noting that . Further, define
(A55) |
Intuitively, represents the pruned missing indices in column , is the number of indices in corresponding to entries in column , while is the number of calibration groups in column . Note that all quantities in (A55) are uniquely determined by and . Then,
(A56) |
Proof of Proposition A4.
First, note that
(A57) |
where the first term on the right-hand-side above was written explicitly using the sequential sampling characterization of in (4).
Next, we focus on the second term on the right-hand-side of (A57):
(A58) |
where the last equality above follows from the fact that is uniquely determined by , and .
The first term on the right-hand-side of (A58) is given by Lemma A6:
(A59) |
Note that (A59) implies that, conditional on and , the distribution of does not depend on the order of these calibration groups.
Next, we focus on the second term on the right-hand-side of (A58), namely
(A60) |
The numerator of (A60) is
(A61) |
where denotes the remainder of the integer division , and . Above, the denominator does not need to be simplified because it only depends on and .
∎
Lemma A6.
Under the same setup as in Proposition A4,
(A62) |
Proof of Lemma A6.
We prove this result by induction on the number of calibration groups, . For ease of notation, we will denote the column of the -th calibration group as , for any ; that is, for all . In the base case where ,
Now, for the induction step, suppose Equation (A62) holds for . Then,
where the last equality above follows because for all , while . ∎
A4.3 Characterization of the Conformalization Weights
Proof of Lemma 3.
Recall from Proposition A4 that
for some permutation-invariant function and
(A63) |
with
and, for all ,
Therefore, Lemma 2 can be applied, with weights proportional to
(A64) |
In order to compute the right-hand-side of (A64), one must understand how (A63) changes when is swapped with , for any fixed . This can be done easily, one piece at a time.
To begin, it is immediate to see that swapping with results in being replaced by . Similarly, are replaced by , defined as
and, for all ,
here, for any , denotes the column to which belongs; i.e., , where is the column of the th entry in .
To understand the notation in the equations above, recall that is a realization of the pruned missing set , and is the number of missing entries in column . In the parallel universe where is swapped with , the realization of the missing indices is denoted as , and the realization of the pruned missing set is , where . Similarly, denotes entries belonging to column in the imaginary pruned missing set. Thus, and can be interpreted as normalized sampling weights for the imaginary test group .
Next, let and denote the numbers of observations in column from the sets and , respectively. Define also and , the corresponding numbers of observations remaining in column after the random pruning step of Algorithm 2. Let denote the number of calibration groups in column . Similarly, let denote the corresponding imaginary quantity obtained by swapping the calibration group with the test group ; i.e.,
Further, swapping with results in , , and being replaced by , , and , respectively. Therefore,
(A65) |
Now, we will further simplify the expression in Equation (A65) to facilitate the practical computation of these weights.
Consider the first term on the right-hand-side of Equation (A65), namely
This quantity depends on the pruned set of missing indices and, by definition,
where
while, for all ,
In the equations above, with a slight abuse of notation, we denoted the set of missing indices in column excluding those in the group as .
Next, let us consider the second term on the right-hand-side of Equation (A65), namely
This evaluates the probability of observing a particular realization of . Since the pruned indices are chosen uniformly at random, this quantity only depends on the number of observations within each column before and after pruning, namely, and . By definition, we have
(A66) |
The above equivalence follows from the fact that swapping with only affects the observed indices in columns and , while all other indices remain the same. In particular, upon swapping, column will contain fewer observations, because is treated as the unobserved test group, and column will contain more observations, because is treated as the calibration group. Similarly,
(A67) |
Combining (A66) and (A67), we can rewrite the second term in (A65) as:
(A68) |
Then, the third term on the right-hand-side of (A65) is
This relates to the probability of choosing a specific realization of the calibration groups given the observed indices remaining after random pruning. To aid the simplification of this term, we point out the following relation between and , i.e., the number of calibration groups from each column in the imagined observed set and original observed set respectively:
(A69) |
Then, using (A67) and (A69), we can write:
(A70) |
Above, the last equality follows from the following simplification based on (A69) and (A67), assuming that :
∎
A4.4 Finite-Sample Coverage Bounds
Proof of Theorem 1.
Recall that, by construction,
Therefore, Theorem 1 follows directly by combining Proposition 1, Lemma 2, and the characterization of the conformalization weights given by Equation (19).
∎
Proof of Theorem 2.
Let denote the event that , , and , for some possible realizations of , of , and of . Let also indicate the realization of corresponding to the event , for all . Applying the definition of conditional probability, as in the proof of Lemma 2, we can see that , with the weights given by (19). This implies that
and further, by taking an expectation with respect to the randomness in ,
∎
A4.5 Efficient Evaluation of the Conformalization Weights
Proof of Proposition 2.
We begin by focusing on the special case of . In that case, Equation (24) becomes a special case of the results derived for the multivariate Wallenius’ noncentral hypergeometric distribution \citepwallenius1963biased, chesson1976non, fog2007wnchypg. While the original problem addresses biased sampling without replacement from an urn containing colored balls, our model in (1) can be equivalently interpreted as drawing samples without replacement from an urn comprising balls. Each ball is uniquely labeled with a color represented by , and it is drawn with a probability proportional to ; e.g., see Section 2. Therefore, from Equation (19) in \citetfog2007wnchypg:
(A72) |
Next, we turn to proving Equation (24) for a general .
For any , imagine an alternative world in which is swapped with . Let indicate the cumulative weight of all missing entries analogous to in the aforementioned imaginary world. It is easy to see that , where . Therefore, we can express the probability using Equation (A72) in the imaginary world:
where, for any ,
(A73) |
∎
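The urn interpretation used in the proof above can be checked numerically on a toy example. The sketch below compares two ways of computing the probability that a given set of uniquely labeled balls is the one drawn under weighted sampling without replacement: direct enumeration over all draw orders, and the integral representation of Wallenius' noncentral hypergeometric distribution with one ball per color, namely the integral over (0, 1) of the product, over drawn balls, of one minus t raised to the ball's weight divided by the total weight of the undrawn balls. The integral is evaluated exactly by inclusion-exclusion. The function names and the toy weights are hypothetical, and the scaling parameter discussed in Section A2.2 is omitted for simplicity.

```python
import itertools


def prob_by_enumeration(weights, drawn):
    """Probability that exactly the balls in `drawn` are sampled without
    replacement, summing over all possible draw orders."""
    total = 0.0
    for order in itertools.permutations(drawn):
        remaining = sum(weights)
        prob = 1.0
        for i in order:
            prob *= weights[i] / remaining
            remaining -= weights[i]
        total += prob
    return total


def prob_by_integral(weights, drawn):
    """Wallenius-type integral, expanded term by term.

    Each term (-1)^|T| * t^(sum of w_i / delta over T) integrates over (0, 1)
    to (-1)^|T| / (1 + sum of w_i / delta over T).
    """
    drawn = list(drawn)
    # delta: total weight of the balls left in the urn (must be positive).
    delta = sum(w for i, w in enumerate(weights) if i not in drawn)
    total = 0.0
    for size in range(len(drawn) + 1):
        for subset in itertools.combinations(drawn, size):
            exponent = sum(weights[i] for i in subset) / delta
            total += (-1) ** size / (1.0 + exponent)
    return total


if __name__ == "__main__":
    w = [1.0, 2.0, 3.0]
    print(prob_by_enumeration(w, (0, 1)), prob_by_integral(w, (0, 1)))
```

The enumeration grows factorially with the number of drawn balls, which is why an integral representation, followed by an approximation such as Laplace's method, is attractive at realistic problem sizes.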
Proof of Lemma 4.
Recall that the logarithm of takes the form
(A74) |
while its first derivative with respect to is:
Consider the function ,
which is strictly decreasing in for all , because for all . Further, if ,
Then, by the intermediate value theorem, must have exactly one zero for , as long as . In turn, this implies that has exactly one zero for , as long as . Further, the unique zero of on must be the unique maximum of , because, under ,
∎
A4.6 Consistency of the Generalized Laplace Approximation
We begin by stating a formal version of Theorem 3, the result providing the motivation to apply the Laplace method to Equation (27).
Theorem A4.
Let be a sequence of i.i.d. random variables drawn from a distribution with support on the open interval . Consider a sequence of mutually independent Bernoulli random variables , where each . Define and
(A75) |
Above, is the unique root of the function
(A76) |
in the interval , where is the logarithm of , namely
(A77) |
Then, .
Further, consider a sequence of functions , where each and , for some constant and all . Suppose there exists some such that for all and for all . Then, it holds that
(A78) |
or equivalently,
(A79) |
Proof of Theorem A4.
The preliminary part of this result is proved in Appendix A2.2, where we show that selecting the scaling parameter as the unique root of the function in (A76) leads to .
Our main objective is to approximate the integral
leveraging a suitable extension of the classical Laplace method reviewed in Appendix A1.4. To this end, we begin by applying a Taylor series expansion around , including Lagrange remainder terms; this leads to:
for some real numbers , and
By definition of , we know that . Next, we need to establish a suitable bound for . This task is complicated by the fact that we do not have an explicit expression for . Fortunately, however, it is possible to obtain sufficiently tight lower and upper bounds for .
Lemma A7.
In the setting of Theorem A4, for any ,
(A80) |
Further, in the limit of , it holds almost-surely that
(A81) |
where
The bounds on provided by Lemma A7 in turn allow us to bound away from 0 almost surely for large . This gives us the necessary ingredients to tackle the approximation of the integral. To this end, note that
since . Applying the change of variables and defining , the integral becomes
(A82) | ||||
(A83) |
Note that now and depend implicitly on due to the change of variables (they previously depended on ). We will now separately analyze each term in (A82).
The limit of the first integral on the right-hand-side of (A82) can be found by leveraging the following lower bound for :
which leads to
(A84) |
By contrast, the third integral on the right-hand-side of (A82) is asymptotically negligible. Recall that, by assumption, satisfies for all and ; therefore,
(A85) |
We continue with the analysis of the second and fourth terms on the right-hand-side of (A83), which are more involved. The remainder of the second-order Taylor expansion, previously denoted as , can be expanded back into an infinite power series, given the smoothness of within the interval . This expansion is expressed as:
(A86) |
where represents the -th derivative of evaluated at , and this series converges for all and all .
Note that the -th derivative of at can be written as:
(A87) |
To control , we will show that (A87) is bounded for large . Lemma A7 tells us that the first term in (A87) is bounded by constants almost surely. For the second term, define:
(A88) |
which is continuous for and . Given that the interval is compact, we can use the maximum value in to bound the function . Thus, we obtain:
(A89) |
Then, by the strong law of large numbers,
(A90) |
leading to:
(A91) |
Similarly, we can also show:
(A92) |
Therefore, the second term in (A87) is also bounded by constants almost surely. As a result, the whole expression in (A87) is bounded by constants almost surely for large .
Combining the above results, we conclude that, for all ,
(A93) |
We can now proceed to analyze the integral involving the exponential of the remainder term , which appears in the second term on the right-hand-side of (A82). Specifically,
(A94) |
where . Below, we show that all terms on the right-hand-side of (A94) are finite and converge to 0 as increases. To this end, let us start from , noting that
where the simplification arises because all odd moments of a standard normal are zero, and the even moments follow from the moment generating function. Given (A93), it follows that as . Similarly, it also can be shown that all higher moments of are finite and converge to 0 as increases. Consequently, we conclude that
Therefore, the limit of the second term in the Taylor error expansion (A83) is
(A95) |
The fourth term in the Taylor error expansion (A83) vanishes similarly because is bounded; i.e.,
(A96) |
∎
Proof of Lemma A7.
It is immediate from (A76) that . To obtain an upper bound, recall that , for the function in (A76). This implies that
where denotes the number of successful Bernoulli trials. Therefore,
(A97) |
The upper bound in (A97) also allows us to find a tighter lower bound. Note that is a decreasing function of , and for all . Therefore, implies
Combining this result with (A97), we obtain the following lower bound:
(A98) |
This completes the proof of (A80).
To prove the second part, we apply the strong law of large numbers, by which as , and as . Therefore, by the continuous mapping theorem,
where the expected values and do not depend on . Therefore, in the limit of , it holds almost-surely that
Finally, recall that as . Consequently, we have, almost surely, that: as . ∎