OmniInput: A Model-centric Evaluation Framework through Output Distribution

Weitang Liu
Department of Computer Science Engineering
University of California, San Diego
La Jolla, CA 92093
Ying Wai Li
Los Alamos National Laboratory
Los Alamos, NM 87545
Tianle Wang
Department of Computer Science Engineering
University of California, San Diego
La Jolla, CA 92093
Yi-Zhuang You
Department of Physics
University of California, San Diego
La Jolla, CA 92093
Jingbo Shang
Department of Computer Science Engineering
University of California, San Diego
La Jolla, CA 92093

Abstract

We propose a novel model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model’s predictions on all possible inputs (including human-unrecognizable ones), which is crucial for AI safety and reliability. Unlike traditional data-centric evaluation based on pre-defined test sets, the test set in OmniInput is self-constructed by the model itself and the model quality is evaluated by investigating its output distribution. We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model, which, after selective annotation, can be used to estimate the model’s precision and recall at different output values and a comprehensive precision-recall curve. Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models, especially when their performance is almost the same on pre-defined datasets, leading to new findings and insights for how to train more robust, generalizable models.

1 Introduction

A safe, reliable AI/ML model deployed in the real world should be able to make reasonable predictions on all possible inputs, including uninformative ones. For instance, an autonomous vehicle's image processing system might encounter carefully designed backdoor attack patterns (which may look like noise) [35, 40]; these can potentially lead to catastrophic accidents if they interfere with stop sign or traffic light classification.

Existing evaluation frameworks are mostly, if not all, data-centric, meaning that they are based on pre-defined, annotated datasets. The drawback is the lack of a comprehensive understanding of the model’s fundamental behaviors over all possible inputs. Recent literature showed that a great performance on a pre-defined (in-distribution) test set cannot guarantee a strong generalization to different regions in the input space, such as out-of-distribution (OOD) test sets [38, 21, 22, 24, 32, 33] and adversarial test sets [62, 53, 43, 29]. One possible reason for poor generalization in the open-world setting is overconfident prediction [46], where the model could wrongly predict OOD input as in-distribution objects with high confidence.

Figure 1: An overview of our novel OmniInput evaluation framework. (a) Use an efficient sampler, e.g., GWL [39], to obtain the output distribution $\rho(z)$ and sample representative inputs; (b) annotate representative inputs; (c) estimate the precision and recall at different thresholds $\lambda$, where $r(z)$ denotes the precision of the model within the bin of output value $z$; (d) construct a precision-recall curve as the evaluation result.

Inspired by the evaluation frameworks for generative models [23, 55, 45, 54, 4], we propose a novel model evaluation approach from a model-centric perspective: after the model is trained, we construct the test set from the model’s self-generated, representative inputs corresponding to different model output values. We then annotate these samples, and estimate the model performance over the entire input space using the model’s output distribution. While existing generative model evaluation frameworks are also model-centric, we are the first to leverage the output distribution as a unique quantity to generalize model evaluation from representative inputs to the entire input space. To illustrate our proposed novel evaluation framework OmniInput, we focus on a binary classification task of classifying if a picture is digit 1 or not. As shown in Fig. 1, it consists of four steps:

(a) We employ a recently proposed sampler to obtain the output distribution $\rho(z)$ of the trained model (where $z$ denotes the output value of the model) over the entire input space [39] and efficiently sample representative inputs from different output value (e.g., logit) bins. The output distribution is a histogram counting the number of inputs that lead to the same model output. In the open-world setting without any prior knowledge of the samples, all possible inputs should appear equally.

(b) We annotate the sampled representative inputs to finalize the test set, e.g., rate how likely the picture is digit 1 using a score from 0 to 1. In data-centric evaluations, the pre-defined test set is typically human-annotated as well. Our experiments show that 40 to 50 human annotations per output bin are enough for a converged precision-recall curve (Fig. 4), hence the human involvement required is significantly smaller in our method.

(c) We compute the precision for each bin as $r(z)$, then estimate the precision and recall at different threshold values $\lambda$. When aggregating the precision across different bins, a weighted average of $r(z)$ by the output distribution $\rho(z)$ is required, i.e., $\frac{\sum_{z\geq\lambda}r(z)\cdot\rho(z)}{\sum_{z\geq\lambda}{\rho(z)}}$. See Sec. 2.2 for details.

(d) We finally put together the precision-recall curve for a comprehensive evaluation of the model performance over the entire input space.

In OmniInput, the representative inputs are sampled solely from the model itself, eliminating possible human biases introduced by the test data collection process. The resulting precision-recall curve can help determine the limits of the model in real-world deployment. The overconfident prediction issue can also be quantified precisely: it manifests as low precision when the threshold $\lambda$ is high.

Our OmniInput framework enables a more fine-grained comparison between models, especially when their performance is almost the same on the pre-defined datasets. Take the MNIST dataset as an example: many models (e.g., ResNet, CNN, and multi-layer perceptron (MLP)) trained by different methods (e.g., textbook cross-entropy (CE), CE with (uniform noise) data augmentation, and an energy-based generative framework) can all achieve very high or nearly perfect performance. Our experiments using OmniInput reveal, for the first time, the differences in the precision-recall curves of these models over the entire input space and provide new insights. They include:

• The architectural difference between MLP and CNN, when training with the CE loss and the original training set, can lead to significant differences in precision-recall curves. CNN prefers images with a dark background as representative inputs of digit 1, while MLP prefers zeros with an inverted background as digit 1.

• Different training schemes used on the same ResNet architecture can lead to different performance. Adding noise to the training set in general leads to significantly better precision and recall than using energy-based generative models; however, the latter leads to samples with better visual diversity. These results suggest that combining the generative and classification objectives may be the key for the model to learn robust classification criteria for all possible samples.

Additionally, we have evaluated DistilBERT for sentiment classification and ResNet on CIFAR (binary classification) using OmniInput. Our results indicate a significant number of overconfident predictions, a strong suggestion of poor performance in the entire input space. It is worth mentioning that these findings are specific to the models we trained. Thus, this is not a conclusive study of the differences of the models with different training methods and architectures, but a demonstration of how to use our OmniInput framework to quantify the performance of the models and generate new insights for future research. The contributions of this work are as follows:

• We propose to evaluate AI/ML models by considering all the possible inputs with equal probability, which is crucial to AI safety and reliability.

• We develop a novel model-centric evaluation framework, OmniInput, that constructs the test set by representative inputs, and leverages output distribution to generalize the evaluation assessment from representative inputs to the entire input space. This approach largely eliminates the potential human biases in the test data collection process and allows for a comprehensive understanding and quantification of the model performance.

• We apply OmniInput to evaluate various popular models paired with different training methods. The results reveal new findings and insights for how to train robust, generalizable models.

2 The OmniInput Framework

In this section, we present a detailed background on sampling the output distribution across the entire input space. We then propose a novel model-centric evaluation framework OmniInput in which we derive the performance metrics of a neural network (binary classifier) from its output distribution.

2.1 Output Distribution and Sampler

Output Distribution. We denote a trained binary neural classifier parameterized by $\theta$ as $f_{\mathbf{\theta}}:\mathbf{x}\rightarrow z$, where the training inputs $\mathbf{x}$ are drawn from the training set $\Omega_{T}\subseteq\{0,...,N\}^{D}$, and $z\in\mathbb{R}$ is the output of the model. In our framework, $z$ represents the logit and each of the $D$ pixels takes one of the $N+1$ values.

The output distribution represents the frequency count of each output logit $z$ over the entire input space $\Omega=\{0,...,N\}^{D}$. In our framework, following the principle of equal a priori probabilities, we assume that the input samples within $\Omega$ follow a uniform distribution. This assumption is based on the notion that every sample in the entire input space holds equal importance for the evaluation of the model. Mathematically, the output distribution, denoted by $\rho(z)$, is defined as:

$$\rho(z)=\sum_{\mathbf{x}\in\Omega}\delta(z-f_{\mathbf{\theta}}(\mathbf{x})),$$

where $\delta$ is the Dirac delta function.
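For a very small input space, the definition of $\rho(z)$ can be evaluated by brute force, which may help build intuition before turning to samplers. The sketch below assumes a hypothetical toy "classifier" (a fixed linear map over $D=8$ binary pixels, i.e., $N=1$) that is not one of the models in this paper; the names `toy_logit` and `output_distribution` are illustrative only.

```python
from collections import Counter
from itertools import product

# Hypothetical toy "classifier" (an assumption for illustration): a fixed
# linear map from D = 8 binary pixels to an integer logit.
WEIGHTS = [3, -2, 1, 4, -1, 2, -3, 1]
BIAS = -2

def toy_logit(x):
    """Return the logit z = w . x + b for one input x."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS

def output_distribution(n_values=2, dim=len(WEIGHTS)):
    """Count, for every logit z, how many inputs in {0..N}^D map to it."""
    rho = Counter()
    for x in product(range(n_values), repeat=dim):
        rho[toy_logit(x)] += 1
    return rho

rho = output_distribution()
total = sum(rho.values())  # every input is counted once: (N+1)^D = 2^8
for z in sorted(rho):
    print(f"z={z:+d}  rho(z)={rho[z]}")
```

Enumeration is only feasible here because the toy space has 256 inputs; for image-sized spaces the count must be estimated by sampling, as described next.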

Samplers. The sampling of an output distribution finds its roots in physics, particularly in the context of the sampling of the density of states (DOS) [66, 65, 11, 26, 36, 74], but its connection to ML was revealed only recently [39].

The Wang–Landau (WL) algorithm [66] aims to sample the output distribution $\rho(z)$ which is unknown in advance. In practical implementations, the “entropy” (of discretized bins of $z$ ), $\tilde{S}(z)=\log\tilde{\rho}(z)$ , is used to store the instantaneous estimation of the ground truth $S(z)=\log\rho(z)$ . The WL algorithm leverages the reweighting technique, where the sampling weight $w(\mathbf{x})$ is inversely proportional to the instantaneous estimation of the output distribution:

$$w(\mathbf{x})\propto\frac{1}{\tilde{\rho}(f_{\theta}(\mathbf{x}))}. \tag{1}$$

When the approximation $\tilde{\rho}(z)$ converges to the true value $\rho(z)$ , the entire output space would be sampled uniformly.
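The reweighting idea can be sketched on a toy problem. The code below is a minimal, textbook-style Wang–Landau loop, not the GWL sampler used in this paper: it assumes a hypothetical linear "classifier" over 8 binary pixels, a single-pixel-flip proposal, and illustrative settings (`flatness`, `check_every`, `ln_f_final`) chosen only for this example. Because the toy space is tiny, the estimate can be checked against exact enumeration.

```python
import math
import random
from itertools import product

# Hypothetical toy "classifier" (an assumption, not from the paper): a linear
# map from D = 8 binary pixels to an integer logit, so each logit is one bin.
WEIGHTS = [3, -2, 1, 4, -1, 2, -3, 1]
BIAS = -2
Z_MIN = BIAS + sum(w for w in WEIGHTS if w < 0)
Z_MAX = BIAS + sum(w for w in WEIGHTS if w > 0)

def toy_logit(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS

def wang_landau(ln_f_final=1e-4, flatness=0.8, check_every=2000, seed=0):
    """Return the running estimate S~(z) of log rho(z), up to a constant."""
    rng = random.Random(seed)
    dim = len(WEIGHTS)
    x = [rng.randint(0, 1) for _ in range(dim)]
    z = toy_logit(x)
    entropy = {b: 0.0 for b in range(Z_MIN, Z_MAX + 1)}  # S~(z) per bin
    histogram = dict.fromkeys(entropy, 0)                # visits this stage
    ln_f = 1.0                                           # update size
    step = 0
    while ln_f > ln_f_final:
        # Symmetric proposal: flip one pixel chosen uniformly at random.
        i = rng.randrange(dim)
        z_new = z + WEIGHTS[i] * (1 - 2 * x[i])
        # Weight 1/rho~ gives acceptance min(1, rho~(z) / rho~(z_new)).
        log_ratio = entropy[z] - entropy[z_new]
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x[i] ^= 1
            z = z_new
        entropy[z] += ln_f
        histogram[z] += 1
        step += 1
        # Once all bins are visited about equally often, refine the update.
        if step % check_every == 0:
            mean = sum(histogram.values()) / len(histogram)
            if min(histogram.values()) >= flatness * mean:
                ln_f /= 2.0
                histogram = dict.fromkeys(entropy, 0)
    return entropy

def exact_log_rho():
    """Brute-force log rho(z); feasible only because the toy space is tiny."""
    counts = {}
    for x in product((0, 1), repeat=len(WEIGHTS)):
        z = toy_logit(x)
        counts[z] = counts.get(z, 0) + 1
    return {z: math.log(c) for z, c in counts.items()}

estimate = wang_landau()
exact = exact_log_rho()
# S~ is defined only up to an additive constant, so align by the mean offset.
offset = sum(estimate[z] - exact[z] for z in exact) / len(exact)
max_error = max(abs(estimate[z] - offset - exact[z]) for z in exact)
print(f"max |S~(z) - S(z)| after alignment: {max_error:.3f}")
```

The design choice worth noting is that the walker is pushed away from logit bins it has already visited often, which is what lets it reach rare outputs that a likelihood-driven sampler would seldom visit.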

The fundamental connection between the output distribution of neural networks and the DOS in physics has been discovered and elucidated in Ref. [39]. Additionally, it is shown that the traditional Wang–Landau algorithm sometimes struggles to explore the parameter space if the MC proposals are not designed carefully. The Gradient Wang–Landau (GWL) sampler [39] circumvents this problem by incorporating a gradient MC proposal similar to GWG [15], which improves Gibbs sampling by picking the pixels that are likely to change. The GWL sampler has demonstrated the feasibility and efficiency of sampling the entire input space for neural networks.

The key property of the output distribution samplers is that they can sample the output space equally and efficiently, thereby providing a survey of the input-output mapping for all the possible logits. This is in contrast with traditional MCMC samplers, which are biased toward sampling the logits corresponding to high log-likelihood (possibly informative samples) over the logits corresponding to low log-likelihood (noisy and uninformative samples).

2.2 Model-Centric Evaluation

Our model evaluation framework revolves around the output distribution sampler. Initially, we obtain the output distribution and the representative inputs exhibiting similar output logit values.

Representative Inputs. Although there are exponentially many uninformative samples in the entire input space, it is a common practice in generative model evaluation to generate (representative) samples by sampling algorithms and then evaluate these samples, for example with the Fréchet Inception Distance (FID) [23]. In our framework, other sampling algorithms can also be used to collect representative inputs. There should be no distributional difference in the representative inputs between different samplers (Fig. 8). However, Wang–Landau type algorithms provide a more effective means for traversing the logit space and are hence more efficient than traditional MCMC algorithms in sampling the representative inputs from the output distribution.

Normalized Output Distribution. To facilitate a meaningful comparison of different models based on their output distribution, it is important to sample the output distribution of (all) possible output values to ensure the normalization can be calculated as accurately as possible. We leverage the fact that the entire input space contains an identical count of $(N+1)^{D}$ samples for all models under comparison [30]. Consequently, the normalized output distribution $\rho(z)$ can be expressed as:

$$\log\rho(z)=\log\hat{\rho}(z)-\log\sum_{z'}\hat{\rho}(z'),$$

where $\hat{\rho}(z)$ denotes the unnormalized output distribution.
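Since sampled values of $\log\hat{\rho}(z)$ can be very large, the normalization is best carried out in log space. The following is a minimal sketch using the log-sum-exp trick; the function names and the optional `log_total` argument (which rescales the result so that it sums to a known total such as $(N+1)^{D}$) are assumptions made for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def normalize_log_rho(log_rho_hat, log_total=0.0):
    """Apply log rho(z) = log rho_hat(z) - log sum_z' rho_hat(z').

    log_rho_hat maps each logit bin z to an unnormalized log-count.
    With log_total = 0 the result sums to 1; passing D * log(N + 1)
    (an optional extension) makes it sum to (N+1)^D instead.
    """
    log_norm = log_sum_exp(list(log_rho_hat.values()))
    return {z: v - log_norm + log_total for z, v in log_rho_hat.items()}

# Unnormalized log-counts with a large arbitrary offset, in ratio 1 : 3 : 4.
log_rho_hat = {-1: 1000.0 + math.log(1), 0: 1000.0 + math.log(3), 1: 1000.0 + math.log(4)}
log_rho = normalize_log_rho(log_rho_hat)
fractions = {z: math.exp(v) for z, v in log_rho.items()}

# Rescale to absolute counts for a space of 2^3 = 8 inputs.
log_counts = normalize_log_rho(log_rho_hat, log_total=3 * math.log(2))
counts = {z: math.exp(v) for z, v in log_counts.items()}
print(fractions, counts)
```

Exponentiating `1000.0` directly would overflow a double, which is why the maximum is subtracted before summing.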

Annotation of Samples. For our classifiers, we designate a specific class as the target class. The (human) evaluators assign a score to each sample within the same “bin” of the output distribution (each “bin” collects the samples within a small range of logit values $[z-\Delta z,z+\Delta z)$). This score ranges from $0$, when the sample completely deviates from the evaluator’s judgment for the target class, to $1$, when the sample perfectly aligns with the evaluator’s judgment. Following the evaluation, the average score for each bin, termed “precision per bin”, $r(z)$, is calculated. It is the ratio of the total evaluation score on the samples to the total number of samples within that bin. We have 200-600 bins for the experiments.
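The per-bin averaging can be sketched as follows. This is an illustrative helper under assumed conventions, not the paper's code: the function name, the equal-width binning starting at `z_min`, and the toy logits and scores are all hypothetical.

```python
import math
from collections import defaultdict

def precision_per_bin(logits, scores, z_min, bin_width):
    """Average annotation score r(z) for each logit bin.

    Bin k covers [z_min + k * bin_width, z_min + (k + 1) * bin_width);
    the returned dict is keyed by the bin center.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for z, s in zip(logits, scores):
        if not 0.0 <= s <= 1.0:
            raise ValueError("annotation scores must lie in [0, 1]")
        k = math.floor((z - z_min) / bin_width)
        totals[k] += s
        counts[k] += 1
    return {z_min + (k + 0.5) * bin_width: totals[k] / counts[k]
            for k in sorted(counts)}

# Made-up logits of sampled inputs and their human scores (illustration only).
logits = [0.2, 0.7, 1.1, 1.9, 2.5]
scores = [0.0, 0.5, 1.0, 0.5, 1.0]
r = precision_per_bin(logits, scores, z_min=0.0, bin_width=1.0)
print(r)
```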

Precision and Recall. Without loss of generality, we assume that the target class corresponds to large logit values: we define a threshold $\lambda$ such that any samples with $z\geq\lambda$ are predicted as the target class. Thus, the precision given $\lambda$ is defined as

$$\mathrm{precision}_{\lambda}=\frac{\sum^{+\infty}_{z\geq\lambda}r(z)\rho(z)}{\sum^{+\infty}_{z\geq\lambda}\rho(z)}.$$

The numerator is the number of true positives and the denominator is the sum of true positives and false positives. The denominator can be interpreted as the area under the curve (AUC) of the output distribution from the threshold $\lambda$ to infinity.
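A small numerical sketch may make the weighting concrete. The function name and the three-bin values of $r(z)$ and $\rho(z)$ below are made up for illustration; they show how bins with many inputs dominate the estimate, unlike an unweighted average over bins.

```python
def precision_at(threshold, r, rho):
    """Weighted precision: sum of r(z) * rho(z) over sum of rho(z), z >= threshold."""
    selected = [z for z in rho if z >= threshold]
    predicted_positive = sum(rho[z] for z in selected)
    if predicted_positive == 0:
        return float("nan")  # nothing is predicted positive at this threshold
    true_positive = sum(r[z] * rho[z] for z in selected)
    return true_positive / predicted_positive

# Hypothetical bins: lower logits contain far more inputs.
rho = {0: 1000, 1: 100, 2: 10}   # output distribution (counts per bin)
r = {0: 0.0, 1: 0.5, 2: 1.0}     # precision per bin from annotation

p_high = precision_at(2, r, rho)  # only the top bin is predicted positive
p_mid = precision_at(1, r, rho)   # (0.5 * 100 + 1.0 * 10) / 110
unweighted = (r[1] + r[2]) / 2    # ignores rho and overstates the precision
print(p_high, p_mid, unweighted)
```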

When considering recall, we need to compute the total number of ground truth samples that the evaluators labeled as the target class. This total number of ground truth samples remains constant (albeit unknown) over the entire input space. Hence recall is proportional to $\sum^{+\infty}_{z\geq\lambda}r(z)\rho(z)$ :

$$\mathrm{recall}_{\lambda}=\frac{\sum^{+\infty}_{z\geq\lambda}r(z)\rho(z)}{\text{number of positive samples}}\propto\sum^{+\infty}_{z\geq\lambda}r(z)\rho(z).$$

A higher recall indicates a better model. As demonstrated above, the output distribution provides valuable information for deriving both precision and (unnormalized) recall. These metrics can be utilized for model evaluation through the precision-recall curve, by varying the threshold $\lambda$. In the extreme case where $\rho(z)$ differs significantly for different $z$, $\mathrm{precision}_{\lambda}$ is approximated as $r(z^{*})$ where $z^{*}=\operatorname*{arg\,max}_{z\geq\lambda}\rho(z)$, and $\mathrm{recall}_{\lambda}$ is approximated as $\max_{z\geq\lambda}r(z)\rho(z)$.
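Because $\rho(z)$ can span hundreds of orders of magnitude, one way to build the curve is to accumulate the sums in log space. The sketch below is an assumption about how this could be done rather than the paper's implementation; the helper names and the three example bins are hypothetical. It also illustrates the extreme case above, where the bin with the largest $\rho(z)$ dominates.

```python
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)), allowing -inf for an empty sum."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def pr_curve(bins):
    """bins: iterable of (z, r, log_rho).

    Returns a list of (threshold, precision, log_unnormalized_recall),
    with thresholds swept from the highest logit bin downward.
    """
    log_tp = -math.inf    # log of sum r(z) * rho(z) over z >= threshold
    log_pred = -math.inf  # log of sum rho(z) over z >= threshold
    curve = []
    for z, r, log_rho in sorted(bins, reverse=True):
        log_pred = log_add(log_pred, log_rho)
        if r > 0:
            log_tp = log_add(log_tp, math.log(r) + log_rho)
        curve.append((z, math.exp(log_tp - log_pred), log_tp))
    return curve

# Made-up bins whose log rho differs sharply between neighbours.
bins = [(10, 0.9, 5.0), (9, 0.8, 50.0), (8, 0.0, 500.0)]
curve = pr_curve(bins)
for threshold, precision, log_recall in curve:
    print(threshold, precision, log_recall)
```

In this example the precision at threshold 9 is essentially $r(9)=0.8$, since the bin at $z=9$ holds far more inputs than the bin at $z=10$, and it collapses toward zero once the uninformative bin at $z=8$ is included.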

Quantifying Overconfident Predictions in OmniInput. Overconfident predictions refer to the samples that (a) the model predicts as positive with very high confidence (i.e., above a very high threshold $\lambda$) but (b) humans judge to be negative. The ratio of overconfident predictions to the total positive predictions is simply $1-\mathrm{precision}_{\lambda}$ in OmniInput. Moreover, even if two models have nearly the same (high) precision, the difference in (unnormalized) recall $\mathrm{recall}_{\lambda}$ can indicate which model captures more ground-truth-positive samples. Therefore, compared to methods that only quantify overconfident prediction, OmniInput can offer a deeper insight into model performance using recall.

Scalability. Our OmniInput framework mainly focuses on how to leverage the output distribution for model evaluation over the entire input space. To handle larger input spaces and/or more complicated models, more efficient and scalable samplers are required. However, this is beyond the scope of this paper and we leave it as future work. Our evaluation framework is developed in parallel to the samplers and will be readily compatible with new samplers.

3 Experiments on MNIST and related datasets

The entire input space considered in our experiment contains $256^{28\times 28}$ samples (i.e., $28\times 28$ grayscale images), which is significantly larger than any of the pre-defined datasets, and even larger than the number of atoms in the universe (which is about $10^{81}$).
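The size of this input space can be checked with a short calculation; the snippet below is only an arithmetic illustration, and the variable names are arbitrary.

```python
import math

dim = 28 * 28   # number of pixels
n_values = 256  # gray levels per pixel

# log10 of the number of possible images, 256^(28*28).
log10_size = dim * math.log10(n_values)

# Python integers are arbitrary precision, so the digit count can be exact.
n_digits = len(str(n_values ** dim))

print(f"256^{dim} is about 10^{log10_size:.2f} ({n_digits} decimal digits)")
```

The exponent is roughly 1888, so the input space exceeds the quoted $10^{81}$ by more than 1800 orders of magnitude.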

Models for Evaluation. We evaluate several backbone models: a convolutional neural network (CNN), a multi-layer perceptron (MLP), and ResNet [19]. The details of the model architectures are provided in Appendix 9. We use the MNIST training set to build the classifiers, but we extract only the samples with labels $\{0,1\}$, which we refer to as MNIST-0/1. For generative models, we select only the samples with label=1 as MNIST-1; samples with any other label are considered OOD samples. We build models using different training methods: (1) using the vanilla binary cross-entropy loss, we built CNN-MNIST-0/1 and MLP-MNIST-0/1, which achieve test accuracy of 97.87% and 99.95%, respectively (the results for RES-MNIST-0/1 are omitted due to reported sampling issues in ResNet [39]); (2) using the binary cross-entropy loss and data augmentation by adding uniform noise with varying levels of severity to the input images, we built RES-AUG-MNIST-0/1, MLP-AUG-MNIST-0/1, and CNN-AUG-MNIST-0/1, which achieve test accuracy of 99.95%, 99.91%, and 99.33%, respectively; and (3) using energy-based models that learn by generating samples, we built RES-GEN-MNIST-1 and MLP-GEN-MNIST-1 (CNN-GEN-MNIST-1 is untrainable because its model complexity is low).

3.1 Traditional Data-centric Evaluation

We show that data-centric evaluation might be sensitive to different pre-defined test sets, leading to inconsistent evaluation results. Specifically, we construct different test sets for those MNIST binary classifiers by fixing the positive test samples as the samples in the MNIST test set with label=1, and varying the negative test samples in five different ways: (1) the samples in the MNIST test set with label=0 (in-dist), and the out-of-distribution (OOD) samples from other datasets such as (2) Fashion MNIST [68], (3) Kuzushiji MNIST [8], (4) EMNIST [9] with the byclass split, and (5) Q-MNIST [71].

Judging from the Area Under the Precision-Recall Curve (AUPR) scores in Table 1, pre-defined test sets such as the ones above can hardly lead to consistent model rankings in the evaluation. For example, RES-GEN-MNIST-1 performs the best on all the test sets with OOD samples but ranks only third out of four on the in-distribution test set. Also, CNN-MNIST-0/1 outperforms MLP-MNIST-0/1 on Kuzushiji MNIST, but on the other test sets, it typically performs the worst. Additional inconsistent results using other evaluation metrics can be found in Appendix 10.

3.2 Our Model-centric OmniInput Evaluation

Table 1: Traditional data-centric evaluations: Area Under the Precision-Recall Curve (AUPR) scores on pre-defined test sets with five different types of negative samples, leading to inconsistent evaluation results for model ranking. The MNIST label=0 column is in-distribution (in-dist); the other four columns are out-of-distribution (OOD).

Model	MNIST label=0	Fashion MNIST	Kuzushiji MNIST	EMNIST	QMNIST
CNN-MNIST-0/1	99.81	98.87	93.93	79.42	13.84
RES-GEN-MNIST-1†	99.99	100.00	99.99	99.87	16.49
RES-AUG-MNIST-0/1	100.00	99.11	93.93	95.10	15.69
MLP-MNIST-0/1	100.00	99.42	92.03	90.68	15.81

Precision-Recall Curves over the Entire Input Space. Fig. 1 presents a comprehensive precision-recall curve analysis using OmniInput. The results suggest that RES-AUG-MNIST-0/1 is probably the best model and MLP-MNIST-0/1 is the second best, demonstrating relatively high recall and precision scores. RES-GEN-MNIST-1, as a generative model, displays a low recall but a relatively good precision. Notably, CNN-MNIST-0/1 and CNN-AUG-MNIST-0/1 exhibit almost no precision greater than 0, indicating that “hand-written” digits are rare in the representative inputs even when the logit value is large (see Appendix 12). This suggests that these two models suffer severely from the overconfident prediction problem.

Insights from Representative Inputs. An inspection of the representative inputs (Appendix 12) reveals interesting insights. Firstly, different models exhibit distinct preferences for specific types of samples, indicating significant variations in their classification criteria. Specifically,

• MLP-MNIST-0/1 and MLP-AUG-MNIST-0/1 likely define the positive class as the background-foreground inverted version of digit “0”.

• CNN-MNIST-0/1 classifies samples with a black background as the positive class (digit “1”).

• RES-GEN-MNIST-1, a generative model, demonstrates that it can map digits to large logit values.

• RES-AUG-MNIST-0/1, a classifier with data augmentation, demonstrates that adding noise during training can help the models better map samples that look like digits to large logit values.

These results suggest that generative training methods can improve the alignment between model and human classification criteria, though they also underscore the need to enhance recall in generative models. Adding noise to the data during training can also help.

Moreover, RES-AUG-MNIST-0/1 exhibits relatively high recall as the representative inputs generally look like digit 1 with noise when the logits are high. Conversely, RES-GEN-MNIST-1 generates more visually distinct samples corresponding to the positive class, but with limited diversity in terms of noise variations.

Discussion of results. First, the failure case of CNN-MNIST-0/1 does not rule out that informative digit samples can be found in these logit ranges. It indicates that the number of these informative digit samples is so small that the model makes far more overconfident predictions than successful ones. Having this mixture of bad and (possibly) good samples mapped to the same outputs indicates a bad model, because the model outputs are uninformative and unreliable and further scrutiny of the samples is needed. Second, the model does not use reliable features, such as “shapes”, to distinguish samples. Had this model used shape to achieve high accuracy, the representative inputs would contain more shape-based samples instead of unstructured, black-background samples. Third, this failure case does not indicate that our sampler fails, because the same sampler finds informative samples for RES-GEN-MNIST-1.

The representative inputs of MLP-MNIST-0/1 and MLP-AUG-MNIST-0/1 display visual similarities but decreasing level of noise when the logit increases, indicating how the noise affects the model’s prediction. Importantly, this type of noise is presented by the model rather than trying different types of noise [20]. Our result indicates that OmniInput finds representative samples that may demonstrate distribution shifts with regard to model outputs.

Combining these findings with the previous precision-recall curve analysis suggests that different types of diversity may be preferred by the models. Future research endeavors can focus on enhancing both robustness and visual diversity.

Evaluation Effort, Efficiency and Human Annotation Ambiguity. We have at least 50 samples per bin for evaluation for all the models after deleting the duplicates. The models with fewer samples per bin typically have a larger number of bins due to the limitation in the sampling cost. Evaluating these samples in our OmniInput framework requires less effort than annotating a dataset collected for data-centric evaluation, e.g., 60000 samples for MNIST.

In Fig. 4, we vary the number of annotated samples per bin in OmniInput from 10 to 50 and plot different precision-recall curves for the MLP-MNIST-0/1 model. The results show that the evaluation converges quickly when the number of samples approaches 40 or 50, empirically demonstrating that OmniInput does not need many annotated samples, though the number required will be model-dependent. We believe that this is because the representative inputs follow some underlying patterns learned by the model.

We observe that models exhibit varying degrees of robustness and visual diversity. To assess the ambiguity in human labeling, we examine the variations in $r(z)$ when three different individuals label the same dataset (Fig. 3). Notably, apart from the CNN model, the other models display different levels of labeling ambiguity.

4 Results on CIFAR-10 and Language Model

CIFAR10 and Other Samplers. We train a ResNet binary classifier for the first two classes in CIFAR10, i.e., class 0 (airplane) vs. class 1 (automobile). The test set accuracy of this ResNet model is 93.34%. In Appendix 14, Fig. 9 shows the output distribution and Fig. 8 provides some representative inputs. We scrutinize 299 bins with 100 samples per bin on average. Even though the representative inputs seem to have shapes when their logits are very positive or negative, they are uninformative in general. We conclude that this classifier has almost 0 precision (with the given annotation effort) and suffers from serious overconfident prediction.

We also compare the representative inputs in OmniInput and the samples from a Langevin-like sampler [73] in Fig. 8. The sampling results show that our representative inputs generally agree with those of the other sampler(s).

Language Model. We fine-tune DistilBERT [56] on SST2 [58] and achieve 91% accuracy. We choose DistilBERT because of sampler efficiency concerns and leave LLMs as future work once more efficient samplers are developed. We then evaluate this model using OmniInput. Since the maximum sentence length in the SST2 dataset is 66 tokens, one can define the entire input space as the sentences with exactly 66 tokens. For shorter sentences, the last few tokens can simply be padding tokens. One might be more interested in shorter sentences because a typical sentence in SST2 contains 10 tokens. Therefore, we conduct the evaluation for length 66 and length 10, respectively. We sample the output distribution of this model until the algorithm converges; some representative inputs can be found in Appendix 13.

When the sentence has only 10 tokens, the representative inputs are not fluent or understandable sentences. For sentences of length 66, we have 15 bins with around 200 samples per bin. The representative inputs per bin for each logit show that the model classifies positive sentiment mostly based on a few positive keywords, without understanding the grammar and structure of the sentences. Therefore, the precision under human evaluation is very low, if not exactly zero, indicating that the model suffers from serious overconfident prediction.

5 Discussions

Human Annotation vs. Model Annotation. In principle, metrics employed in evaluating generative models [55, 23, 45, 54, 4] could be employed to obtain the $r(z)$ values in our method. However, our framework also raises the question of whether a model whose performance is not certified over the entire input space can generate features for evaluating another model. We examined the Fréchet Inception Distance (FID) [23], one of the most commonly used generative model performance metrics.

Table 2: Labeling results between humans and FID score. Although FID scores are similar between two models, humans label significantly differently than FID score.

RES-AUG-MNIST-0/1			CNN-MNIST-0/1
logits	humans $\uparrow$	FID $\downarrow$	logits	humans $\uparrow$	FID $\downarrow$
43	0.9	360.23	12	0	346.42
42	0.88	362.82	11	0	358.37
41	0.85	368.75	10	0	363.23
40	0.83	375.58	9	0	365.01

FID relies on a feature extractor to generate features for both the ground truth test set images and the images generated by the generative model. It then compares the distributional difference between these features. In our experiment, the ground truth samples are test set digits with label=1. In general, the performance trends are consistent between humans and FID scores: e.g., for RES-AUG-MNIST-0/1, the FID score decreases (better performance) and the human score increases (better performance) as the logit increases. This result suggests that the scores for evaluating generative models may be able to replace human annotations.

However, humans and these commonly used generative metrics can lead to very different results. Comparing the results of RES-AUG-MNIST-0/1 and CNN-MNIST-0/1, Table 2 shows that the FID score can be completely misleading. While the representative inputs of CNN-MNIST-0/1 do not contain any semantics for the logits on the table, the FID scores are similar to those of samples from RES-AUG-MNIST-0/1 where representative inputs are clearly “1.” This is not the only inconsistent case between humans and metrics. The trend of FID for MLP-MNIST-0/1 is also the opposite of human intuitions, as shown in Table 5 in Appendix 11. When the logits are large, humans label the representative inputs as “1.” When the logits are small, representative inputs look like “0.” However, the FID scores are better for these “0” samples, indicating the feature extractors believe these “0” samples look more like digits “1.” The key contradiction is that the feature extractors of these metrics, when trained on certain datasets, are not verified to be applicable to all OOD settings, but surely they will be applied in OOD settings to generate features of samples from models for evaluation. It is difficult to ensure they will perform reliably.

Perfect classifiers and perfect generative models could be the same. Initially it is difficult to believe that classifiers such as CNN-MNIST-0/1 perform poorly in the open-world setting when we assume the samples are from the entire input space. In retrospect, however, it is understandable because the classifiers are trained with the objective of the conditional probability $p(class|\mathbf{x})$ where $\mathbf{x}$ are from the training distribution. In order to deal with the open-world setting, the models also have to learn the data distribution $p(\mathbf{x})$ in order to tell whether the samples are from the training distribution. This seems to indicate the importance of learning $p(\mathbf{x})$, which is the objective of generative models. In Fig. 10, if we can construct a classifier with a perfect mapping over the entire input space, where the model successfully learns to map all positive and negative samples in the entire input space to high and low output values respectively, this model is also a generative model because we can use traditional MCMC samplers to reach the outputs with high (or low) values. As we know those output values only contain positive (or negative) samples, we are able to “generate” positive (or negative) samples. Therefore, we speculate that a perfect classifier and a perfect generator should converge to be the same model.

Our method reveals an important trade-off in generative models: they trade recall for precision. This means the model may miss many of the various ways of writing the digit “1.” In summary, our method can estimate not only the overconfident predictions of the models, but also the recall. Future work needs to improve both metrics in the entire input space for better models.

6 Related Works

Performance Characterization has been extensively studied in the literature [18, 28, 63, 50, 2, 1, 51, 49]. Previous research has focused on various aspects, including simple models [17] and mathematical morphological operators [14, 27]. In our method, we adopt a black box setting where the analytic characterization of the input-to-output function is unknown [10, 7], and we place emphasis on the output distribution [16]. This approach allows us to evaluate the model’s performance without requiring detailed knowledge of its internal workings. Furthermore, our method shares similarities with performance metrics used for generative models, such as the Fréchet Inception Distance score [23] and Inception Score [55]. Recent works [45, 54, 4] have formulated the evaluation problem in terms of precision and recall of the distributional differences between generated and ground truth samples. While these methods can be incorporated into our sampler to estimate precision, we leverage the output distribution to further estimate the precision-recall curve. Recent works [48, 31, 41, 47] evaluate model performance without a test set. They use other generators to generate samples for evaluating a model. In contrast, we use a sampler to sample the model to be evaluated. Sampling is transparent and comes with convergence estimates, whereas other generators are still black boxes. Given the inherently unknown biases in models, utilizing other models to evaluate a model (as explained in the Discussions section of our manuscript) carries the risk of yielding unfair and potentially incorrect conclusions. Our method brings the focus back to the model to be tested, tasking it with generating samples by itself for scrutiny, rather than relying on external agents such as humans or other models to come up with testing data. An additional benefit is that this approach offers a novel framework for estimating errors in the entire input space when comparing different models.

Samplers

MCMC samplers have gained widespread popularity in the machine learning community [5, 67, 34, 70]. Among these, CSGLD [12] leverages the Wang–Landau algorithm [66] to comprehensively explore the energy landscape. Gibbs-With-Gradients (GWG) [15] extends this approach to the discrete setting, while discrete Langevin proposal (DLP) [73] achieves global updates. Although these algorithms can in principle be used to sample the output distribution, efficiently sampling it requires an unbiased proposal distribution. As a result, these samplers may struggle to adequately explore the full range of possible output values. Furthermore, since the underlying distribution to be sampled is unknown, iterative techniques become necessary. The Wang–Landau algorithm capitalizes on the sampling history to efficiently sample the potential output values. The Gradient Wang–Landau algorithm (GWL) [39] combines the Wang–Landau algorithm with gradient proposals, resulting in improved efficiency.
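As an illustration of how the Wang–Landau idea uses sampling history to cover all output values, here is a minimal sketch of the plain (non-gradient) algorithm on a toy problem. It is not the GWL sampler [39]. The toy "model" is an assumption made for the example: the output of an 8-bit binary input is its number of ones, so the exact output distribution is the binomial coefficient and the estimate can be checked. The function name and all hyperparameters are hypothetical.

```python
import math
import random

def wang_landau_log_counts(n_bits=8, flatness=0.8, log_f_final=1e-3,
                           batch=10000, max_batches=500, seed=0):
    """Estimate log(number of inputs) per output value for the toy model."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    out = sum(x)                       # toy model output: number of ones
    log_g = [0.0] * (n_bits + 1)       # running log-density estimate per output bin
    hist = [0] * (n_bits + 1)          # visit counts since the last flatness reset
    log_f = 1.0                        # modification factor, halved when the histogram is flat
    for _ in range(max_batches):
        if log_f <= log_f_final:
            break
        for _ in range(batch):
            i = rng.randrange(n_bits)
            new_out = out + (1 - 2 * x[i])   # flipping bit i changes the count by +1 or -1
            # Accept with probability min(1, g(out) / g(new_out)).
            if rng.random() < math.exp(min(0.0, log_g[out] - log_g[new_out])):
                x[i] ^= 1
                out = new_out
            log_g[out] += log_f              # penalize the bin that was just visited
            hist[out] += 1
        if min(hist) >= flatness * sum(hist) / len(hist):
            hist = [0] * len(hist)
            log_f /= 2.0
    return [v - log_g[0] for v in log_g]     # fix the free additive constant

estimate = wang_landau_log_counts()
exact = [math.log(math.comb(8, k)) for k in range(9)]
for k, (e, t) in enumerate(zip(estimate, exact)):
    print(f"output={k}: estimated log-count {e:.2f}, exact {t:.2f}")
```

Because visited bins are penalized, the walker is pushed toward rarely visited output values, which is why even the extreme outputs (all zeros, all ones) are sampled despite being exponentially rare under uniform sampling.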

Open-world Model Evaluation requires a model to perform well on in-distribution test sets [13, 64, 59, 6, 75, 19, 57, 61, 25, 72], OOD detection [38, 21, 22, 24, 32, 33, 37, 44, 52], generalization [3, 60], and adversarial attacks [62, 53, 43, 29, 69, 42]. Understanding the performance of a model requires considering the entire input space, which includes all these types of samples.

7 Conclusion

In this paper, we introduce OmniInput, a new model-centric evaluation framework built upon the output distribution of the model. As future work, it is necessary to develop efficient samplers and scale to larger inputs and outputs. While the ML community has developed many new samplers, sampling the output distribution (and from larger inputs) has received far too little attention. Our work demonstrates the importance of sampling the output distribution by showing how it enables the quantification of model performance, hence the need for more efficient samplers. Scaling to multi-dimensional outputs is possible and has already been developed previously. Once scalable samplers are developed, our method will automatically scale to larger datasets, because the output distribution is independent of the training set.

References

[1] Farzin Aghdasi. Digitization and analysis of mammographic images for early detection of breast cancer. PhD thesis, University of British Columbia, 1994.
[2] Kevin Bowyer and P Jonathon Phillips. Empirical evaluation techniques in computer vision. IEEE Computer Society Press, 1998.
[3] Kaidi Cao, Maria Brbic, and Jure Leskovec. Open-world semi-supervised learning. In International Conference on Learning Representations, 2022.
[4] Fasil Cheema and Ruth Urner. Precision recall cover: A method for assessing generative models. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6571–6594. PMLR, 25–27 Apr 2023.
[5] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pages 1683–1691. PMLR, 2014.
[6] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
[7] Kyujin Cho, Peter Meer, and Javier Cabrera. Performance assessment through bootstrap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1185–1198, 1997.
[8] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature, 2018.
[9] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. 2017 International Joint Conference on Neural Networks (IJCNN), 2017.
[10] Patrick Courtney, Neil Thacker, and Adrian F Clark. Algorithmic modelling for performance evaluation. Machine Vision and Applications, 9(5):219–228, 1997.
[11] Antônio Gonçalves da Cunha-Netto, AA Caparica, Shan-Ho Tsai, Ronald Dickman, and David Paul Landau. Improving wang-landau sampling with adaptive windows. Physical Review E, 78(5):055701, 2008.
[12] Wei Deng, Guang Lin, and Faming Liang. A contour stochastic gradient langevin dynamics algorithm for simulations of multi-modal distributions. In Advances in Neural Information Processing Systems, 2020.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[14] Xiang Gao, Visvanathan Ramesh, and Terry Boult. Statistical characterization of morphological operator sequences. In European Conference on Computer Vision, pages 590–605. Springer, 2002.
[15] Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pages 3831–3841. PMLR, 2021.
[16] Michael Greiffenhagen, Dorin Comaniciu, Heinrich Niemann, and Visvanathan Ramesh. Design, analysis, and engineering of video monitoring systems: An approach and a case study. Proceedings of the IEEE, 89(10):1498–1517, 2001.
[17] AM Hammitt and EB Bartlett. Determining functional relationships from trained neural networks. Mathematical and computer modelling, 22(3):83–103, 1995.
[18] Robert M Haralick. Performance characterization in computer vision. In BMVC92, pages 1–8. Springer, 1992.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[20] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
[21] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[22] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[24] Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10951–10960, 2020.
[25] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
[26] Christoph Junghans, Danny Perez, and Thomas Vogel. Molecular dynamics in the multicanonical ensemble: Equivalence of wang–landau sampling, statistical temperature molecular dynamics, and metadynamics. Journal of chemical theory and computation, 10(5):1843–1847, 2014.
[27] Tapas Kanungo and Robert M Haralick. Character recognition using mathematical morphology. In Proc. of the Fourth USPS Conference on Advanced Technology, pages 973–986, 1990.
[28] Reinhard Klette, H Siegfried Stiehl, Max A Viergever, and Koen L Vincken. Performance characterization in computer vision. Springer, 2000.
[29] Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016.
[30] DP Landau, Shan-Ho Tsai, and M Exler. A new approach to monte carlo simulations in statistical physics: Wang-landau sampling. American Journal of Physics, 72(10):1294–1302, 2004.
[31] Oran Lang, Yossi Gandelsman, Michal Yarom, Yoav Wald, Gal Elidan, Avinatan Hassidim, William T. Freeman, Phillip Isola, Amir Globerson, Michal Irani, and Inbar Mosseri. Explaining in style: Training a gan to explain a classifier in stylespace. arXiv preprint arXiv:2104.13369, 2021.
[32] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
[33] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
[34] Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned stochastic gradient langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[35] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[36] Ying Wai Li and Markus Eisenbach. A histogram-free multicanonical monte carlo algorithm for the basis expansion of density of states. In Proceedings of the Platform for Advanced Scientific Computing Conference, pages 1–7, 2017.
[37] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
[38] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 2020.
[39] Weitang Liu, Yi-Zhuang You, Ying Wai Li, and Jingbo Shang. Gradient-based wang-landau algorithm: A novel sampler for output distribution of neural networks over the input space. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 22338–22351. PMLR, 23–29 Jul 2023.
[40] Yuntao Liu, Ankit Mondal, Abhishek Chakraborty, Michael Zuzak, Nina Jacobsen, Daniel Xing, and Ankur Srivastava. A survey on neural trojans. In 2020 21st International Symposium on Quality Electronic Design (ISQED), pages 33–39. IEEE, 2020.
[41] Jinqi Luo, Zhaoning Wang, Chen Henry Wu, Dong Huang, and Fernando De La Torre. Zero-shot model diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[42] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[43] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[44] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-supervised learning for generalizable out-of-distribution detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5216–5223, April 2020.
[45] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. 2020.
[46] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
[47] Viraj Prabhu, Sriram Yenamandra, Prithvijit Chattopadhyay, and Judy Hoffman. Lance: Stress-testing visual models by generating language-guided counterfactual images. In Neural Information Processing Systems (NeurIPS), 2023.
[48] Haonan Qiu, Chaowei Xiao, Lei Yang, Xinchen Yan, Honglak Lee, and Bo Li. Semanticadv: Generating adversarial examples via attribute-conditioned image editing. In ECCV, 2020.
[49] V Ramesh and RM Haralick. A methodology for automatic selection of iu algorithm tuning parameters. In ARPA Image Understanding Workshop, 1994.
[50] Visvanathan Ramesh, RM Haralick, AS Bedekar, X Liu, DC Nadadur, KB Thornton, and X Zhang. Computer vision performance characterization. RADIUS: Image Understanding for Imagery Intelligence, pages 241–282, 1997.
[51] Visvanathan Ramesh and Robert M Haralick. Random perturbation models and performance characterization in computer vision. In Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 521–522. IEEE Computer Society, 1992.
[52] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pages 14680–14691, 2019.
[53] Andras Rozsa, Ethan M Rudd, and Terrance E Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–32, 2016.
[54] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lučić, Olivier Bousquet, and Sylvain Gelly. Assessing Generative Models via Precision and Recall. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[55] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
[56] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.
[57] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[58] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
[59] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[60] Yiyou Sun and Yixuan Li. Open-world contrastive learning. arXiv preprint arXiv:2208.02764, 2022.
[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[62] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[63] Neil A Thacker, Adrian F Clark, John L Barron, J Ross Beveridge, Patrick Courtney, William R Crum, Visvanathan Ramesh, and Christine Clark. Performance characterization in computer vision: A guide to best practices. Computer vision and image understanding, 109(3):305–334, 2008.
[64] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
[65] Thomas Vogel, Ying Wai Li, Thomas Wüst, and David P Landau. Generic, hierarchical framework for massively parallel wang-landau sampling. Physical review letters, 110(21):210603, 2013.
[66] Fugao Wang and David P Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. Physical review letters, 86(10):2050, 2001.
[67] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
[68] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
[69] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2730–2739, 2019.
[70] Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 31, 2018.
[71] Chhavi Yadav and Léon Bottou. Cold case: The lost mnist digits. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019.
[72] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[73] Ruqi Zhang, Xingchao Liu, and Qiang Liu. A langevin-like sampler for discrete distributions. International Conference on Machine Learning, 2022.
[74] Chenggang Zhou, T. C. Schulthess, Stefan Torbrügge, and D. P. Landau. Wang-landau algorithm for continuous models and joint density of states. Phys. Rev. Lett., 96:120201, Mar 2006.
[75] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. ICLR, 2022.

8 Supplementary Material

9 Details of the Models used in Evaluation

The ResNet used in our experiments is the same as the one used in GWG [15]. For the input pixels, we employ one-hot encoding and transform them into a 3-channel output through a 3-by-3 convolutional layer. The resulting output is then processed by the backbone models to generate features. The CNN backbone consists of two 2-layer 3-by-3 convolutional filters with 32 and 128 output channels, respectively. The MLP backbone comprises a single hidden layer that takes flattened images as inputs and produces 128-dimensional features as output. All the features from the backbone models are ultimately passed through a fully-connected layer to generate a scalar output.

10 Traditional Model Evaluation Results

Tab. 3 shows the AUROC of different models on pre-defined test sets with different negative class(es). MLP-MNIST-0/1 performs better on Fashion MNIST but worse on the rest than RES-AUG-MNIST-0/1. RES-GEN-MNIST-1 usually performs the best. CNN-MNIST-0/1 performs better on Kuzushiji MNIST than RES-AUG-MNIST-0/1 and MLP-MNIST-0/1 but worse on the rest. Tab. 4 shows the FPR95 results. CNN-MNIST-0/1 again performs better on Kuzushiji MNIST than RES-AUG-MNIST-0/1 and MLP-MNIST-0/1 but worse on the rest. These results show the inconsistency across metrics, datasets, and models.

Test Set	class=0 (in-dist)	Fashion MNIST (OOD)	Kuzushiji MNIST (OOD)	EMNIST (OOD)	QMNIST (OOD)
CNN-MNIST-0/1	99.76	99.88	99.31	99.56	92.46
RES-GEN-MNIST-1†	99.99	100.00	100.00	100.00	94.85
RES-AUG-MNIST-0/1	100.00	99.91	99.15	99.93	94.32
MLP-MNIST-0/1	100.00	99.93	98.62	99.83	94.17

†Class=0 is OOD for GEN model.

Table 3: AUROC. The higher the better.

Test Set	class=0 (in-dist)	Fashion MNIST (OOD)	Kuzushiji MNIST (OOD)	EMNIST (OOD)	QMNIST (OOD)
CNN-MNIST-0/1	0.54	0.51	2.78	1.98	21.08
RES-GEN-MNIST-1†	0.00	0.00	0.00	0.00	10.55
RES-AUG-MNIST-0/1	0.00	0.34	4.60	0.31	14.17
MLP-MNIST-0/1	0.00	0.27	6.68	0.64	13.24

†Class=0 is OOD for GEN model.

Table 4: FPR95. The lower the better.
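For readers unfamiliar with the two metrics in Tables 3 and 4, the sketch below shows one common way to compute them from scalar model outputs. It is an illustrative implementation rather than the evaluation code used for the tables; the scores are made-up toy values, and the tables report the same quantities as percentages. AUROC is computed as the probability that a random positive outscores a random negative, and FPR95 as the false positive rate at the threshold that retains 95% of the positives.

```python
import math

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def fpr_at_95_tpr(pos_scores, neg_scores):
    """Fraction of negatives at or above the threshold that keeps 95% of positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))      # number of positives that must be retained
    threshold = ranked[k - 1]
    return sum(n >= threshold for n in neg_scores) / len(neg_scores)

# Toy logits, hypothetical values for illustration only.
pos = [4.0, 3.5, 3.0, 2.5, 0.5]            # e.g. in-distribution positives
neg = [1.0, 0.0, -1.0, -2.0]               # e.g. negatives from another test set

print("AUROC:", auroc(pos, neg))
print("FPR95:", fpr_at_95_tpr(pos, neg))
```

Both metrics depend entirely on which negatives are placed in `neg`, which is why the rankings of the models change from one column of the tables to the next.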

11 Human-metrics inconsistency

In Table 5, for MLP-MNIST-0/1, the FID scores indicate that the samples are bad where humans judge them to be good. FID even indicates better performance (lower scores) in the logit ranges where humans generally label the samples as incorrect.

logits	humans $\uparrow$	FID $\downarrow$
17	0.73	434.32
16	0.67	436.60
15	0.58	432.89
14	0.48	430.79
-19	0.18	422.01
-20	0.2	419.94
-21	0.2	412.96
-22	0.216	405.20

Table 5: For MLP-MNIST-0/1, the FID scores indicate that the samples are bad where humans judge them to be good. FID even indicates better performance (lower scores) in the logit ranges where humans generally label the samples as incorrect.
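The disagreement in Table 5 can be summarized with a single number. The short sketch below computes the Pearson correlation between the human scores and the FID values copied from the eight rows of the table. This is only illustrative arithmetic on a very small sample, not an additional experiment. Since higher human scores are better and lower FID is better, agreement between the two would show up as a negative correlation.

```python
import math

# Rows of Table 5 (logits 17, 16, 15, 14, -19, -20, -21, -22).
human = [0.73, 0.67, 0.58, 0.48, 0.18, 0.2, 0.2, 0.216]
fid = [434.32, 436.60, 432.89, 430.79, 422.01, 419.94, 412.96, 405.20]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

r = pearson(human, fid)
print(f"Pearson correlation between human score and FID: {r:.2f}")
```

The correlation comes out positive: the samples humans rate higher receive worse (higher) FID, which is the inconsistency described above.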

12 Representative inputs for MNIST images

Representative inputs for different models are in Fig. 5.

13 Representative inputs for SST2 dataset

For sentence length 66, some representative inputs with logit equal to 7 (positive sentiment) are shown in Fig. 6.

For sentence length 10, some representative inputs with logit equal to 7 (positive sentiment) are shown in Fig. 7.

14 Representative inputs for CIFAR10

Fig. 8 shows the representative inputs from MCMC samplers and our sampler. The values on top give the logit of the corresponding image (for the MCMC sampler) or of a column of images (for our sampler). The patterns found are essentially the same, indicating that our sampler finds the same type of representative inputs. Moreover, these samples are not recognizable to humans, suggesting that the precision will be very low.

15 Perfect classifier

Fig. 10 shows a perfect classifier. A perfect classifier maps all the ground truth digits “0” close to $p(y=1|\mathbf{x})=0$ and all the ground truth digits “1” close to $p(y=1|\mathbf{x})=1$. We speculate that such a classifier is also a perfect generative model.

16 Sampler Details

Gibbs-With-Gradients (GWG) [15] is a Gibbs sampler by nature, so it updates only one pixel at a time. Recently, a discrete Langevin proposal (DLP) [73] was proposed to achieve global updates, i.e., updating multiple pixels at a time. We adopt this sampler to traverse the input space more quickly, but we treat $-\frac{d\tilde{S}}{df}$ as the same value $\beta$ for both $q(\mathbf{x}^{\prime}|\mathbf{x})$ and $q(\mathbf{x}|\mathbf{x}^{\prime})$.

We use two different ways to generate $\beta$. In the first way, we sample $\beta$ uniformly from a range of values, including positive and negative values. In the second way, since the WL/GWL algorithms strive to achieve a flat histogram [39], we add a directional mechanism that directs the sampler to visit larger logit values before it moves to smaller logit values, and vice versa. We introduce a changeable parameter $\gamma\in\{-1,1\}$ to signify the direction. For example, if $\gamma=1$ and the sampler hits the maximum known logit, $\gamma$ is set to $-1$ to reverse the direction of the random walk. Moreover, we sample $\beta$ uniformly from a range of non-negative values in order to balance small updates ($\beta$ is small) and aggressive updates ($\beta$ is large). Finally, we check whether the current histogram entry passes the flatness check. If it does, this particular logit value has been sampled adequately, and we multiply $\beta$ by $\gamma$; otherwise, we set $\beta=0.1$, which slightly modifies the input but allows the sampler to stay in the current bin until the flatness check passes for the current logit value. With the above heuristic fixes, the sampler does not need to propose $\mathbf{x}^{\prime}$ with smaller $\tilde{S}$, but focuses on making the histogram flat.
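The second, directional way of choosing $\beta$ can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: the function name `choose_beta`, the bin bookkeeping, the upper bound `beta_max`, and in particular the per-bin flatness criterion (the bin count must reach a fraction `flat_ratio` of the mean count over visited bins) are all hypothetical choices made for the example. The symmetric reversal at the minimum known logit is our reading of "vice versa."

```python
import random

def choose_beta(bin_idx, hist, gamma, min_bin, max_bin,
                beta_max=1.0, flat_ratio=0.8, rng=random):
    """Return (beta, gamma) for the next proposal of the directional heuristic."""
    # Reverse the walking direction at the extreme known logit bins.
    if gamma == 1 and bin_idx >= max_bin:
        gamma = -1
    elif gamma == -1 and bin_idx <= min_bin:
        gamma = 1

    # Draw a non-negative step strength: small = cautious, large = aggressive.
    beta = rng.uniform(0.0, beta_max)

    # Assumed flatness check for the current bin only.
    visited = [h for h in hist if h > 0]
    mean_visits = sum(visited) / len(visited) if visited else 0.0
    bin_is_flat = hist[bin_idx] > 0 and hist[bin_idx] >= flat_ratio * mean_visits

    if bin_is_flat:
        beta *= gamma      # bin sampled adequately: move on in direction gamma
    else:
        beta = 0.1         # stay near the current bin while still perturbing the input
    return beta, gamma

rng = random.Random(0)
print(choose_beta(2, [10, 10, 10], gamma=1, min_bin=0, max_bin=2, rng=rng))
print(choose_beta(2, [10, 10, 1], gamma=1, min_bin=0, max_bin=5, rng=rng))
```

In the first call the sampler sits at the maximum known bin with an adequately filled histogram, so the direction flips and $\beta$ becomes non-positive. In the second call the current bin is under-sampled, so $\beta$ is fixed at 0.1 and the direction is unchanged.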

17 Results on CIFAR10 and CIFAR-100

Multi-class classification setting. The current output format for classification problems employs a one-hot encoding, representing an anticipated ground truth distribution. We define the output as the log-softmax of the prediction vector, so each dimension takes values in $(-\infty,0]$. This formulation allows each dimension of the log-softmax to be sampled, akin to the approach employed in the binary classification and generative model scenarios.
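As a small worked example of this output definition, the sketch below computes a numerically stable log-softmax in plain Python and checks that every component is non-positive, with a confident prediction mapping to a value near 0. The logit values are arbitrary illustrative numbers.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: logits minus log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

scores = log_softmax([2.0, 0.5, -1.0])      # an uncertain prediction
confident = log_softmax([50.0, 0.0, 0.0])   # a near one-hot prediction

print("log-softmax:", [round(s, 3) for s in scores])
print("confident class output:", confident[0])
```

Each dimension is then a scalar in $(-\infty,0]$ that can be treated like the logit in the binary setting, with the value 0 playing the role of the most confident output for that class.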

Results of representative samples and output distribution. We train ResNet on CIFAR-10 to reach $88\%$ accuracy and on CIFAR-100 to reach $62\%$ accuracy with cross-entropy. Scrutinizing the samples from the (in-dist) test set with log-softmax near 0 confirms that the model trained on CIFAR-10 successfully learns to map these samples to log-softmax values near $0$. Fig. 11 shows representative samples and output distributions.

First, we plot the representative samples of CIFAR-10 for class 0 and 1 respectively. Building upon the analysis previously articulated in the context of MNIST, wherein it was demonstrated that classifiers generally fail to learn the data distribution, our observations extend to the current model. Specifically, the model tends to map a significant portion of uninformative samples to the output region where informative test set samples reside, resulting in a precision value of 0 in the precision-recall curve.

Second, different from the previous experiments where the output distribution for informative test set inputs (output values near 0) was generally low in binary classification, our findings in the context of multi-class classification reveal a notable distinction. Specifically, the output distribution for these regions tends to be high, indicating that the model maps a substantial number of uninformative samples to the output values shared by informative test set samples.

Lastly, we extended our analysis to CIFAR-100, and the observed trend in output distribution is generally consistent with that of CIFAR-10. Thus, to ensure the model’s effectiveness across the entirety of the input space, there remains a necessity for further refinement and enhancement of precision in log-softmax values near 0.