OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference

Dujian Ding University of British Columbia Bicheng Xu University of British Columbia Laks V.S. Lakshmanan University of British Columbia
Abstract

Image classification is a fundamental building block for a majority of computer vision applications. With the growing popularity and capacity of machine learning models, people can easily access trained image classifiers as a service online or offline. However, model use comes with a cost and classifiers of higher capacity usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over image classification queries (termed as the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves 40% cost reduction with little to no accuracy drop.

1 Introduction

With the breakthroughs in AI and advances in computer hardware (e.g., GPUs and TPUs) in recent decades, applications of computer vision have permeated our daily lives, ranging from face recognition systems to autonomous driving technologies. Among all the day-to-day computer vision applications, a fundamental building block is the task of image classification, where given an image, an algorithm needs to recognize the object content inside the image.

The task of image classification has a long history in the computer vision literature. Before the emergence of deep learning, people mainly focused on designing handcrafted features or descriptors for images, such as HOG [Dalal and Triggs, 2005] and SIFT [Lowe, 2004]. With the growing capability of deep learning models, many neural network architectures including convolutional neural networks (CNNs) [LeCun et al., 1998] and Transformers [Vaswani et al., 2017] have been proposed, e.g., AlexNet [Krizhevsky et al., 2012], ResNet [He et al., 2016], Vision Transformer [Dosovitskiy et al., 2020], and Swin Transformer [Liu et al., 2021]. Though larger neural network models are equipped with higher capacity, they often come with higher costs as well, e.g., hardware usage and latency (time), for both training and inference. This can potentially impose an enormous cost on both end users of image classification services and the service providers, such as Google (https://cloud.google.com/prediction), Amazon (https://aws.amazon.com/machine-learning), and Microsoft (https://studio.azureml.net). In response to this challenge, there has been a notable surge in interest directed towards the development of smaller, cost-effective image classifiers, e.g., MobileNet [Howard et al., 2017], where depthwise separable convolutions are used to trade classification accuracy for efficiency. However, empirical evaluations conducted in [Su et al., 2018], as well as our own independent assessment (see Figure 1(a)), consistently indicate that smaller models tend to exhibit a gap in classification accuracy compared to their larger counterparts.

Figure 1: (a) Accuracy vs. classifier size. (b) Classifier agreement frequency. (c) OCCAM results. We investigate the Tiny ImageNet dataset consisting of 200 classes (see Section 5 for details). We observe that (a) smaller classifiers (e.g., ResNet-18) generally yield lower accuracy, (b) small classifiers can agree with large classifiers at a high frequency (each entry indicates the percentage of queries on which the classifier on the row makes the right prediction and so does the classifier on the column), and (c) our approach OCCAM achieves 20% cost reduction with less than 1% accuracy drop.

Confronted with the general tradeoff between classification accuracy and inference cost, we advocate a hybrid inference framework which seeks to combine the advantages of both small and large models. Specifically, we study the following problem: given a user-specified cost budget and a group of ML classifiers of different capacity and cost, assign classifiers to resolve different image classification queries so that the aggregated accuracy is maximized and the overall cost is under the budget. We formally define it as the optimal model portfolio problem (details in Section 3). Our approach is motivated by the observation that while small classifiers typically have reduced accuracy over the population, they can still agree with large classifiers on a large proportion of queries, which suggests the existence of a subset of “easy” queries on which even small classifiers can make the right prediction. This is also illustrated in Figure 1(b), where we plot the frequency with which different classifiers make the right prediction on the same image queries. For instance, ResNet-18 [He et al., 2016] can correctly classify 75% of the images on which SwinV2-B [Liu et al., 2022] makes the right prediction, suggesting that we can replace SwinV2-B with ResNet-18 on these image queries, saving significant inference costs without any accuracy drop (details in Section 5).

With this insight, we propose a principled approach, Optimization with Cost Constraints for Accuracy Maximization (OCCAM), to effectively identify easy queries and assign classifiers to different user queries to maximize the overall classification accuracy subject to the given cost budgets. We present an unbiased and low-variance estimator for classifier test accuracy with asymptotic guarantees. The intuition is that for well-separated classification problems such as image classification [Yang et al., 2020], we can learn robust classifiers that have similar performance on similar queries. For each query image, we compute its nearest neighbours in pre-computed samples to estimate the test accuracy for each classifier. Previous work [Chen et al., 2022] trains ML models to predict the accuracy, which requires sophisticated configuration and lacks performance guarantees that are critical in real-world scenarios. To the best of our knowledge, we are the first to open up the black box by developing a white-box accuracy estimator for ML classifiers with statistical guarantees. Next, armed with our classifier accuracy estimator, we compute the optimal classifier assignment strategy over all query images (optimal model portfolio) subject to a given cost budget by solving an integer linear programming (ILP) problem (see Section 4). As a preview, Figure 1(c) shows that OCCAM can achieve 20% cost reduction with less than 1% accuracy drop. We show even higher cost reduction with little to no accuracy drop on various real-world datasets in Section 5. Figure 2 depicts the overall pipeline of OCCAM.

Our main technical contributions are: (1) we formally define the optimal model portfolio problem to reduce overall inference costs while maintaining high performance subject to user-specified cost budgets (Section 3); (2) we propose a novel and principled approach, OCCAM, to effectively compute the optimal model portfolio with statistical guarantees (Section 4); and (3) we provide an extensive experimental evaluation on a variety of real-world datasets on the image classification task (Section 5) demonstrating the effectiveness of OCCAM.

Figure 2: OCCAM: Optimization with Cost Constraints towards Accuracy Maximization.

2 Related Work

Image Classification. Image classification is a fundamental task in computer vision, where given an image, a label needs to be predicted. It serves as an essential building block for many high-level AI tasks, e.g., image captioning [Vinyals et al., 2015] and visual question answering [Antol et al., 2015], where objects need to be first recognized. Before the deep learning era, researchers mainly adopted statistical methods with handcrafted features for the task, e.g., SIFT [Lowe, 2004]. With the growing capacity of deep learning models, from convolutional neural networks (CNN) [Krizhevsky et al., 2012, Simonyan and Zisserman, 2014, He et al., 2016, Szegedy et al., 2016] to Transformer architectures [Dosovitskiy et al., 2020, Liu et al., 2021], the classification accuracy on standard image classification benchmarks [Krizhevsky et al., 2009, Russakovsky et al., 2015] has been greatly improved. In this work, we utilize both CNN (e.g., ResNet models) and Transformer (e.g., Swin Transformers) image classifiers to illustrate and evaluate our proposed approach, OCCAM.

Efficient Machine Learning (ML) Inference. Efficient ML inference is crucial for real-time decision-making in various applications such as autonomous vehicles [Tang et al., 2021], healthcare [Miotto et al., 2018], and fraud detection [Alghofaili et al., 2020]. It involves applying a pre-trained ML model to make predictions, where the inference cost is expected to dominate the overall cost incurred by the model. Model compression, which replaces a large model with a smaller model of comparable accuracy, is the most common approach employed for enhancing ML inference efficiency. Common techniques for model compression include model pruning [Hassibi et al., 1993, LeCun et al., 1989], quantization [Jacob et al., 2018, Vanhoucke et al., 2011], knowledge distillation [Hinton et al., 2015, Urban et al., 2016], neural architecture search [Elsken et al., 2019, Zoph and Le, 2016], and so on. These static efficiency optimizations typically lead to a fixed model with lower inference cost but also reduced accuracy compared to its larger counterpart, which may not suffice in highly sensitive applications like collision detection [Wang et al., 2021] and prognosis prediction [Zhu et al., 2020]. This shortcoming is already evident in the inference platforms discussed in Section 1, highlighting the need for more dynamic optimizations to effectively address the diverse demands of users.

Hybrid ML Inference. Recent works [Kag et al., 2022, Ding et al., 2022, 2024] have introduced a novel inference paradigm termed hybrid inference, which invokes models of different sizes on different queries, as opposed to employing a single model on all inference queries. The smaller model generally incurs a lower inference cost but also exhibits reduced accuracy compared to the larger model. The key idea is to identify easy inference queries on which the small models are likely to make correct predictions and invoke small models on them when cost budgets are limited, thereby reducing overall inference costs while preserving solution accuracy. By adjusting the cost budgets, users can dynamically trade off between accuracy and cost within the same inference setup.  [Kag et al., 2022, Ding et al., 2022, 2024] consider a simple setting of only one large and one small model and do not allow for explicit cost budget specification, which could be necessary for production scenarios. [Chen et al., 2020] studies a setup with multiple ML models and learns an adaptive strategy to generate predictions by calling a base model and sometimes an add-on model when the base model quality scores are lower than the learned thresholds. However, both base and add-on models are selected in a probabilistic manner and this approach fails to satisfy the user-specified cost budgets deterministically. [Chen et al., 2022] studies a similar setup with multiple ML models and allocates cost budgets according to model-based accuracy prediction. This approach requires a separate training phase for the accuracy predictor, which needs a large amount of training data, and provides no guarantee on the prediction quality. Unlike previous works, we propose an unbiased and low-variance accuracy estimator with asymptotic guarantees, based on which we present a novel approach, OCCAM, to effectively compute the optimal assignment of classifiers to given queries, under the cost budgets given by users.

3 Problem Definition

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be an instance space (e.g., images) equipped with a metric $dist:\mathcal{X}\times\mathcal{X}\to\mathbb{R}^{\geq 0}$, and $[C]=\{1,2,\cdots,C\}$ be the set of possible labels with $C\geq 2$. Let $\mathcal{X}$ contain $C$ disjoint classes, $\mathcal{X}^{(1)},\mathcal{X}^{(2)},\cdots,\mathcal{X}^{(C)}$, where for each $i\in[C]$, all $x\in\mathcal{X}^{(i)}$ have label $i$. Let $f_{1},f_{2},\cdots,f_{M}$ be a set of classifiers, with $b_{i}$ being the cost of a single inference call of $f_{i}$. Given a query $x\in\mathcal{X}$, each classifier $f_{i}$ outputs a single label from $[C]$ at the cost $b_{i}$. We define a model portfolio as follows.

Definition 3.1 (Model Portfolio).

Given queries $X\subseteq\mathcal{X}$ to be classified and classifiers $f_{1},f_{2},\cdots,f_{M}$, a model portfolio $\mu$ is a mapping $\mu:X\to[M]$ such that each $x\in X$ is classified by the classifier $f_{\mu(x)}$.

We assume an oracle classifier $O:\mathcal{X}\to[C]$ which outputs the ground truth label $O(x)$ for all queries $x\in X$. Given a finite set of queries $X\subseteq\mathcal{X}$, the accuracy of a model portfolio $\mu$ on $X$ is the frequency of the ground truth labels correctly predicted by $\mu$, i.e., $Accuracy_{X}(\mu)=\frac{\sum_{x\in X}\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}}{|X|}$, where $\mathbbm{1}\{condition\}$ is an indicator function that outputs $1$ iff $condition$ is satisfied. Similarly, the cost of model portfolio $\mu$ on $X$ is the sum of all inference costs incurred by executing $\mu$ on $X$, i.e., $Cost_{X}(\mu)=\sum\nolimits_{x\in X}b_{\mu(x)}$.
We will use the notation $Accuracy(\mu)$, $Cost(\mu)$ when $X$ is clear from the context. We define our problem as follows.
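
As a concrete reading of these two definitions, the following minimal Python sketch computes $Accuracy_X(\mu)$ and $Cost_X(\mu)$ for a toy portfolio. The function names, the dictionary-based data layout, the 0-based classifier indices, and all numbers are illustrative assumptions and are not taken from the OCCAM implementation.

```python
from typing import Dict, List


def portfolio_accuracy(portfolio: Dict[str, int],
                       predictions: List[Dict[str, int]],
                       truth: Dict[str, int]) -> float:
    """Fraction of queries whose assigned classifier outputs the true label."""
    correct = sum(1 for x, i in portfolio.items() if predictions[i][x] == truth[x])
    return correct / len(portfolio)


def portfolio_cost(portfolio: Dict[str, int], costs: List[float]) -> float:
    """Sum of the per-call costs b_{mu(x)} over all queries."""
    return sum(costs[i] for i in portfolio.values())


# Toy data (made up): two classifiers, three queries.
truth = {"q1": 0, "q2": 1, "q3": 1}
predictions = [
    {"q1": 0, "q2": 0, "q3": 1},  # classifier 0: cheap, wrong on q2
    {"q1": 0, "q2": 1, "q3": 1},  # classifier 1: expensive, right on all three
]
costs = [1.0, 5.0]                   # hypothetical b_0 and b_1
mu = {"q1": 0, "q2": 1, "q3": 0}     # portfolio: query -> classifier index

acc = portfolio_accuracy(mu, predictions, truth)
cost = portfolio_cost(mu, costs)
print(acc, cost)
```

In this toy instance the portfolio sends only q2 to the expensive classifier, so it reaches the same accuracy as calling classifier 1 on every query (total cost 15.0) while spending 7.0.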

Definition 3.2 (Optimal Model Portfolio).

Given queries $X\subseteq\mathcal{X}$, a cost budget $B\in\mathbb{R}^{+}$, and classifiers $f_{1},f_{2},\cdots,f_{M}$, find the optimal model portfolio $\mu^{*}$ such that $Cost(\mu^{*})\leq B$ and $Accuracy(\mu^{*})\geq Accuracy(\mu)$ for all model portfolios $\mu$ with $Cost(\mu)\leq B$.
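
For intuition only, Definition 3.2 can be solved on a tiny instance by exhaustive enumeration. The sketch below assumes that per-query correctness is known in advance, which does not hold in practice (Section 4 instead works with estimated accuracy and an ILP formulation), and it enumerates all $M^{|X|}$ portfolios, so it is not a practical algorithm. All names and numbers are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def brute_force_optimal_portfolio(correct: Sequence[Sequence[int]],
                                  costs: Sequence[float],
                                  budget: float) -> Tuple[Tuple[int, ...], float]:
    """correct[i][q] is 1 if classifier i is right on query q, else 0.

    Returns (assignment, accuracy) with maximal accuracy among all
    assignments whose total cost does not exceed the budget.
    """
    num_models, num_queries = len(correct), len(correct[0])
    best = None
    for mu in product(range(num_models), repeat=num_queries):
        if sum(costs[i] for i in mu) > budget:
            continue  # infeasible: violates Cost(mu) <= B
        acc = sum(correct[i][q] for q, i in enumerate(mu)) / num_queries
        if best is None or acc > best[1]:
            best = (mu, acc)
    if best is None:
        raise ValueError("no portfolio fits within the budget")
    return best


correct = [
    [1, 0, 1, 0],  # cheap classifier: right on queries 0 and 2 only
    [1, 1, 1, 1],  # expensive classifier: right on every query
]
costs = [1.0, 5.0]
mu_star, acc_star = brute_force_optimal_portfolio(correct, costs, budget=12.0)
print(mu_star, acc_star)
```

With a budget of 12.0 the only portfolio reaching full accuracy uses the expensive classifier exactly on the two queries the cheap one gets wrong.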

4 Methodology

We describe the general framework to solve the optimal model portfolio problem in the next sections. Our overall strategy consists of two steps. Firstly, we propose an unbiased low-variance estimator for the accuracy of any given model portfolio $\mu$, with asymptotic guarantees. Next, we describe how to determine the optimal model portfolio $\mu^{*}$ by formulating it as an integer linear programming (ILP) problem, subject to user-specified budget constraints. All proofs can be found in Appendix B.

4.1 Estimating $Accuracy(\mu)$

Previous work on Hybrid ML [Kag et al., 2022, Ding et al., 2022, 2024, Chen et al., 2022] typically relies on training a neural router to predict the accuracy of a given set of classifiers for given user queries, based on which queries are routed to different classifiers. Such a paradigm not only involves a non-trivial training configuration but also lacks estimation guarantees which can be critical in scientific and production settings. We propose a principled approach to estimate the test accuracy of a given model portfolio for given user queries. By leveraging the specific structure of well-separated classification problems like image classification, we propose an unbiased low-variance estimator for the test accuracy with asymptotic guarantees.

Without loss of generality, we consider the wide class of soft classifiers in this study. Given query $x\in\mathcal{X}$, a soft classifier first outputs a distribution over all labels $[C]$, based on which it then makes a prediction at random. Given a soft classifier $f_{i}$, we abuse the notation and let $f_{i}(x)[j]$ denote the likelihood that $f_{i}$ predicts label $j\in[C]$, that is, $f_{i}(x)[j]:=Pr[f_{i}(x)=j]$, for query $x\in\mathcal{X}$. Deterministic classifiers (e.g., the oracle $O$) can be seen as a special case of soft classifiers with a one-hot distribution over all labels. In practice, soft classifiers can be constructed from softmax classifiers (e.g., ResNet) by simply sampling w.r.t. the probability distribution output by the softmax layer.
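
A minimal sketch of this construction, assuming the classifier exposes raw logits: a softmax turns them into a label distribution, and the soft classifier samples its prediction from that distribution. The helper names, the 0-based label indices, and the logits are illustrative assumptions.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_predict(probs: List[float], rng: random.Random) -> int:
    """Sample a label index j with probability probs[j]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)                 # fixed seed so the run is repeatable
probs = softmax([2.0, 0.5, 0.1])       # hypothetical logits for C = 3 labels
draws = [soft_predict(probs, rng) for _ in range(10000)]
empirical = [draws.count(j) / len(draws) for j in range(len(probs))]
print(probs, empirical)

# A deterministic classifier is the one-hot special case.
one_hot_label = soft_predict([0.0, 1.0, 0.0], rng)
```

The empirical label frequencies approach `probs` as the number of draws grows, which is the sense in which $f_{i}(x)[j]$ is the probability of predicting label $j$.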

Clearly, given a model portfolio $\mu$, $Accuracy(\mu)$ is a random variable due to the random nature of the soft classifiers. The expected accuracy of any given model portfolio $\mu$ is,

$$\mathbb{E}[Accuracy(\mu)]=\mathbb{E}\left[\frac{\sum_{x\in X}\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}}{|X|}\right]=\frac{\sum_{x\in X}\mathbb{E}[\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}]}{|X|}=\frac{\sum_{x\in X}f_{\mu(x)}(x)[O(x)]}{|X|} \tag{1}$$

where the last equality follows from $\mathbb{E}[\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}]=1\cdot Pr[f_{\mu(x)}(x)=O(x)]=f_{\mu(x)}(x)[O(x)]$. Note that $f_{i}(x)[O(x)]$ is the success probability that the classifier $f_{i}$ correctly predicts the ground truth label for query $x$. For brevity, we define $SP_{i}(x):=f_{i}(x)[O(x)]$ and rewrite the expected accuracy as $\mathbb{E}[Accuracy(\mu)]=\frac{\sum_{x\in X}SP_{\mu(x)}(x)}{|X|}$.
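
The identity above can be checked numerically on a toy instance. The sketch below computes the expected accuracy as the mean success probability and compares it with a Monte Carlo average over sampled predictions. The data layout, names, and probabilities are assumptions made for illustration, and the ground truth is assumed known here, which is exactly what is unavailable at inference time.

```python
import random
from typing import Dict, List


def expected_accuracy(portfolio: Dict[str, int],
                      soft_outputs: List[Dict[str, List[float]]],
                      truth: Dict[str, int]) -> float:
    """Mean of SP_{mu(x)}(x) = f_{mu(x)}(x)[O(x)] over the queries."""
    sp = [soft_outputs[i][x][truth[x]] for x, i in portfolio.items()]
    return sum(sp) / len(sp)


def simulated_accuracy(portfolio: Dict[str, int],
                       soft_outputs: List[Dict[str, List[float]]],
                       truth: Dict[str, int],
                       trials: int,
                       rng: random.Random) -> float:
    """Monte Carlo check: sample every prediction, then average the accuracy."""
    total = 0.0
    for _ in range(trials):
        hits = 0
        for x, i in portfolio.items():
            probs = soft_outputs[i][x]
            label = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            hits += int(label == truth[x])
        total += hits / len(portfolio)
    return total / trials


soft_outputs = [
    {"q1": [0.9, 0.1], "q2": [0.6, 0.4]},  # classifier 0 (made-up outputs)
    {"q1": [0.8, 0.2], "q2": [0.1, 0.9]},  # classifier 1 (made-up outputs)
]
truth = {"q1": 0, "q2": 1}
mu = {"q1": 0, "q2": 1}

exact = expected_accuracy(mu, soft_outputs, truth)  # (0.9 + 0.9) / 2
approx = simulated_accuracy(mu, soft_outputs, truth, 20000, random.Random(0))
print(exact, approx)
```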

The exact computation of success probability is intractable since the ground truth of user queries is unknown a priori. We propose a novel data-driven approach to estimate it for any classifier and show that our estimator is unbiased and low-variance with asymptotic guarantees, for well-separated classification problems like image classification. Based on this, we develop a principled approach for estimating the expected accuracy of a given model portfolio.

Definition 4.1 ($r$-separation [Yang et al., 2020]).

We say a metric space $(\mathcal{X},dist)$ where $\mathcal{X}=\cup_{i\in[C]}\mathcal{X}^{(i)}$ is $r$-separated, if there exists a constant $r>0$ such that $dist(\mathcal{X}^{(i)},\mathcal{X}^{(j)})\geq r$, $\forall i\neq j$, where $dist(\mathcal{X}^{(i)},\mathcal{X}^{(j)})=\min_{x\in\mathcal{X}^{(i)},x^{\prime}\in\mathcal{X}^{(j)}}dist(x,x^{\prime})$.

In words, in an $r$-separated metric space, there is a constant $r>0$, such that the distance between instances from different classes is at least $r$. The key observation is that many real-world classification tasks comprise distinct classes. For instance, images of different categories (e.g., gold fish, bullfrog, etc.) are very unlikely to sharply change their classes under minor image modification. It has been widely observed [Yang et al., 2020] that the classification problem on real-world images empirically satisfies $r$-separation under standard metrics (e.g., $l_{\infty}$ norm). We also observe similar patterns on a number of standard image datasets (e.g., Tiny ImageNet) and provide more empirical evidence in Section A.1. With this observation, we can show that the oracle classifier $O$ is Lipschitz continuous [Eriksson et al., 2004] (for soft classifiers, Lipschitz continuity is defined w.r.t. the output distribution).
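
On a finite labelled dataset, $r$-separation can be probed by computing the smallest distance between any two points with different labels. The sketch below does this under the $l_{\infty}$ metric on a made-up two-class point set; the function names and data are assumptions. Note that a finite sample can only overestimate the true separation of the underlying space, so a positive value is supporting evidence rather than a proof.

```python
import math
from itertools import combinations
from typing import Sequence


def linf(a: Sequence[float], b: Sequence[float]) -> float:
    """l_infinity distance between two equal-length vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def min_interclass_distance(points: Sequence[Sequence[float]],
                            labels: Sequence[int]) -> float:
    """Smallest l_infinity distance between two points with different labels.

    Returns math.inf when all points share one label.
    """
    best = math.inf
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        if lp != lq:
            best = min(best, linf(p, q))
    return best


points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.3)]  # toy "images"
labels = [0, 0, 1, 1]
r_hat = min_interclass_distance(points, labels)
print(r_hat)
```

The pairwise loop is quadratic in the number of points, so on real image datasets one would restrict it to a subsample or use a nearest-neighbour index.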

Definition 4.2 (Lipschitz Continuity).

Given two metric spaces $(\mathbf{X},d_{\mathbf{X}})$ and $(\mathbf{Y},d_{\mathbf{Y}})$ where $d_{\mathbf{X}}$ (resp. $d_{\mathbf{Y}}$) is the metric on the set $\mathbf{X}$ (resp. $\mathbf{Y}$), a function $f:\mathbf{X}\to\mathbf{Y}$ is Lipschitz continuous if there exists a constant $L\geq 0$ s.t.

$$\forall x,x^{\prime}\in\mathbf{X}:\;\;d_{\mathbf{Y}}(f(x),f(x^{\prime}))\leq L\cdot d_{\mathbf{X}}(x,x^{\prime}) \tag{2}$$

and the smallest $L$ satisfying Equation 2 is called the Lipschitz constant of $f$.
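
To make the definition concrete, the sketch below evaluates the ratio $d_{\mathbf{Y}}(f(x),f(x^{\prime}))/d_{\mathbf{X}}(x,x^{\prime})$ over all pairs of a finite point set, for scalar functions with the absolute difference as both metrics (an assumption chosen for simplicity). The largest observed ratio is only a lower bound on the Lipschitz constant, since unseen pairs could yield a larger ratio. The function name and test points are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence


def empirical_lipschitz(f: Callable[[float], float], xs: Sequence[float]) -> float:
    """Largest observed |f(a) - f(b)| / |a - b| over pairs of distinct points."""
    ratios = [abs(f(a) - f(b)) / abs(a - b)
              for a, b in combinations(xs, 2) if a != b]
    return max(ratios)


xs = [-2.0, -0.5, 1.0, 3.0]
L_abs = empirical_lipschitz(abs, xs)                    # |x| has Lipschitz constant 1
L_lin = empirical_lipschitz(lambda v: 3.0 * v, xs)      # 3x has Lipschitz constant 3
print(L_abs, L_lin)
```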

Lemma 4.3.

There exists an oracle classifier $O$ which is Lipschitz continuous if the metric space associated with the instances $\mathcal{X}$ is $r$-separated.

If we further choose the classifiers $f_{i}$ to be Lipschitz continuous (e.g., MLP [Bartlett et al., 2017], ResNet [Gouk et al., 2021], Lipschitz continuous Transformer [Qi et al., 2023]), we can show that the success probability function $SP_{i}(x)$ (i.e., the likelihood that a classifier $f_{i}$ successfully predicts the ground truth label for query $x$) is also Lipschitz continuous.

Lemma 4.4.

The success probability function $SP_{i}(x)=f_{i}(x)[O(x)]$ is Lipschitz continuous if $f_{i}(x)$ and $O(x)$ are Lipschitz continuous.

An important implication of Lemma 4.4 is that, given a classifier, we can estimate its success probability on query $x$ by its success probability on a similar query $x^{\prime}$. Let $L_{i}>0$ denote the Lipschitz constant for $SP_{i}$. For any $x,x^{\prime}\in X$, the estimation error is bounded by $|SP_{i}(x^{\prime})-SP_{i}(x)|\leq L_{i}\cdot dist(x,x^{\prime})$, and this bound decreases monotonically as $dist(x,x^{\prime})$ approaches $0$ (we evaluate the nearest neighbour distance and estimation error in Sections A.2 and A.3). In practice, we can pre-compute a labelled sample $S\subset\mathcal{X}$ (e.g., pre-compute classifier outputs on sampled queries from the validation set) and compute $NN_{S}(x)$, the nearest neighbour of $x$ in $S$, for success probability estimation. We show that the estimator is asymptotically unbiased, as sample size increases.
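
The nearest-neighbour estimate described above can be sketched as follows. We assume the labelled sample is stored as a list of feature vectors together with, for each sample point, the pre-computed success probability of every classifier on it. The $l_{\infty}$ metric on raw vectors, the function names, and all values are illustrative assumptions; the source does not specify the representation at this point.

```python
from typing import List, Sequence


def linf(a: Sequence[float], b: Sequence[float]) -> float:
    """l_infinity distance between two equal-length vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def nn_index(x: Sequence[float], sample_points: Sequence[Sequence[float]]) -> int:
    """Index of the nearest neighbour of x in the sample (NN_S(x))."""
    return min(range(len(sample_points)), key=lambda j: linf(x, sample_points[j]))


def estimate_success_probabilities(x: Sequence[float],
                                   sample_points: Sequence[Sequence[float]],
                                   sample_sp: Sequence[List[float]]) -> List[float]:
    """Return SP_i(NN_S(x)) for every classifier i as the estimate of SP_i(x)."""
    return sample_sp[nn_index(x, sample_points)]


# Pre-computed sample: sample_sp[j][i] is SP_i at sample point j (made-up values).
sample_points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
sample_sp = [[0.9, 0.95], [0.3, 0.9], [0.6, 0.8]]

x = (0.9, 1.2)                       # new query
estimate = estimate_success_probabilities(x, sample_points, sample_sp)
print(nn_index(x, sample_points), estimate)
```

For this query the nearest sample point is the second one, so the small classifier is estimated to succeed with probability 0.3 and the large one with probability 0.9, which would favour routing the query to the large classifier if the budget allows.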

Lemma 4.5 (Asymptotically Unbiased Estimator).

Given query $x$, a classifier $f_i$, and uniformly sampled $S \subset \mathcal{X}$,

$$\lim_{s \to \infty} \mathbb{E}[SP_i(NN_S(x))] = SP_i(x) \qquad (3)$$

where $s$ is the sample size and $NN_S(x) := \arg\min_{x' \in S} dist(x, x')$ is the nearest neighbour of $x$ in sample $S$.

In practice, we draw $K$ i.i.d. samples, $S_1, S_2, \cdots, S_K$, and compute the average sample accuracy $\widehat{SP}_i(x) := \frac{1}{K} \sum_{k=1}^{K} SP_i(NN_{S_k}(x))$ as the estimator of the test accuracy on query $x$, for each classifier $f_i$. It follows from Lemma 4.5 that $\widehat{SP}_i$ is also an asymptotically unbiased estimator. We further show below that $\widehat{SP}_i$ is an asymptotically low-variance estimator of $SP_i$, as $K$ increases.
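The averaged nearest-neighbour estimator can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes each labelled sample is stored as a list of `(feature_vector, success_probability)` pairs for one fixed classifier, and the names `linf`, `nearest_neighbour`, and `estimate_sp` are hypothetical.

```python
def linf(a, b):
    """l-infinity distance between two equal-length feature vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def nearest_neighbour(x, sample, dist=linf):
    """Index of the point in `sample` closest to x (ties go to the lowest index)."""
    return min(range(len(sample)), key=lambda j: dist(x, sample[j][0]))


def estimate_sp(x, samples, dist=linf):
    """Average, over the K samples, of the success probability at x's nearest neighbour.

    Each sample is a list of (feature_vector, success_probability) pairs, where
    success_probability is the pre-computed SP_i(x') of one fixed classifier
    (an assumed storage layout, chosen for this sketch).
    """
    total = 0.0
    for sample in samples:
        j = nearest_neighbour(x, sample, dist)
        total += sample[j][1]
    return total / len(samples)


# Toy data: K = 2 samples of size s = 2 in a 2-dimensional feature space.
S1 = [([0.0, 0.0], 0.9), ([1.0, 1.0], 0.2)]
S2 = [([0.1, 0.1], 0.7), ([2.0, 2.0], 0.1)]
estimate = estimate_sp([0.2, 0.0], [S1, S2])  # averages 0.9 and 0.7
```

Because the samples and their classifier outputs are fixed ahead of time, only the nearest-neighbour lookups have to run at query time.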

Lemma 4.6 (Asymptotically Low-Variance Estimator).

Given query $x$, a classifier $f_i$, and $K$ i.i.d. uniformly drawn samples $S_1, S_2, \cdots, S_K$ of size $s$, let $\sigma_i^2$ denote the variance of the estimator $\widehat{SP}_i(x)$. We have that $\sigma_i$ is asymptotically proportional to $\frac{1}{\sqrt{K}}$ (equivalently, $\sigma_i^2$ is asymptotically proportional to $\frac{1}{K}$) as both $s$ and $K$ increase.
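The scaling in Lemma 4.6 can be sanity-checked with a short calculation. The sketch below is not the paper's proof; it assumes, purely for illustration, that the per-sample terms $Y_k := SP_i(NN_{S_k}(x))$ (a name introduced here) are i.i.d. with a finite variance $v_s$ that depends only on the sample size $s$. Under that assumption the variance of the average falls as $1/K$ and its standard deviation as $1/\sqrt{K}$:

```latex
\sigma_i^2
  = \mathrm{Var}\!\left(\frac{1}{K}\sum_{k=1}^{K} Y_k\right)
  = \frac{1}{K^2}\sum_{k=1}^{K}\mathrm{Var}(Y_k)
  = \frac{v_s}{K},
\qquad
\sigma_i = \sqrt{\frac{v_s}{K}} \propto \frac{1}{\sqrt{K}}.
```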

4.2 Computing $\mu^*$ with $Accuracy(\mu)$

In the previous section, we showed how to estimate the accuracy of a given model portfolio. For each classifier $f_i$ and query $x$, we estimate its success probability $SP_i(x)$ by $\widehat{SP}_i(x)$, which is computed from similar queries in labelled samples and can be efficiently pre-computed.

With the estimator in place, we formulate the problem of finding the optimal model portfolio as an integer linear programming (ILP) problem as follows. Given a set of $M$ classifiers $f_1, f_2, \cdots, f_M$, user queries $X = \{x_1, x_2, \cdots, x_N\}$, pre-computed samples $S_1, S_2, \cdots, S_K$, and budget $B \in \mathbb{R}^{+}$, we have the following ILP problem (footnote 6: our problem can be rephrased as “selecting for each query image, one item (i.e., ML classifier) from a collection (the set of all classifiers) so as to maximize the total value (accuracy) while adhering to a predefined weight limit (cost budget)”, which is a classic multiple choice knapsack problem (MCKP) [Kellerer et al., 2004], and the ILP formulation is the natural choice).

$$\begin{split}
\max \quad & \sum_{i=1}^{M} \sum_{j=1}^{N} \widehat{SP}_i(x_j) \cdot t_{i,j} \\
\textrm{s.t.} \quad & \sum_{i=1}^{M} \sum_{j=1}^{N} b_i \cdot t_{i,j} \leq B \\
& \sum_{i=1}^{M} t_{i,j} = 1 \text{ for } j = 1, 2, \cdots, N, \text{ and } t_{i,j} \in \{0, 1\}
\end{split} \qquad (4)$$

where $t_{i,j}$ are boolean variables and $t_{i,j} = 1$ iff the classifier $f_i$ is assigned to query $x_j$. Clearly, the optimal model portfolio $\mu^*$ can be efficiently computed as $\mu^*(x_j) = i$ iff $t^*_{i,j} = 1$, for $i \in [M]$ and $j \in [N]$, where $t^*_{i,j}$ is the optimal solution to the ILP problem above. While ILP problems are NP-hard in general, we can use standard ILP solvers (e.g., HiGHS [Huangfu and Hall, 2018]) to efficiently compute the optimal solution in practice.
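To make the formulation concrete, the following sketch solves a tiny instance of the assignment problem by exhaustive enumeration. It deliberately swaps the ILP solver for brute force, which is only viable for a handful of queries (the search space has $M^N$ assignments); the function name `best_portfolio` and the toy numbers are assumptions made for illustration.

```python
from itertools import product


def best_portfolio(sp_hat, cost, budget):
    """Exhaustively solve the budgeted assignment problem for tiny inputs.

    sp_hat[i][j]: estimated success probability of classifier i on query j.
    cost[i]:      per-query cost b_i of classifier i.
    budget:       total cost budget B.
    Returns (assignment, value), where assignment[j] is the classifier index
    chosen for query j, or (None, float("-inf")) if nothing is feasible.
    """
    m, n = len(sp_hat), len(sp_hat[0])
    best, best_value = None, float("-inf")
    # Each tuple picks exactly one classifier per query, so the constraint
    # sum_i t_{i,j} = 1 holds by construction.
    for assignment in product(range(m), repeat=n):
        total_cost = sum(cost[i] for i in assignment)
        if total_cost > budget + 1e-12:  # budget constraint (with float slack)
            continue
        value = sum(sp_hat[i][j] for j, i in enumerate(assignment))
        if value > best_value:
            best, best_value = assignment, value
    return best, best_value


# Two classifiers (cheap, expensive) and three queries; the budget allows
# at most one query to be routed to the expensive classifier.
sp_hat = [[0.90, 0.50, 0.80],   # cheap classifier
          [0.95, 0.90, 0.85]]   # expensive classifier
cost = [0.15, 1.0]
assignment, value = best_portfolio(sp_hat, cost, budget=1.5)
```

In this toy instance the single expensive call goes to the second query, where the estimated gain over the cheap classifier is largest. A real deployment would hand the same objective and constraints to an ILP solver instead.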

The optimization problem aims to maximize the estimated model portfolio accuracy and is subject to the risk of overestimation due to selection bias, especially on large-scale problems. Intuitively, a poor classifier with high-variance estimates can be mistakenly assigned to some queries if its performance on those queries is overestimated. We address this by regularizing the accuracy estimate for each classifier by the corresponding estimator variance. Specifically, we optimize the objective $\sum_{i=1}^{M} \sum_{j=1}^{N} (\widehat{SP}_i(x_j) - \lambda \cdot \sigma_i) \cdot t_{i,j}$ in Equation 4, where $\sigma_i$ is the standard deviation of the estimator $\widehat{SP}_i$. As $\sigma_i$ is unknown a priori, we use a validation set to estimate $\sigma_i$ for each classifier $f_i$ and tune $\lambda$ for the highest validation accuracy.
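The effect of the variance penalty can be seen on a two-classifier toy case. This is a sketch under assumed numbers; `regularized_scores` is a hypothetical helper, and the values of `sigma` and `lam` are made up for illustration.

```python
def regularized_scores(sp_hat, sigma, lam):
    """Return SP_hat_i(x_j) - lam * sigma_i for every classifier i and query j."""
    return [[p - lam * sigma[i] for p in row] for i, row in enumerate(sp_hat)]


# One query, two classifiers: classifier 1 looks slightly better on paper
# but its estimate is much noisier (larger sigma).
sp_hat = [[0.80], [0.82]]
sigma = [0.01, 0.05]
scores = regularized_scores(sp_hat, sigma, lam=1.0)
# After the penalty, classifier 0 has the higher score on this query.
```

The penalized scores can be passed to the solver in place of the raw estimates, since the constraints of Equation 4 are unchanged.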

5 Evaluation

5.1 Evaluation Setup

Task. We consider the image classification task: given an image, predict a class label from a set of predefined class categories. We assume that each image has a unique ground-truth class label.

Datasets. We consider 4 widely studied datasets for image classification: CIFAR-10 (10 classes) [Krizhevsky et al., 2009], CIFAR-100 (100 classes) [Krizhevsky et al., 2009], Tiny ImageNet (200 classes) [CS231n, ], and ImageNet-1K (1000 classes) [Russakovsky et al., 2015]. Both CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 test images. Tiny ImageNet contains 100,000 training images and 10,000 validation images, and ImageNet-1K has 1,281,167 training images and 50,000 validation images. We use the test splits of CIFAR-10 and CIFAR-100 as well as the validation splits of Tiny ImageNet and ImageNet-1K for evaluation purposes. Details of these datasets are given in the appendix.

Models. We consider a total of 7 classifiers: ResNet-[18, 34, 50, 101] [He et al., 2016] (footnote 7: numbers in brackets indicate the model's number of layers) and SwinV2-[T, S, B] [Liu et al., 2022] (footnote 8: letters in brackets indicate the size of the Swin Transformer V2; T/S/B means tiny/small/base). Among these classifiers, ResNet-18 is the smallest (in terms of number of model parameters and training/inference time) and thus has the least capacity, while SwinV2-B is the largest and generally has the highest accuracy (see Figure 1(a)). All classifiers are pre-trained on the ImageNet-1K dataset [Russakovsky et al., 2015]. On ImageNet-1K we use the pre-trained models directly; on the other datasets we freeze all layers except the last, which we train from scratch. The output dimension of the last layer is set to the number of image classes in the test dataset. We implement the soft classifier (see Section 4.1) by sampling w.r.t. the probability distribution output by the softmax layer, i.e., $\Pr[f_i(x) = j] = \frac{\exp(z_j / \tau)}{\sum_k \exp(z_k / \tau)}$, where $z_k$ is the logit for $k \in [C]$ and $\tau$ is the temperature hyper-parameter controlling the randomness of predictions. We choose a small $\tau$ (1e-3) to reduce the variance in predictions. At test time, to obtain consistent results, all classifiers make predictions by outputting the most likely class labels (i.e., $\arg\max_j f_i(x)[j]$), which is equivalent to having soft classifiers with $\tau \to 0$. Model training details are in Section C.2. All experiments are conducted with one NVIDIA V100 GPU with 32GB of GPU RAM. Code will be released upon acceptance.
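The temperature-scaled soft classifier can be illustrated as follows. This is a sketch, not the authors' code: the function names and example logits are assumptions, and the max-subtraction step is a standard numerical-stability trick added here so that very small temperatures do not overflow.

```python
import math
import random


def softmax_with_temperature(logits, tau):
    """Numerically stable softmax of logits / tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max so the largest exponent is 0
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_label(logits, tau, rng=random):
    """Draw a class label from the temperature-scaled softmax distribution."""
    probs = softmax_with_temperature(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]
soft = softmax_with_temperature(logits, tau=1.0)    # spread over all classes
sharp = softmax_with_temperature(logits, tau=1e-3)  # essentially one-hot
label = sample_label(logits, tau=1e-3)
```

With a temperature as small as 1e-3, almost all probability mass sits on the class with the largest logit, so sampling behaves like taking the arg max.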

Inference Cost. The absolute costs of running a model may be expressed using a variety of metrics, including FLOPs, latency, dollars, etc. While FLOPs is an important metric that has the advantage of being hardware independent, it has been found to not correlate well with wall-clock latency, energy consumption, and dollar costs, which are of more practical interest to end users [Dao et al., 2022]. In practice, dollar costs usually correlate highly with inference latency on GPUs. In our work, we define the cost of model inference in USD. We approximate the inference cost of computation by taking the cost per hour ($3.06) of the Azure Machine Learning (AML) NC6s v3 instance [AzureML, 2024], as summarized in Table 1. The AML NC6s v3 instance contains a single V100 GPU and is commonly used for deep learning. Since CPU resources are significantly cheaper than GPU resources (e.g., the D2s v3 instance, equipped with two CPUs and no GPU, costs $0.096 per hour [AzureML, 2024]) and all methods studied in this work typically finish in several CPU-seconds, incurring negligible expenses, we ignore the costs incurred by CPU in our comparison. In addition, since larger models typically have higher accuracy as well as higher costs (see Figure 1(a)), a practically interesting setting is to study how to deliver high quality answers with reduced costs in comparison to solely using the largest model (e.g., SwinV2-B). Normalized cost directly indicates the percentage cost saved and has been widely adopted in previous works [Ding et al., 2024, Kag et al., 2022], following which we report all results in terms of the normalized cost of each classifier.

Models Latency (s) Prices ($) Normalized Cost
ResNet-18 88.9 0.076 0.15
ResNet-34 135.9 0.116 0.22
ResNet-50 174.5 0.148 0.29
ResNet-101 317.4 0.270 0.52
SwinV2-T 326.4 0.277 0.53
SwinV2-S 600.7 0.511 0.98
SwinV2-B 610.6 0.519 1
Table 1: Model costs on the image classification task. Latency and prices are measured for 10,000 queries. Normalized cost is the fraction of the price w.r.t. SwinV2-B.
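The prices in Table 1 are consistent with multiplying each model's latency by the hourly instance price. The sketch below reproduces that arithmetic; it reflects our reading of how the table was derived, and the helper names are hypothetical.

```python
HOURLY_PRICE_USD = 3.06  # NC6s v3 price per hour quoted in the text

# Latency in seconds for 10,000 queries, copied from Table 1.
LATENCY_S = {
    "ResNet-18": 88.9, "ResNet-34": 135.9, "ResNet-50": 174.5,
    "ResNet-101": 317.4, "SwinV2-T": 326.4, "SwinV2-S": 600.7,
    "SwinV2-B": 610.6,
}


def price_usd(latency_s, hourly_price=HOURLY_PRICE_USD):
    """Dollar cost of occupying the instance for latency_s seconds."""
    return latency_s / 3600.0 * hourly_price


def normalized_costs(latency_s, reference="SwinV2-B"):
    """Price of each model as a fraction of the reference model's price."""
    ref = price_usd(latency_s[reference])
    return {name: price_usd(t) / ref for name, t in latency_s.items()}


prices = {name: round(price_usd(t), 3) for name, t in LATENCY_S.items()}
normalized = {name: round(c, 2) for name, c in normalized_costs(LATENCY_S).items()}
```

Since every model is billed at the same hourly rate, the normalized cost reduces to a ratio of latencies.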

ILP Solver. While our approach is agnostic to the choice of the ILP solver, we choose the high-performance ILP solver, HiGHS [Huangfu and Hall, 2018] to solve the problem in Equation 4, given its well-demonstrated efficiency and effectiveness on public benchmarks [Gleixner et al., 2021]. In a nutshell, HiGHS solves ILP problems with branch-and-cut algorithms [Fischetti and Monaci, 2020] and stops whenever the gap between the current solution and the global optimum is small enough (e.g., 1e-6).

Baselines. We compare our approach with three baselines: single best, random, and FrugalMCT [Chen et al., 2022]. Single best always chooses the strongest (i.e., most expensive) model for a given cost budget. Random estimates classifier accuracy with random guesses (i.e., uniform samples from [0,1]01[0,1][ 0 , 1 ]) and solves the problem in Equation 4 with the same ILP solver as ours. FrugalMCT [Chen et al., 2022] is a recent work which selects the best ML models for given user budgets in an online setting, using model-based accuracy estimation. Following the same setting in [Chen et al., 2022], we train random forest regressors on top of the model-extracted features (e.g., ResNet-18 features), as the accuracy predictor. The predicted accuracy is used in Equation 4, which is solved by the same ILP solver as ours.

Our Method. We evaluate OCCAM (see Section 4) under various metrics (i.e., $l_1$, $l_2$, and $l_\infty$ norms) and cost budgets. We consider images represented by model-based embeddings. Specifically, we extract the image feature (footnote 9: the image feature is the last-layer output of an ML model (e.g., ResNet-18) trained on the target dataset, given an input image) of the query image and all the validation images. The costs incurred by feature extraction are deducted from the user budget $B$ before we compute the optimal model portfolio. We report the test accuracy under different cost budgets for OCCAM and all baselines in Section 5.2 (Figure 3 and Table 2), validate that OCCAM is cost-aware and indeed selects the most profitable ML models to deliver high accuracy solutions in Section 5.3 (Figure 4(a)), demonstrate the effectiveness of OCCAM with limited samples in Section 5.4 (Figure 4(b)), investigate the nearest neighbour distance with different sample sizes in Section A.2, show that the estimation error of our accuracy estimator quickly decreases as the sample size increases in Section A.3, test the generalizability of OCCAM with different feature extractors in Section A.4, and provide more performance results under different metrics in Section A.5.

For simplicity, unless otherwise stated, we report OCCAM performance using ResNet-18 features and the $l_\infty$ metric with $K = 40$ for all datasets ($s = 500$ for CIFAR10 and CIFAR100, and $s = 1000$ for Tiny ImageNet and ImageNet-1K). We choose $\lambda = 100$ for ImageNet-1K and $\lambda = 5$ for all other datasets because ImageNet-1K contains a high variety of image classes (1000 classes) that leads to relatively high estimation errors and requires more regularization penalty via large $\lambda$ values.
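The choice of metric matters because the nearest neighbour of a query can change with the norm. The sketch below, with hypothetical helper names and toy vectors, computes the $l_1$, $l_2$, and $l_\infty$ distances and shows a case where they disagree.

```python
def minkowski(a, b, p):
    """l_p distance between equal-length vectors; p=float('inf') gives l_inf."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def nearest_index(x, points, p):
    """Index of the point closest to x under the l_p metric."""
    return min(range(len(points)), key=lambda j: minkowski(x, points[j], p))


query = [0.0, 0.0]
points = [[3.0, 3.0], [4.0, 0.0]]
nn_l1 = nearest_index(query, points, 1)               # distances 6 vs 4
nn_l2 = nearest_index(query, points, 2)               # distances ~4.24 vs 4
nn_linf = nearest_index(query, points, float("inf"))  # distances 3 vs 4
```

Here the $l_1$ and $l_2$ metrics pick the second point while $l_\infty$ picks the first, which is why the metric is treated as a configurable choice in the evaluation.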

5.2 Performance Results

Accuracy Drop (%)

| Cost Reduction (%) | CIFAR10: Single Best | CIFAR10: Rand | CIFAR10: FrugalMCT | CIFAR10: OCCAM | CIFAR100: Single Best | CIFAR100: Rand | CIFAR100: FrugalMCT | CIFAR100: OCCAM |
|---|---|---|---|---|---|---|---|---|
| 10 | 2.22 | 2.86 | 0.97 | 0.56 | 3.18 | 3.29 | 0.52 | 0.34 |
| 20 | 2.22 | 2.86 | 1.13 | 0.50 | 3.18 | 3.29 | 0.79 | 0.36 |
| 40 | 2.22 | 2.86 | 1.22 | 0.51 | 3.18 | 3.29 | 1.98 | 0.62 |

| Cost Reduction (%) | Tiny-ImageNet-200: Single Best | Tiny-ImageNet-200: Rand | Tiny-ImageNet-200: FrugalMCT | Tiny-ImageNet-200: OCCAM | ImageNet-1K: Single Best | ImageNet-1K: Rand | ImageNet-1K: FrugalMCT | ImageNet-1K: OCCAM |
|---|---|---|---|---|---|---|---|---|
| 10 | 4.01 | 7.03 | 0.86 | 0.17 | 2.53 | 5.98 | 0.59 | 0.51 |
| 20 | 4.01 | 7.03 | 1.49 | 0.61 | 2.53 | 5.98 | 1.12 | 1.05 |
| 40 | 4.01 | 7.03 | 3.88 | 2.75 | 2.53 | 5.98 | 2.35 | 2.24 |

Table 2: Cost reduction vs. accuracy drop by OCCAM and baselines. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries. For example, on Tiny ImageNet, using SwinV2-B to classify all 10,000 test images achieves an accuracy of 82.5% and incurs a total cost of $0.519 (we take it as the normalized cost 1, see Table 1). A 10% cost reduction equals a cost budget of $0.467 (i.e., a normalized cost of 0.9), under which we evaluate the achieved accuracy of OCCAM and all baselines and report the relative accuracy drops.
Figure 3: Accuracy-cost tradeoffs achieved by OCCAM and baselines, for different cost budgets.

We investigate the test accuracy achieved by OCCAM and all baselines under different cost budgets and depict the results in Figure 3. OCCAM achieves significant cost savings with little to no accuracy drop and outperforms all baselines across a majority of experiment settings. Results on cost reduction vs. accuracy drop for all approaches are summarized in Table 2, where both quantities are computed w.r.t. using the strongest model (i.e., SwinV2-B) for all queries. On the easy classification task (CIFAR-10, 10 classes), OCCAM consistently outperforms all baselines, achieving 40% cost reduction with at most 0.56% accuracy drop. On the moderate classification task (CIFAR-100, 100 classes), OCCAM outperforms all baselines by trading at most 0.62% accuracy drop for 40% cost reduction. On the hard classification task (Tiny ImageNet, 200 classes), OCCAM significantly outperforms all three baselines with at least 0.5% higher accuracy. Notably, in aggressive cost regimes (e.g., 40% cost reduction), the accuracy achieved by OCCAM is 1.1% higher than FrugalMCT, 4.3% higher than random, and 1.3% higher than single best. On the most challenging classification task (ImageNet-1K, 1000 classes), OCCAM still consistently outperforms all three baselines with higher accuracy at all cost budget levels. We believe that the above results demonstrate the general effectiveness of OCCAM in achieving non-trivial cost reduction for a small accuracy drop on classification tasks of different difficulty levels.

5.3 Validation Results

Figure 4: (a) Model usage breakdown by OCCAM under different cost budgets, (b) OCCAM accuracy with different sample sizes.

We validate that OCCAM is functioning as intended, that is, it selects small-yet-profitable classifiers when budgets are limited and gradually switches to large-but-accurate classifiers as cost budgets increase. In Figure 4(a) we plot the model usage for each classifier under different cost budgets on the Tiny ImageNet dataset. The figure shows that when cost budgets are restricted, OCCAM mainly chooses ResNet-18 to resolve queries given its low price and good accuracy (as seen in Table 1 and Figure 1(a)). As budgets increase, OCCAM gradually switches to SwinV2-S and SwinV2-B, given their predominantly high accuracy (82% as seen in Figure 1(a)).

5.4 Stability Analysis

OCCAM pre-computes $K$ labelled samples of size $s$ to estimate the test accuracy at inference time. We investigate OCCAM performance with different total sample sizes ($K \cdot s$) by setting $s = 1000$ and varying $K$ from $10$ to $40$ (see Figure 4(b)). We report results on the Tiny ImageNet dataset. In Figure 4(b), we plot the accuracy achieved by OCCAM under different total sample sizes ($K \cdot s$) and normalized cost budgets ($B$). We also report FrugalMCT performance using a maximum of 40,000 sampled images to train its accuracy predictor. With budget $B = 0.8$ (i.e., 20% cost reduction), OCCAM achieves comparable performance to FrugalMCT with 25% of the samples and outperforms FrugalMCT as the total sample size increases. With budget $B = 0.6$ (i.e., 40% cost reduction), OCCAM achieves 0.7% higher accuracy than FrugalMCT with only 25% of the samples and up to 1.3% higher accuracy as the total sample size increases, which demonstrates the sustained effectiveness of OCCAM even with limited samples.

6 Discussion and Conclusion

Motivated by the need to optimize the classifier assignment to different image classification queries with pre-defined cost budgets, we have formulated the optimal model portfolio problem and proposed a principled approach, Optimization with Cost Constraints for Accuracy Maximization (OCCAM), to effectively deliver high accuracy solutions. We present an unbiased and low-variance estimator for classifier test accuracy with asymptotic guarantees, and compute an optimal classifier assignment with novel regularization techniques mitigating overestimation risks. Our experimental results on a variety of real-world datasets show that we can achieve up to 40% cost reduction with no significant drop in classification accuracy.

While we mainly demonstrate the effectiveness of OCCAM on the image classification task, we argue that OCCAM is a generic approach to solve a wide range of classification problems carried out by various ML classifiers. We identify the following possible extensions: (i) Extension to other classification tasks. At the heart of our approach is the requirement that the classification task is well separated (see Section 4.1), meaning intuitively that instances (e.g., images) of the problem should not sharply change their class labels under minor modification. A wide range of classification problems (e.g., sentiment analysis in NLP) appear to naturally satisfy this precondition. The challenge is how to choose the most suitable numeric representation so that the separation property is preserved. Recent advances in representation learning, such as contrastive learning, are likely to help. (ii) Extension to other ML classifiers/services. In addition to open-source models, it would be interesting to apply OCCAM to online classification APIs (e.g., Google Prediction API) and see to what extent it can boost accuracy with cost savings in production settings. We will explore these extensions in our future work.

References

  • Alghofaili et al. [2020] Y. Alghofaili, A. Albattah, and M. A. Rassam. A financial fraud detection model based on lstm deep learning technique. Journal of Applied Security Research, 15(4):498–516, 2020.
  • Antol et al. [2015] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • AzureML [2024] AzureML. Azure machine learning pricing, Feb. 2024. URL https://azure.microsoft.com/en-ca/pricing/details/machine-learning/.
  • Bartlett et al. [2017] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017.
  • Chang et al. [2024] J. Chang, X. Chen, and M. Wu. Central limit theorems for high dimensional dependent data. Bernoulli, 30(1):712–742, 2024.
  • Chen et al. [2020] L. Chen, M. Zaharia, and J. Y. Zou. Frugalml: How to use ml prediction apis more accurately and cheaply. Advances in neural information processing systems, 33:10685–10696, 2020.
  • Chen et al. [2022] L. Chen, M. Zaharia, and J. Zou. Efficient online ml api selection for multi-label classification tasks. In International Conference on Machine Learning, pages 3716–3746. PMLR, 2022.
  • [8] S. CS231n. Tiny imagenet dataset. URL http://cs231n.stanford.edu/tiny-imagenet-200.zip.
  • Dalal and Triggs [2005] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), volume 1, pages 886–893. Ieee, 2005.
  • Dao et al. [2022] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  • Ding et al. [2022] D. Ding, S. Amer-Yahia, and L. Lakshmanan. On efficient approximate queries over machine learning models. Proceedings of the VLDB Endowment, 16(4):918–931, 2022.
  • Ding et al. [2024] D. Ding, A. Mallick, C. Wang, R. Sim, S. Mukherjee, V. Rühle, L. V. S. Lakshmanan, and A. H. Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=02f3mUtqnM.
  • Dosovitskiy et al. [2020] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Elsken et al. [2019] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997–2017, 2019.
  • Eriksson et al. [2004] K. Eriksson, D. Estep, and C. Johnson. Lipschitz continuity. Applied Mathematics: Body and Soul: Volume 1: Derivatives and Geometry in IR 3, pages 149–164, 2004.
  • Fischetti and Monaci [2020] M. Fischetti and M. Monaci. A branch-and-cut algorithm for mixed-integer bilinear programming. European Journal of Operational Research, 282(2):506–514, 2020.
  • Gleixner et al. [2021] A. Gleixner, G. Hendel, G. Gamrath, T. Achterberg, M. Bastubbe, T. Berthold, P. M. Christophel, K. Jarck, T. Koch, J. Linderoth, M. Lübbecke, H. D. Mittelmann, D. Ozyurt, T. K. Ralphs, D. Salvagnin, and Y. Shinano. MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library. Mathematical Programming Computation, 2021. doi: 10.1007/s12532-020-00194-3. URL https://doi.org/10.1007/s12532-020-00194-3.
  • Gouk et al. [2021] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393–416, 2021.
  • Hassibi et al. [1993] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hinton et al. [2015] G. Hinton, O. Vinyals, J. Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
  • Howard et al. [2017] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Huangfu and Hall [2018] Q. Huangfu and J. J. Hall. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 10(1):119–142, 2018.
  • Jacob et al. [2018] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
  • Kag et al. [2022] A. Kag, I. Fedorov, A. Gangrade, P. Whatmough, and V. Saligrama. Efficient edge inference by selective query. In The Eleventh International Conference on Learning Representations, 2022.
  • Kellerer et al. [2004] H. Kellerer, U. Pferschy, and D. Pisinger. The multiple-choice knapsack problem. Knapsack Problems, pages 317–347, 2004.
  • Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.longhoe.net/abs/1412.6980.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • LeCun et al. [1989] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Liu et al. [2021] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Liu et al. [2022] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
  • Lowe [2004] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
  • Miotto et al. [2018] R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236–1246, 2018.
  • Qi et al. [2023] X. Qi, J. Wang, Y. Chen, Y. Shi, and L. Zhang. Lipsformer: Introducing lipschitz continuity to vision transformers. arXiv preprint arXiv:2304.09856, 2023.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Su et al. [2018] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European conference on computer vision (ECCV), pages 631–648, 2018.
  • Szegedy et al. [2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • Tang et al. [2021] J. Tang, S. Li, and P. Liu. A review of lane detection methods based on deep learning. Pattern Recognition, 111:107623, 2021.
  • Urban et al. [2016] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson. Do deep convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691, 2016.
  • Vanhoucke et al. [2011] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. 2011.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vinyals et al. [2015] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
  • Wang et al. [2021] R. Wang, M. B. Alazzam, F. Alassery, A. Almulihi, and M. White. Innovative research of trajectory prediction algorithm based on deep learning in car network collision detection and early warning system. Mobile information systems, 2021:1–8, 2021.
  • Yang et al. [2020] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, and K. Chaudhuri. A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33:8588–8601, 2020.
  • Zhu et al. [2020] W. Zhu, L. Xie, J. Han, and X. Guo. The application of deep learning in cancer prognosis prediction. Cancers, 12(3):603, 2020.
  • Zoph and Le [2016] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Appendix A Additional Experiments

A.1 Real Image Datasets Are Well Separated.

Yang et al. [2020] have shown that many real image classification tasks comprise separated classes in RGB-valued space. In this section, we provide further empirical evidence that real image datasets (e.g., Tiny ImageNet) are well separated (see Definition 4.1) in different feature spaces under various metrics (Figures 5 and 6).

In Figure 5, we provide an intuitive example to illustrate that images from different classes (e.g., “goldfish” and “bullfrog”) are typically well separated by a non-zero distance. In Figure 6, we investigate the distance distribution for images of different classes from Tiny ImageNet (200 classes). We observe that images of different classes are typically far from each other by a non-zero distance under different metrics (e.g., $l_1$, $l_2$, and $l_\infty$) in different feature spaces (e.g., image features extracted by ResNet-18, ResNet-50, and SwinV2-T).

Figure 5: Intuitive example of well separated images from Tiny ImageNet under the $l_\infty$ distance metric using ResNet-18 features. Images from different classes (e.g., “goldfish” and “bullfrog”) are typically well separated by a non-zero distance.

Figure 6: Distance distribution between images of different classes from Tiny ImageNet. We consider image representations derived by different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) as well as different metrics ($l_1$, $l_2$, and $l_\infty$). Images of different classes are typically far from each other by non-zero distances.

In addition, we note that real image datasets are subject to little to no label noise. For example, on Tiny ImageNet, we investigate 40,000 images from the training split and find only 4 duplicate images with different class labels. We also consider more standard image datasets (see Section 5.1). It turns out that CIFAR-10 contains no label noise, CIFAR-100 contains 3 duplicate images with different class labels (out of 20,000 images), and the noise frequency on ImageNet-1K is 8 out of 40,000 images. Our observation suggests that standard image datasets are quite clean (aligned with the observation in [Yang et al., 2020]), which justifies the adoption of the well-separation assumption.
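
One way to run such a label-noise check is sketched below. This is a minimal illustration rather than the procedure used in the paper: it assumes that "duplicate" means byte-identical pixel content and that every image participating in a label conflict is counted once. The function name `count_conflicting_duplicates` and the toy data are hypothetical.

```python
from collections import defaultdict


def count_conflicting_duplicates(images, labels):
    """Count images whose exact pixel content also appears under another label.

    `images` is a sequence of bytes-like objects (raw pixel buffers) and
    `labels` is the parallel sequence of class labels. Both the exact-match
    criterion and the counting convention are assumptions of this sketch.
    """
    labels_by_image = defaultdict(set)
    for image, label in zip(images, labels):
        labels_by_image[bytes(image)].add(label)
    # An image is "noisy" if its content is associated with more than one label.
    return sum(1 for image in images if len(labels_by_image[bytes(image)]) > 1)


# Toy data: the first two images are identical but labelled differently,
# the last two are identical and labelled consistently.
toy_images = [b"\x00\x01", b"\x00\x01", b"\x02\x03", b"\x02\x03"]
toy_labels = ["goldfish", "bullfrog", "goldfish", "goldfish"]
noisy = count_conflicting_duplicates(toy_images, toy_labels)
```

On the toy data only the first pair is flagged, since the second pair carries a consistent label.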

A.2 Nearest Neighbour Distance Approaches 0 As Sample Size Increases.

In this section, we conduct experiments to investigate how the nearest neighbour distance ($dist(x, NN_S(x))$) changes as the sample size ($s$) increases. We report results using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) as well as different metrics ($l_1$, $l_2$, and $l_\infty$) on the validation split of the Tiny ImageNet dataset (Figure 7).

Figure 7: Nearest neighbour distance quickly approaches 0 as the sample size increases using different image feature extractors (ResNet-18, ResNet-50, and SwinV2-T) and metrics ($l_1$, $l_2$, and $l_\infty$).

It can be clearly seen in Figure 7 that the distance to the sampled nearest neighbour quickly approaches 0 as sample size increases. This could be attributable to the fact that we are sampling from real images. With properly pre-trained feature extractors, the possible image embeddings could be restricted to a subspace rather than pervade the whole high-dimensional space, which can significantly reduce the required number of samples and give us meaningfully small distances to the sampled nearest neighbours.
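
The qualitative trend can be reproduced on synthetic data. The sketch below is only illustrative: it assumes 4-dimensional features drawn uniformly from $[0, 1]$ instead of real image embeddings, uses the $l_\infty$ metric, and all names (`mean_nn_distance`, `pool`, `queries`) are hypothetical.

```python
import random


def linf_distance(a, b):
    """l_infinity distance between two equal-length feature vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def nearest_neighbour_distance(query, sample):
    """Distance from `query` to its nearest neighbour in `sample`."""
    return min(linf_distance(query, point) for point in sample)


def mean_nn_distance(pool, queries, sample_size, rng, repeats=20):
    """Average nearest-neighbour distance over queries and repeated samples."""
    total = 0.0
    for _ in range(repeats):
        sample = rng.sample(pool, sample_size)
        per_query = [nearest_neighbour_distance(q, sample) for q in queries]
        total += sum(per_query) / len(per_query)
    return total / repeats


rng = random.Random(0)
# Synthetic stand-in for normalized image features (an assumption of this sketch).
pool = [tuple(rng.random() for _ in range(4)) for _ in range(2000)]
queries = [tuple(rng.random() for _ in range(4)) for _ in range(50)]

small_sample = mean_nn_distance(pool, queries, 10, rng)
large_sample = mean_nn_distance(pool, queries, 1000, rng)
```

With these settings the average distance for samples of size 1000 is smaller than for samples of size 10, mirroring the direction of the trend in Figure 7. The rate of decay on real embeddings depends on the feature extractor and is not captured by this toy setup.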

Another interesting observation is that, in all investigated feature spaces, $l_\infty$ always provides the smallest nearest neighbour distance across sample sizes, followed by $l_2$ and $l_1$. Such distinction mainly results from the fact that we use normalized image features where each dimension of the feature vector $x$ is between 0 and 1, that is, $0 \leq x[i] \leq 1$ for any $x[i] \in x$. Consequently, we have the inequality $l_\infty(x) = \max\{|x[i]| \mid x[i] \in x\} \leq l_2(x) = \sqrt{\sum_{x[i] \in x} |x[i]|^2} \leq l_1(x) = \sum_{x[i] \in x} |x[i]|$. Recall that OCCAM employs a classifier accuracy estimator that is asymptotically unbiased as the nearest neighbour distance approaches 0. The above observation suggests that $l_\infty$ is likely to provide a smaller nearest neighbour distance and reduce the estimation error, which leads to higher overall performance, especially in scenarios where sampling is expensive or labelled data is scarce.
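
The ordering of the three norms can be checked numerically. The following sketch applies them to the difference of two hypothetical normalized feature vectors; the vectors are made up for illustration and are not taken from any dataset.

```python
import math


def l1(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def l2(x):
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


def linf(x):
    """Largest absolute value."""
    return max(abs(v) for v in x)


# Two hypothetical feature vectors with every coordinate in [0, 1].
features_a = [0.9, 0.2, 0.5]
features_b = [0.6, 0.3, 0.3]
difference = [a - b for a, b in zip(features_a, features_b)]

ordered = linf(difference) <= l2(difference) <= l1(difference)
```

Here `ordered` evaluates to `True`, so the nearest neighbour distance measured with $l_\infty$ can never exceed the one measured with $l_2$ or $l_1$ on the same pair of vectors.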

A.3 Estimation Error Decreases As Sample Size Increases.

In this section, we investigate the estimation error (the difference between the real classifier accuracy and our estimator's result) for different ML classifiers, using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T). For brevity, on Tiny ImageNet, we report the estimation error in the accuracy of all 7 classifiers (ResNet-[18, 34, 50, 101] and SwinV2-[T, S, B]) under the $l_\infty$ metric (Figure 8). The patterns are similar for other metrics and feature extractors.

Figure 8: Estimation error for each ML classifier quickly decreases as the sample size increases using different image feature extractors (ResNet-18, ResNet-50, and SwinV2-T) under the $l_\infty$ metric.

It is clear from Figure 8 that the estimation error of our accuracy estimator continues to decrease for all ML classifiers as the sample size increases, which demonstrates the effectiveness of our accuracy estimator design (see Section 4.1).

Accuracy Drop (%) on Tiny-ImageNet-200

| Cost Reduction (%) | Single Best | Rand | FrugalMCT (ResNet-18) | FrugalMCT (ResNet-50) | FrugalMCT (SwinV2-T) | OCCAM (ResNet-18) | OCCAM (ResNet-50) | OCCAM (SwinV2-T) |
|---|---|---|---|---|---|---|---|---|
| 10 | 4.01 | 7.03 | 0.86 | 0.84 | 1.18 | 0.48 | 0.40 | 0.29 |
| 20 | 4.01 | 7.03 | 1.49 | 1.45 | 1.60 | 1.02 | 0.74 | 0.58 |
| 40 | 4.01 | 7.03 | 3.88 | 4.12 | 3.22 | 3.24 | 2.56 | 2.81 |

Table 3: Cost reduction vs. accuracy drop by baselines and OCCAM using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) and the $l_\infty$ distance metric. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

A.4 Generalizing to Different Feature Extractors

We further report the performance of OCCAM with different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) on Tiny ImageNet. As described in Section 5.1, the costs incurred by feature extraction are “deducted from the user budget before we compute the optimal model portfolio”. Results are summarized in Table 3. OCCAM outperforms all baselines in all experimental settings, which demonstrates the effectiveness and generalizability of OCCAM with different feature extractors.

A.5 More OCCAM Performance Results.

In this section, we provide more OCCAM performance results using $l_1$ and $l_2$ norm metrics, as shown in Figures 9 and 10. Qualitative comparison results are summarized in Tables 4 and 5, which resemble our analysis in Section 5.2. Typically, by trading little to no performance drop, OCCAM can achieve significant cost reduction and outperform all baselines across a majority of experiment settings.

However, we also note that FrugalMCT can sometimes outperform OCCAM on ImageNet-1K using the $l_1$ and $l_2$ metrics, while OCCAM outperforms FrugalMCT across all experiment settings using the $l_\infty$ metric (see Section 5.2). This could be explained by the fact that the $l_1$ and $l_2$ metrics are likely to yield a higher nearest neighbour distance than the $l_\infty$ metric (see Section A.2), which implicitly increases the OCCAM estimator error and leads to reduced overall performance, especially when the classification task is challenging and labelled data is scarce. Given this, in practice, we recommend applying OCCAM with $l_\infty$ to achieve significant cost reduction with little to no performance drop (see Section 5.2).

Accuracy Drop (%)

| Cost Reduction (%) | CIFAR10 Single Best | CIFAR10 Rand | CIFAR10 FrugalMCT | CIFAR10 OCCAM | CIFAR100 Single Best | CIFAR100 Rand | CIFAR100 FrugalMCT | CIFAR100 OCCAM |
|---|---|---|---|---|---|---|---|---|
| 10 | 2.22 | 2.86 | 0.97 | 0.38 | 3.18 | 3.29 | 0.52 | 0.50 |
| 20 | 2.22 | 2.86 | 1.13 | 0.38 | 3.18 | 3.29 | 0.79 | 0.50 |
| 40 | 2.22 | 2.86 | 1.22 | 0.37 | 3.18 | 3.29 | 1.98 | 0.99 |

| Cost Reduction (%) | Tiny-ImageNet-200 Single Best | Tiny-ImageNet-200 Rand | Tiny-ImageNet-200 FrugalMCT | Tiny-ImageNet-200 OCCAM | ImageNet-1K Single Best | ImageNet-1K Rand | ImageNet-1K FrugalMCT | ImageNet-1K OCCAM |
|---|---|---|---|---|---|---|---|---|
| 10 | 4.01 | 7.03 | 0.86 | 0.48 | 2.53 | 5.98 | 0.59 | 0.86 |
| 20 | 4.01 | 7.03 | 1.49 | 1.02 | 2.53 | 5.98 | 1.12 | 1.51 |
| 40 | 4.01 | 7.03 | 3.88 | 3.24 | 2.53 | 5.98 | 2.35 | 3.32 |

Table 4: Cost reduction vs. accuracy drop by OCCAM and baselines using ResNet-18 features and the $l_1$ distance metric. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

Figure 9: Accuracy-cost tradeoffs achieved by OCCAM and baselines using the $l_1$ metric and ResNet-18 features, for different cost budgets.

Accuracy Drop (%)

| Cost Reduction (%) | CIFAR10 Single Best | CIFAR10 Rand | CIFAR10 FrugalMCT | CIFAR10 OCCAM | CIFAR100 Single Best | CIFAR100 Rand | CIFAR100 FrugalMCT | CIFAR100 OCCAM |
|---|---|---|---|---|---|---|---|---|
| 10 | 2.22 | 2.86 | 0.97 | 0.24 | 3.18 | 3.29 | 0.52 | 0.34 |
| 20 | 2.22 | 2.86 | 1.13 | 0.25 | 3.18 | 3.29 | 0.79 | 0.40 |
| 40 | 2.22 | 2.86 | 1.22 | 0.27 | 3.18 | 3.29 | 1.98 | 0.71 |

| Cost Reduction (%) | Tiny-ImageNet-200 Single Best | Tiny-ImageNet-200 Rand | Tiny-ImageNet-200 FrugalMCT | Tiny-ImageNet-200 OCCAM | ImageNet-1K Single Best | ImageNet-1K Rand | ImageNet-1K FrugalMCT | ImageNet-1K OCCAM |
|---|---|---|---|---|---|---|---|---|
| 10 | 4.01 | 7.03 | 0.86 | 0.21 | 2.53 | 5.98 | 0.59 | 1.06 |
| 20 | 4.01 | 7.03 | 1.49 | 0.81 | 2.53 | 5.98 | 1.12 | 1.65 |
| 40 | 4.01 | 7.03 | 3.88 | 2.75 | 2.53 | 5.98 | 2.35 | 3.10 |

Table 5: Cost reduction vs. accuracy drop by OCCAM and baselines using ResNet-18 features and the $l_2$ distance metric. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

Figure 10: Accuracy-cost tradeoffs achieved by OCCAM and baselines using the $l_2$ metric and ResNet-18 features, for different cost budgets.

Appendix B Proofs

In this section, we provide proofs to Lemmas 4.3, 4.4, 4.5 and 4.6.

Proof to Lemma 4.3

Proof.

The proof is straightforward. Without loss of generality, we consider the $l_1$ metric and assume $(\mathcal{X}, l_1)$ is an $r$-separated metric space. For brevity, we abuse the notation and let $O(x)$ denote the one-hot output distribution over all labels. For any $x, x' \in \mathcal{X}$, if $x$ and $x'$ belong to the same class, then $\|O(x) - O(x')\|_1 = 0 \leq \frac{2}{r} \cdot \|x - x'\|_1$; otherwise, $\|O(x) - O(x')\|_1 = 2 \leq \frac{2}{r} \cdot \|x - x'\|_1$, where the inequality requires $\|x - x'\|_1 \geq r$ for images of different classes, as provided by the $r$-separation assumption. The Lipschitz constant for $O$ is $\frac{2}{r}$. ∎
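
The bound in this proof can be sanity-checked on a toy dataset. The sketch below is illustrative only: the four labelled points are made up, and it assumes that $r$ can be taken as the smallest $l_1$ distance between points of different classes, which may differ in detail from Definition 4.1.

```python
from itertools import combinations


def l1_distance(a, b):
    """l_1 distance between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def one_hot(label, num_classes):
    """One-hot output distribution O(x) for a ground-truth label."""
    return tuple(1.0 if k == label else 0.0 for k in range(num_classes))


# Hypothetical (feature, label) pairs forming two well separated classes.
points = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((1.0, 1.0), 1), ((0.9, 1.2), 1)]
num_classes = 2
pairs = list(combinations(points, 2))

# Assumed separation: smallest distance between points of different classes.
r = min(l1_distance(a, b) for (a, la), (b, lb) in pairs if la != lb)
lipschitz_constant = 2.0 / r

# Check ||O(x) - O(x')||_1 <= (2 / r) * ||x - x'||_1 for every pair,
# with a small tolerance for floating-point rounding.
bound_holds = all(
    l1_distance(one_hot(la, num_classes), one_hot(lb, num_classes))
    <= lipschitz_constant * l1_distance(a, b) + 1e-12
    for (a, la), (b, lb) in pairs
)
```

Same-class pairs satisfy the bound trivially because their one-hot outputs coincide, and the closest pair of different-class points is where the bound is tight.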

Proof to Lemma 4.4

Proof.

Similarly, without loss of generality, we consider the $l_1$ metric and let $f_i(x)$, $O(x)$ denote the output distribution over all labels. Let $L_i$ and $L_O$ denote the Lipschitz constants for $f_i(x)$ and $O(x)$ respectively. For any $x, x' \in \mathcal{X}$, if $x$ and $x'$ belong to the same class, then $\|SP_i(x) - SP_i(x')\|_1 = |f_i(x)[O(x)] - f_i(x')[O(x)]| \leq \|f_i(x) - f_i(x')\|_1 \leq L_i \cdot \|x - x'\|_1$; otherwise, $\|SP_i(x) - SP_i(x')\|_1 = |f_i(x)[O(x)] - f_i(x')[O(x')]| \leq 1 = \frac{1}{2} \|O(x) - O(x')\|_1 \leq \frac{L_O}{2} \cdot \|x - x'\|_1$. The Lipschitz constant for $SP_i(x)$ is $\max\{L_i, \frac{L_O}{2}\}$. ∎

Proof to Lemma 4.5

Proof.

The proof leverages the fact that, as the sample size increases, the expected distance between $x$ and its nearest neighbour monotonically decreases. Letting $L_i$ denote the Lipschitz constant of $SP_i$, we have the estimation error $|\mathbb{E}[SP_i(NN_S(x))] - SP_i(x)| \leq \mathbb{E}[|SP_i(NN_S(x)) - SP_i(x)|] \leq L_i \cdot \mathbb{E}[dist(NN_S(x), x)]$, which approaches $0$ as $\mathbb{E}[dist(NN_S(x), x)]$ decreases. ∎

Proof to Lemma 4.6

Proof.

Lemma 4.5 shows that each $SP_i(NN_{S_k}(x))$ is an unbiased estimator of $SP_i(x)$, $1 \leq k \leq K$, as $s$ approaches infinity. Let $\sigma_i'^2$ denote the variance of $SP_i(NN_{S_k}(x))$ for each $k$. By the Central Limit Theorem, the distribution of the estimator $\frac{1}{K} \sum_{k=1}^{K} SP_i(NN_{S_k}(x))$ approaches a normal distribution with variance $\frac{\sigma_i'^2}{\sqrt{K}}$ [Chang et al., 2024]. ∎
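
The averaged estimator $\frac{1}{K} \sum_{k=1}^{K} SP_i(NN_{S_k}(x))$ can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: each sample $S_k$ is represented as a list of (feature vector, $SP_i$ value) pairs, the $l_\infty$ metric is used, and the function names and toy values are hypothetical.

```python
def linf_distance(a, b):
    """l_infinity distance between two equal-length feature vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def nearest_neighbour(query, sample):
    """Return the (feature, sp_value) pair in `sample` closest to `query`."""
    return min(sample, key=lambda item: linf_distance(query, item[0]))


def estimate_sp(query, samples):
    """Average the SP_i value of the query's nearest neighbour over K samples."""
    values = [nearest_neighbour(query, sample)[1] for sample in samples]
    return sum(values) / len(values)


# K = 3 hypothetical labelled samples; each entry is (feature, SP_i value).
samples = [
    [((0.1, 0.2), 1.0), ((0.9, 0.8), 0.0)],
    [((0.2, 0.1), 1.0), ((0.7, 0.9), 1.0)],
    [((0.0, 0.3), 0.0), ((0.8, 0.8), 0.0)],
]
estimate = estimate_sp((0.15, 0.15), samples)
```

For the query above, the nearest neighbours in the three samples carry the values 1.0, 1.0, and 0.0, so the estimate is their mean.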

Appendix C Experiment Details

C.1 Datasets

CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR-10 [Krizhevsky et al., 2009] contains 60,000 images of resolution $32 \times 32$, evenly divided into 10 classes, where 50,000 images are for training and 10,000 images are for testing. We randomly sample 20,000 images from the training set as our validation set, and we use the remaining 30,000 images to train our models.

CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html). Like CIFAR-10, CIFAR-100 [Krizhevsky et al., 2009] has 50,000 training and 10,000 testing images, but they are evenly divided into 100 classes. We randomly sample 20,000 training images as our validation set.

Tiny ImageNet (http://cs231n.stanford.edu/tiny-imagenet-200.zip). Tiny ImageNet [CS231n, ] is a subset of the ImageNet-1K dataset [Russakovsky et al., 2015]. It covers 200 class labels and all images are of resolution $64 \times 64$. It includes 100,000 training, 10,000 validation, and 10,000 testing images. The given test split does not have ground-truth labels, so we discard this set and use the validation split as our testing data. We randomly sample 40,000 training images as the validation data and use the remaining 60,000 to train the models.

ImageNet-1K (https://image-net.org/download.php). We use the image classification dataset in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [Russakovsky et al., 2015]. This dataset contains 1,281,167 training, 50,000 validation, and 100,000 testing images, covering 1,000 classes. Images are of various resolutions. Since the models we use are pre-trained on this dataset, we do not train the last linear layer of the models. The given test split comes without ground-truth labels; thus we use the validation split to evaluate our method and baselines. Among the 50,000 validation images, we randomly select 10,000 of them as our testing data and the remaining ones are treated as the validation data.
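
The random splits described in this section can be reproduced with a seeded shuffle of image indices. The sketch below shows the ImageNet-1K case (10,000 testing images out of the 50,000 validation images); the helper `split_indices` and the seed value are assumptions made for illustration, since the seed actually used is not stated.

```python
import random


def split_indices(num_items, num_held_out, seed=0):
    """Randomly partition range(num_items) into two disjoint index lists.

    Returns (held_out, remaining), where `held_out` has `num_held_out`
    entries. The default seed is arbitrary and only ensures repeatability.
    """
    rng = random.Random(seed)
    indices = list(range(num_items))
    rng.shuffle(indices)
    return indices[:num_held_out], indices[num_held_out:]


# ImageNet-1K: 10,000 of the 50,000 validation images become testing data,
# and the remaining 40,000 serve as validation data.
test_indices, validation_indices = split_indices(50_000, 10_000)
```

The same helper covers the other datasets, for example `split_indices(50_000, 20_000)` for the CIFAR validation sets.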

C.2 Models

We use ResNet [He et al., 2016] and Swin Transformer V2 (SwinV2) [Liu et al., 2022] models on the image classification task because they are popular models for the task and many of their pre-trained weights on the ImageNet-1K dataset [Russakovsky et al., 2015] are available online, where reasonable performance is achieved. For example, the pre-trained models we use are from https://pytorch.org/vision/stable/models.html. Specifically, the pre-trained weights we use are as follows.

  • ResNet-18: ResNet18_Weights.IMAGENET1K_V1
  • ResNet-34: ResNet34_Weights.IMAGENET1K_V1
  • ResNet-50: ResNet50_Weights.IMAGENET1K_V1
  • ResNet-101: ResNet101_Weights.IMAGENET1K_V1
  • SwinV2-T: Swin_V2_T_Weights.IMAGENET1K_V1
  • SwinV2-S: Swin_V2_S_Weights.IMAGENET1K_V1
  • SwinV2-B: Swin_V2_B_Weights.IMAGENET1K_V1

On CIFAR-10, CIFAR-100, and Tiny ImageNet, we freeze all pre-trained parameters and train only the last linear layer of each model from scratch. For all seven models, we use the Adam optimizer [Kingma and Ba, 2015] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a constant learning rate of $0.00001$, and a batch size of 500 for training. Models are trained until convergence.