OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference

Dujian Ding University of British Columbia Bicheng Xu University of British Columbia Laks V.S. Lakshmanan University of British Columbia

Abstract

Image classification is a fundamental building block for a majority of computer vision applications. With the growing popularity and capacity of machine learning models, people can easily access trained image classifiers as a service online or offline. However, model use comes with a cost and classifiers of higher capacity usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over image classification queries (termed as the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves $40\%$ cost reduction with little to no accuracy drop.

1 Introduction

With the breakthroughs in AI and advances in computer hardware (e.g., GPUs and TPUs) in recent decades, applications of computer vision have permeated our daily lives, ranging from face recognition systems to autonomous driving technologies. Among all the day-to-day computer vision applications, a fundamental building block is the task of image classification, where given an image, an algorithm needs to recognize the object content inside the image.

The task of image classification has a long history in the computer vision literature. Before the emergence of deep learning, people mainly focused on designing handcrafted features or descriptors for images, such as HOG [Dalal and Triggs, 2005] and SIFT [Lowe, 2004]. With the growing capability of deep learning models, many neural network architectures including convolutional neural networks (CNNs) [LeCun et al., 1998] and Transformers [Vaswani et al., 2017] have been proposed, e.g., AlexNet [Krizhevsky et al., 2012], ResNet [He et al., 2016], Vision Transformer [Dosovitskiy et al., 2020], and Swin Transformer [Liu et al., 2021]. Though larger neural network models are equipped with higher capacity, they often come with higher costs as well, e.g., hardware usage and latency (time), for both training and inference. This can potentially impose an enormous cost on both end users of image classification services and the service providers (e.g., Google (https://cloud.google.com/prediction), Amazon (https://aws.amazon.com/machine-learning), and Microsoft (https://studio.azureml.net)). In response to this challenge, there has been a notable surge in interest directed towards the development of smaller, cost-effective image classifiers, e.g., MobileNet [Howard et al., 2017], where depthwise separable convolutions are used to trade classification accuracy for efficiency. However, empirical evaluations conducted in [Su et al., 2018], as well as our own independent assessment (see Figure 1(a)), consistently indicate that smaller models tend to exhibit a gap in classification accuracy compared to their larger counterparts.

Figure 1(a): Accuracy vs. classifier size.

Confronted with the general tradeoff between classification accuracy and inference cost, we advocate a hybrid inference framework which seeks to combine the advantages of both small and large models. Specifically, we study the following problem: given a user-specified cost budget and a group of ML classifiers of different capacity and cost, assign classifiers to resolve different image classification queries so that the aggregated accuracy is maximized and the overall cost is under the budget. We formally define it as the optimal model portfolio problem (details in Section 3). Our approach is motivated by the observation that while small classifiers typically have reduced accuracy over the population, they can still agree with large classifiers on certain queries a large proportion of the time, which suggests the existence of a subset of “easy” queries on which even small classifiers can make the right prediction. This is also illustrated in Figure 1(b) where we plot the frequency with which different classifiers successfully make the right prediction on the same image queries. For instance, ResNet-18 [He et al., 2016] can correctly classify $75\%$ of the images on which SwinV2-B [Liu et al., 2022] makes the right prediction, suggesting that we can replace SwinV2-B with ResNet-18 on these image queries, saving significant inference costs without any accuracy drop (details in Section 5).

With this insight, we propose a principled approach, Optimization with Cost Constraints for Accuracy Maximization (OCCAM), to effectively identify easy queries and assign classifiers to different user queries to maximize the overall classification accuracy subject to the given cost budgets. We present an unbiased and low-variance estimator for classifier test accuracy with asymptotic guarantees. The intuition is that for well-separated classification problems such as image classification [Yang et al., 2020], we can learn robust classifiers that have similar performance on similar queries. For each query image, we compute its nearest neighbours in pre-computed samples to estimate the test accuracy for each classifier. Previous work [Chen et al., 2022] trains ML models to predict the accuracy, which requires sophisticated configuration and lacks performance guarantees that are critical in real-world scenarios. To the best of our knowledge, we are the first to open up the black box by developing a white-box accuracy estimator for ML classifiers with statistical guarantees. Next, armed with our classifier accuracy estimator, we compute the optimal classifier assignment strategy over all query images (optimal model portfolio) subject to a given cost budget by solving an integer linear programming (ILP) problem (see Section 4). As a preview, Figure 1(c) shows that OCCAM can achieve $20\%$ cost reduction with less than $1\%$ accuracy drop. We show even higher cost reduction with little to no accuracy drop on various real-world datasets in Section 5. Figure 2 depicts the overall pipeline of OCCAM.

Our main technical contributions are: (1) we formally define the optimal model portfolio problem to reduce overall inference costs while maintaining high performance subject to user-specified cost budgets (Section 3); (2) we propose a novel and principled approach, OCCAM, to effectively compute the optimal model portfolio with statistical guarantees (Section 4); and (3) we provide an extensive experimental evaluation on a variety of real-world datasets on the image classification task (Section 5) demonstrating the effectiveness of OCCAM.

2 Related Work

Image Classification. Image classification is a fundamental task in computer vision, where given an image, a label needs to be predicted. It serves as an essential building block for many high-level AI tasks, e.g., image captioning [Vinyals et al., 2015] and visual question answering [Antol et al., 2015], where objects need to be first recognized. Before the deep learning era, researchers mainly adopted statistical methods with handcrafted features for the task, e.g., SIFT [Lowe, 2004]. With the growing capacity of deep learning models, from convolutional neural networks (CNN) [Krizhevsky et al., 2012, Simonyan and Zisserman, 2014, He et al., 2016, Szegedy et al., 2016] to Transformer architectures [Dosovitskiy et al., 2020, Liu et al., 2021], the classification accuracy on standard image classification benchmarks [Krizhevsky et al., 2009, Russakovsky et al., 2015] has been greatly improved. In this work, we utilize both CNN (e.g., ResNet models) and Transformer (e.g., Swin Transformers) image classifiers to illustrate and evaluate our proposed approach, OCCAM.

Efficient Machine Learning (ML) Inference. Efficient ML inference is crucial for real-time decision-making in various applications such as autonomous vehicles [Tang et al., 2021], healthcare [Miotto et al., 2018], and fraud detection [Alghofaili et al., 2020]. It involves applying a pre-trained ML model to make predictions, where the inference cost is expected to dominate the overall cost incurred by the model. Model compression, which replaces a large model with a smaller model of comparable accuracy, is the most common approach employed for enhancing ML inference efficiency. Common techniques for model compression include model pruning [Hassibi et al., 1993, LeCun et al., 1989], quantization [Jacob et al., 2018, Vanhoucke et al., 2011], knowledge distillation [Hinton et al., 2015, Urban et al., 2016], neural architecture search [Elsken et al., 2019, Zoph and Le, 2016], and so on. These static efficiency optimizations typically lead to a fixed model with lower inference cost but also reduced accuracy compared to its larger counterpart, which may not suffice in highly sensitive applications like collision detection [Wang et al., 2021] and prognosis prediction [Zhu et al., 2020]. This shortcoming is already evident in the inference platforms discussed in Section 1, highlighting the need for more dynamic optimizations to effectively address the diverse demands of users.

Hybrid ML Inference. Recent works [Kag et al., 2022, Ding et al., 2022, 2024] have introduced a novel inference paradigm termed hybrid inference, which invokes models of different sizes on different queries, as opposed to employing a single model on all inference queries. The smaller model generally incurs a lower inference cost but also exhibits reduced accuracy compared to the larger model. The key idea is to identify easy inference queries on which the small models are likely to make correct predictions and invoke small models on them when cost budgets are limited, thereby reducing overall inference costs while preserving solution accuracy. By adjusting the cost budgets, users can dynamically trade off between accuracy and cost within the same inference setup. [Kag et al., 2022, Ding et al., 2022, 2024] consider a simple setting of only one large and one small model and do not allow for explicit cost budget specification, which could be necessary for production scenarios. [Chen et al., 2020] studies a setup with multiple ML models and learns an adaptive strategy to generate predictions by calling a base model and sometimes an add-on model when the base model quality scores are lower than the learned thresholds. However, both base and add-on models are selected in a probabilistic manner and this approach fails to satisfy the user-specified cost budgets deterministically. [Chen et al., 2022] studies a similar setup with multiple ML models and allocates cost budgets according to model-based accuracy prediction. This approach requires a separate training phase for the accuracy predictor, which needs a large amount of training data, and provides no guarantee on the prediction quality. Unlike previous works, we propose an unbiased and low-variance accuracy estimator with asymptotic guarantees, based on which we present a novel approach, OCCAM, to effectively compute the optimal assignment of classifiers to given queries, under the cost budgets given by users.

3 Problem Definition

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be an instance space (e.g., images) equipped with a metric $dist$ : $\mathcal{X}\times\mathcal{X}\to\mathbb{R}^{\geq 0}$ , and $[C]=\{1,2,\cdots,C\}$ be the set of possible labels with $C\geq 2$ . Let $\mathcal{X}$ contain $C$ disjoint classes, $\mathcal{X}^{(1)},\mathcal{X}^{(2)},\cdots,\mathcal{X}^{(C)}$ where for each $i\in[C]$ , all $x\in\mathcal{X}^{(i)}$ have label $i$ . Let $f_{1},f_{2},\cdots,f_{M}$ be a set of classifiers, with $b_{i}$ being the cost of a single inference call of $f_{i}$ . Given a query $x\in\mathcal{X}$ , each classifier $f_{i}$ outputs a single label from $[C]$ at the cost $b_{i}$ . We define a model portfolio as follows.

Definition 3.1 (Model Portfolio).

Given queries $X\subseteq\mathcal{X}$ to be classified and classifiers $f_{1},f_{2},\cdots,f_{M}$, a model portfolio $\mu$ is a mapping $\mu:X\to[M]$ such that each $x\in X$ is classified by the classifier $f_{\mu(x)}$.

We assume an oracle classifier $O$ : $\mathcal{X}\to[C]$ which outputs the ground truth label $O(x)$ for all queries $x\in X$ . Given a finite set of queries $X\subseteq\mathcal{X}$ , the accuracy of a model portfolio $\mu$ on $X$ is the frequency of the ground truth labels correctly predicted by $\mu$ , i.e., $Accuracy_{X}(\mu)=\frac{\sum_{x\in X}\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}}{|X|}$ , where $\mathbbm{1}\{condition\}$ is an indicator function that outputs $1$ iff $condition$ is satisfied. Similarly, the cost of model portfolio $\mu$ on $X$ is the sum of all inference costs incurred by executing $\mu$ on $X$ , i.e., $Cost_{X}(\mu)=\sum\nolimits_{x\in X}b_{\mu(x)}$ . We will use the notation $Accuracy(\mu)$ , $Cost(\mu)$ when $X$ is clear from the context. We define our problem as follows.

Definition 3.2 (Optimal Model Portfolio).

Given queries $X\subseteq\mathcal{X}$ , a cost budget $B\in\mathbb{R}^{+}$ , and classifiers $f_{1},f_{2},\cdots,f_{M}$ , find the optimal model portfolio $\mu^{*}$ such that $Cost(\mu^{*})\leq B$ and $Accuracy(\mu^{*})\geq Accuracy(\mu),$ for all model portfolios $\mu$ with $Cost(\mu)\leq B$ .
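Definitions 3.1 and 3.2 can be made concrete with a small sketch. The Python below is illustrative only: the toy classifiers, the integer costs, and the helper names `portfolio_accuracy`, `portfolio_cost`, and `is_feasible` are assumptions introduced here, and classifiers are indexed from 0 rather than from 1 as in $[M]$.

```python
from typing import Callable, Dict, List

Classifier = Callable[[int], int]

def portfolio_accuracy(queries: List[int], portfolio: Dict[int, int],
                       classifiers: List[Classifier],
                       oracle: Classifier) -> float:
    """Accuracy_X(mu): fraction of queries whose assigned classifier
    returns the oracle (ground-truth) label."""
    hits = sum(classifiers[portfolio[x]](x) == oracle(x) for x in queries)
    return hits / len(queries)

def portfolio_cost(queries: List[int], portfolio: Dict[int, int],
                   costs: List[float]) -> float:
    """Cost_X(mu): sum of per-call costs b_{mu(x)} over all queries."""
    return sum(costs[portfolio[x]] for x in queries)

def is_feasible(queries, portfolio, costs, budget) -> bool:
    """A portfolio is admissible when its total cost is within budget B."""
    return portfolio_cost(queries, portfolio, costs) <= budget

# Hypothetical toy setup: labels are the parity of the query id.
oracle = lambda x: x % 2
classifiers = [lambda x: 0,        # cheap model: always predicts label 0
               lambda x: x % 2]    # expensive model: always correct
costs = [1, 4]                     # made-up cost units per call
queries = [0, 1, 2, 3]
mu = {0: 0, 1: 1, 2: 0, 3: 0}      # portfolio: query -> classifier index

print(portfolio_accuracy(queries, mu, classifiers, oracle))
print(portfolio_cost(queries, mu, costs))
```

In this toy instance the portfolio sends only query 1 to the expensive model, so it is wrong on query 3 alone. The optimal portfolio of Definition 3.2 is the feasible map that maximizes the first quantity.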

4 Methodology

We describe the general framework to solve the optimal model portfolio problem in the next sections. Our overall strategy consists of two steps. Firstly, we propose an unbiased low-variance estimator for the accuracy of any given model portfolio $\mu$ , with asymptotic guarantees. Next, we describe how to determine the optimal model portfolio $\mu^{*}$ by formulating it as an integer linear programming (ILP) problem, subject to user-specified budget constraints. All proofs can be found in Appendix B.

4.1 Estimating $Accuracy(\mu)$

Previous work on Hybrid ML [Kag et al., 2022, Ding et al., 2022, 2024, Chen et al., 2022] typically relies on training a neural router to predict the accuracy of a given set of classifiers for given user queries, based on which queries are routed to different classifiers. Such a paradigm not only involves a non-trivial training configuration but also lacks estimation guarantees which can be critical in scientific and production settings. We propose a principled approach to estimate the test accuracy of a given model portfolio for given user queries. By leveraging the specific structure of well-separated classification problems like image classification, we propose an unbiased low-variance estimator for the test accuracy with asymptotic guarantees.

Without loss of generality, we consider the wide class of soft classifiers in this study. Given query $x\in\mathcal{X}$, a soft classifier first outputs a distribution over all labels $[C]$, based on which it then makes a prediction at random. Given a soft classifier $f_{i}$, we abuse the notation and let $f_{i}(x)[j]$ denote the likelihood that $f_{i}$ predicts label $j\in[C]$, that is, $f_{i}(x)[j]:=Pr[f_{i}(x)=j]$, for query $x\in\mathcal{X}$. Deterministic classifiers (e.g., the oracle $O$) can be seen as a special case of soft classifiers with a one-hot distribution over all labels. In practice, from softmax classifiers (e.g., ResNet), soft classifiers can be constructed by simply sampling w.r.t. the probability distribution output by the softmax layer.
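As an illustration of the construction just described, the following sketch turns a vector of logits into a soft classifier by sampling from the softmax distribution. The function names and the example logits are hypothetical; the temperature parameter `tau` mirrors the temperature-scaled softmax used in the evaluation setup (Section 5.1).

```python
import math
import random

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_predict(logits, tau, rng):
    """Sample one label index according to the softmax distribution."""
    probs = softmax(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.5]           # made-up logits for a 3-class problem
print(softmax(logits))             # f(x)[j] for j = 0, 1, 2
print(soft_predict(logits, 1.0, rng))
```

As `tau` shrinks, the distribution concentrates on the largest logit and sampling approaches a deterministic arg max prediction.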

Clearly, given a model portfolio $\mu$ , $Accuracy(\mu)$ is a random variable due to the random nature of the soft classifiers. The expected accuracy of any given model portfolio $\mu$ is,

$\mathbb{E}[Accuracy(\mu)]=\mathbb{E}\left[\frac{\sum_{x\in X}\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}}{|X|}\right]=\frac{\sum_{x\in X}\mathbb{E}[\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}]}{|X|}=\frac{\sum_{x\in X}f_{\mu(x)}(x)[O(x)]}{|X|}$

(1)

where the last equality follows from $\mathbb{E}[\mathbbm{1}\{f_{\mu(x)}(x)=O(x)\}]=1\cdot Pr[f_{\mu(x)}(x)=O(x)]=f_{\mu(x)}(x)[O(x)]$. Note that $f_{i}(x)[O(x)]$ is the success probability that the classifier $f_{i}$ correctly predicts the ground truth label for query $x$. For brevity, we define $SP_{i}(x):=f_{i}(x)[O(x)]$ and rewrite the expected accuracy as $\mathbb{E}[Accuracy(\mu)]=\frac{\sum_{x\in X}SP_{\mu(x)}(x)}{|X|}$.
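The rewritten form of Equation 1 is simply an average of success probabilities along the portfolio. A minimal sketch, assuming the success probabilities are already available as a table `sp[i][j]` $=SP_{i}(x_{j})$ (the numbers below are made up, and indices are 0-based):

```python
def expected_accuracy(sp, portfolio):
    """Mean of SP_{mu(x_j)}(x_j) over all queries j.

    sp[i][j]     : success probability of classifier i on query j
    portfolio[j] : index of the classifier assigned to query j
    """
    n = len(portfolio)
    return sum(sp[portfolio[j]][j] for j in range(n)) / n

# Hypothetical success probabilities for 2 classifiers and 3 queries.
sp = [[0.5, 1.0, 0.25],
      [1.0, 1.0, 0.75]]
portfolio = [1, 0, 1]
print(expected_accuracy(sp, portfolio))
```

Here query 1 is "easy" (both classifiers succeed with probability 1), so assigning it to classifier 0 loses nothing in expectation.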

The exact computation of success probability is intractable since the ground truth of user queries is unknown a priori. We propose a novel data-driven approach to estimate it for any classifier and show that our estimator is unbiased and low-variance with asymptotic guarantees, for well-separated classification problems like image classification. Based on this, we develop a principled approach for estimating the expected accuracy of a given model portfolio.

Definition 4.1 ( $r$ -separation [Yang et al., 2020]).

We say a metric space $(\mathcal{X},dist)$ where $\mathcal{X}=\cup_{i\in[C]}\mathcal{X}^{(i)}$ is $r$-separated, if there exists a constant $r>0$ such that $dist(\mathcal{X}^{(i)},\mathcal{X}^{(j)})\geq r$, $\forall i\neq j$, where $dist(\mathcal{X}^{(i)},\mathcal{X}^{(j)})=\min_{x\in\mathcal{X}^{(i)},x^{\prime}\in\mathcal{X}^{(j)}}dist(x,x^{\prime})$.

In words, in an $r$-separated metric space, there is a constant $r>0$ such that the distance between instances from different classes is at least $r$. The key observation is that many real-world classification tasks consist of distinct classes. For instance, images of different categories (e.g., gold fish, bullfrog, etc.) are very unlikely to sharply change their classes under minor image modification. It has been widely observed [Yang et al., 2020] that the classification problem on real-world images empirically satisfies $r$-separation under standard metrics (e.g., $l_{\infty}$ norm). We also observe similar patterns on a number of standard image datasets (e.g., Tiny ImageNet) and provide more empirical evidence in Section A.1. With this observation, we can show that the oracle classifier $O$ is Lipschitz continuous [Eriksson et al., 2004] (for soft classifiers, Lipschitz continuity is defined w.r.t. the output distribution).
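On a finite labelled sample, an empirical proxy for the separation constant is the smallest distance between any two points carrying different labels. The sketch below computes it under the $l_{\infty}$ metric; the points, labels, and function names are hypothetical, and a finite sample can only upper-bound the true $r$ of Definition 4.1.

```python
import math

def linf(a, b):
    """l_infinity distance between two equal-length feature vectors."""
    return max(abs(p - q) for p, q in zip(a, b))

def min_interclass_distance(points, labels, dist=linf):
    """Smallest distance between two points with different labels.

    Returns math.inf when all points share one label."""
    best = math.inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] != labels[j]:
                best = min(best, dist(points[i], points[j]))
    return best

# Made-up 2-D features for two classes.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.3)]
labels = [0, 0, 1, 1]
print(min_interclass_distance(points, labels))
```

A strictly positive value on a large sample is consistent with, though not proof of, $r$-separation.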

Definition 4.2 (Lipschitz Continuity).

Given two metric spaces $(\mathbf{X},d_{\mathbf{X}})$ and $(\mathbf{Y},d_{\mathbf{Y}})$ where $d_{\mathbf{X}}$ (resp. $d_{\mathbf{Y}}$ ) is the metric on the set $\mathbf{X}$ (resp. $\mathbf{Y}$ ), a function $f$ : $\mathbf{X}\to\mathbf{Y}$ is Lipschitz continuous if there exists a constant $L\geq 0$ s.t.

\forall x,x^{\prime}\in\mathbf{X}:\;\;d_{\mathbf{Y}}(f(x),f(x^{\prime}))\leq L\cdot d_{\mathbf{X}}(x,x^{\prime})

(2)

and the smallest $L$ satisfying Equation 2 is called the Lipschitz constant of $f$ .

Lemma 4.3.

There exists an oracle classifier $O$ which is Lipschitz continuous if the metric space associated with the instances $\mathcal{X}$ is $r$ -separated.

If we further choose the classifiers $f_{i}$ to be Lipschitz continuous (e.g., MLP [Bartlett et al., 2017], ResNet [Gouk et al., 2021], Lipschitz continuous Transformer [Qi et al., 2023]), we can show that the success probability function $SP_{i}(x)$ (i.e., the likelihood that a classifier $f_{i}$ successfully predicts the ground truth label for query $x$ ) is also Lipschitz continuous.

Lemma 4.4.

The success probability function $SP_{i}(x)=f_{i}(x)[O(x)]$ is Lipschitz continuous if $f_{i}(x)$ and $O(x)$ are Lipschitz continuous.

An important implication of Lemma 4.4 is that, given a classifier, we can estimate its success probability on query $x$ by its success probability on a similar query $x^{\prime}$. Let $L_{i}>0$ denote the Lipschitz constant for $SP_{i}$. For any $x,x^{\prime}\in X$, we have the estimation error bounded by $|SP_{i}(x^{\prime})-SP_{i}(x)|\leq L_{i}\cdot dist(x,x^{\prime})$, which monotonically decreases as $dist(x,x^{\prime})$ approaches $0$ (we evaluate the nearest neighbour distance and estimation error in Sections A.2 and A.3). In practice, we can pre-compute a labelled sample $S\subset\mathcal{X}$ (e.g., pre-compute classifier outputs on sampled queries from the validation set) and compute $NN_{S}(x)$, the nearest neighbour of $x$ in $S$, for success probability estimation. We show that the estimator is asymptotically unbiased as the sample size increases.

Lemma 4.5 (Asymptotically Unbiased Estimator).

Given query $x$ , a classifier $f_{i}$ , and uniformly sampled $S\subset\mathcal{X}$ ,

\lim_{s\to\infty}\mathbb{E}[SP_{i}(NN_{S}(x))]=SP_{i}(x)

(3)

where $s$ is the sample size and $NN_{S}(x):=\arg\min_{x^{\prime}\in S}dist(x,x^{\prime})$ is the nearest neighbour of $x$ in sample $S$ .

In practice, we draw $K$ i.i.d. samples, $S_{1},S_{2},\cdots,S_{K}$ , and compute the average sample accuracy $\widehat{SP}_{i}(x):=\frac{1}{K}\sum_{k=1}^{K}SP_{i}(NN_{S_{k}}(x))$ as the estimator of the test accuracy on query $x$ , for each classifier $f_{i}$ . It follows from Lemma 4.5 that $\widehat{SP}_{i}$ is also an asymptotically unbiased estimator. We further show below that $\widehat{SP}_{i}$ is an asymptotically low-variance estimator to $SP_{i}$ , as $K$ increases.

Lemma 4.6 (Asymptotically Low-Variance Estimator).

Given query $x$, a classifier $f_{i}$, and $K$ i.i.d. uniformly drawn samples $S_{1},S_{2},\cdots,S_{K}$ of size $s$, let $\sigma_{i}$ denote the standard deviation of the estimator $\widehat{SP}_{i}(x)$. We have that $\sigma_{i}$ is asymptotically proportional to $\frac{1}{\sqrt{K}}$ (equivalently, the variance $\sigma_{i}^{2}$ is asymptotically proportional to $\frac{1}{K}$) as both $s$ and $K$ increase.
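The dependence on $K$ can be seen from the standard sample-mean calculation. The short derivation below is a sketch rather than the proof in Appendix B; it assumes that the per-sample estimates $Y_{k}:=SP_{i}(NN_{S_{k}}(x))$ are i.i.d. with a common variance, written here as $v_{i}(s)$ (a symbol introduced only for this illustration).

```latex
% Assumption: Y_k := SP_i(NN_{S_k}(x)), k = 1, ..., K, are i.i.d. with variance v_i(s).
\mathrm{Var}\big[\widehat{SP}_{i}(x)\big]
  = \mathrm{Var}\Big[\frac{1}{K}\sum_{k=1}^{K} Y_{k}\Big]
  = \frac{1}{K^{2}}\sum_{k=1}^{K}\mathrm{Var}[Y_{k}]
  = \frac{v_{i}(s)}{K},
\qquad
\sqrt{\mathrm{Var}\big[\widehat{SP}_{i}(x)\big]}
  = \sqrt{\frac{v_{i}(s)}{K}} \;\propto\; \frac{1}{\sqrt{K}}.
```

Under this assumption, the variance of the averaged estimator shrinks like $1/K$ and its standard deviation like $1/\sqrt{K}$, so quadrupling the number of samples halves the standard deviation.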

4.2 Computing $\mu^{*}$ with $Accuracy(\mu)$

In the previous section, we showed how to estimate the accuracy of a given model portfolio. For each classifier $f_{i}$ and query $x$, we estimate its success probability $SP_{i}(x)$ by $\widehat{SP}_{i}(x)$, which is computed from similar queries in labelled samples and can be efficiently pre-computed.
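The estimator $\widehat{SP}_{i}(x)$ recapped above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each labelled sample is a list of `(features, success)` pairs, where `success` is the classifier's pre-computed success probability on that labelled item; the $l_{\infty}$ metric is used; and `draw_samples` draws without replacement within each sample. All names and numbers are hypothetical.

```python
import random

def linf(a, b):
    """l_infinity distance between two feature vectors."""
    return max(abs(p - q) for p, q in zip(a, b))

def nearest_neighbour(x, sample, dist=linf):
    """Return the (features, success) pair in `sample` closest to x."""
    return min(sample, key=lambda item: dist(x, item[0]))

def estimate_sp(x, samples, dist=linf):
    """Average, over the K samples, of the success value stored at the
    nearest neighbour of x in each sample."""
    return sum(nearest_neighbour(x, S, dist)[1] for S in samples) / len(samples)

def draw_samples(pool, s, K, rng):
    """Draw K uniform samples of size s from a labelled pool."""
    return [rng.sample(pool, s) for _ in range(K)]

# Two tiny hand-made samples (K = 2) for one classifier.
S1 = [((0.1, 0.0), 1.0), ((2.0, 2.0), 0.0)]
S2 = [((0.0, 0.3), 0.0), ((5.0, 5.0), 1.0)]
x = (0.0, 0.0)
print(estimate_sp(x, [S1, S2]))
```

Because the pairs depend only on the labelled pool and the classifier, they can be computed once and reused for every incoming query; only the nearest-neighbour lookup happens at query time.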

With the estimator in place, we formulate the problem of finding the optimal model portfolio as an integer linear programming (ILP) problem as follows. Given a set of $M$ classifiers $f_{1},f_{2},\cdots,f_{M}$, user queries $X=\{x_{1},x_{2},\cdots,x_{N}\}$, pre-computed samples $S_{1},S_{2},\cdots,S_{K}$, and budget $B\in\mathbb{R}^{+}$, we have the following ILP problem. (Our problem can be rephrased as “selecting for each query image, one item (i.e., ML classifier) from a collection (the set of all classifiers) so as to maximize the total value (accuracy) while adhering to a predefined weight limit (cost budget)”, which is a classic multiple choice knapsack problem (MCKP) [Kellerer et al., 2004], and the ILP formulation is the natural choice.)

\begin{split}\max\quad&\sum_{i=1}^{M}\sum_{j=1}^{N}\widehat{SP}_{i}(x_{j})\cdot t_{i,j}\\ \textrm{s.t.}\quad&\sum_{i=1}^{M}\sum_{j=1}^{N}b_{i}\cdot t_{i,j}\leq B\\ &\sum_{i=1}^{M}t_{i,j}=1\text{ for }j=1,2,\cdots,N,\text{ and }t_{i,j}\in\{0,1\}\end{split}

(4)

where $t_{i,j}$ are boolean variables and $t_{i,j}=1$ iff the classifier $f_{i}$ is assigned to query $x_{j}$ . Clearly, the optimal model portfolio $\mu^{*}$ can be efficiently computed as $\mu^{*}(x_{j})=i$ iff $t^{*}_{i,j}=1$ , for $i\in[M]$ and $j\in[N]$ , where $t^{*}_{i,j}$ is the optimal solution to the ILP problem above. While ILP problems are NP-hard in general, we can use standard ILP solvers (e.g., HiGHS [Huangfu and Hall, 2018]) to efficiently compute the optimal solution in practice.
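To make the structure of Equation 4 tangible, the sketch below solves a tiny instance by exhaustive enumeration. This is not the solver used in the paper (HiGHS); it is a reference implementation that is exponential in $N$ and only suitable for toy inputs. The scores, costs, and the name `solve_portfolio_bruteforce` are made up for illustration, and indices are 0-based.

```python
from itertools import product

def solve_portfolio_bruteforce(scores, costs, budget):
    """Enumerate every assignment of one classifier per query.

    scores[i][j] : estimated success probability of classifier i on query j
    costs[i]     : cost b_i of one call to classifier i
    Returns (assignment, objective); assignment is None if nothing fits.
    """
    num_models, num_queries = len(scores), len(scores[0])
    best_assignment, best_value = None, float("-inf")
    for assignment in product(range(num_models), repeat=num_queries):
        total_cost = sum(costs[i] for i in assignment)
        if total_cost > budget:          # budget constraint of Equation 4
            continue
        value = sum(scores[i][j] for j, i in enumerate(assignment))
        if value > best_value:
            best_assignment, best_value = assignment, value
    return best_assignment, best_value

# Hypothetical instance: 2 classifiers, 3 queries, integer cost units.
scores = [[0.9, 0.2, 0.6],      # cheap classifier
          [0.95, 0.9, 0.7]]     # expensive classifier
costs = [1, 4]
assignment, value = solve_portfolio_bruteforce(scores, costs, budget=6)
print(assignment, value)
```

With a budget that allows only one expensive call, the expensive classifier goes to the query where it improves the estimated success probability the most. The "one classifier per query" constraint is enforced implicitly by enumerating tuples of length $N$.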

The optimization problem aims to maximize the estimated model portfolio accuracy and is subject to the risk of overestimation due to selection bias, especially on large-scale problems. Intuitively, a poor classifier with high-variance estimates can be mistakenly assigned to some queries if its performance on those queries is overestimated. We address this by penalizing the accuracy estimate of each classifier by the standard deviation of the corresponding estimator. Specifically, we optimize the objective $\sum_{i=1}^{M}\sum_{j=1}^{N}(\widehat{SP}_{i}(x_{j})-\lambda\cdot\sigma_{i})\cdot t_{i,j}$ in Equation 4, where $\sigma_{i}$ is the standard deviation of the estimator $\widehat{SP}_{i}$. As $\sigma_{i}$ is unknown a priori, we use a validation set to estimate $\sigma_{i}$ for each classifier $f_{i}$ and tune $\lambda$ for the highest validation accuracy.
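The effect of the penalty term can be illustrated with a toy example. The sketch below subtracts $\lambda\cdot\sigma_{i}$ from every score of classifier $i$ and shows how the preferred classifier per query can change. The values of `scores`, `sigma`, and `lam` are invented for illustration, and how $\sigma_{i}$ is estimated on the validation set is not shown.

```python
def penalised_scores(scores, sigma, lam):
    """Apply the regularised objective: score[i][j] - lam * sigma[i]."""
    return [[s - lam * sigma[i] for s in row] for i, row in enumerate(scores)]

def best_classifier_per_query(scores):
    """Unconstrained arg max over classifiers for every query (no budget)."""
    num_queries = len(scores[0])
    return [max(range(len(scores)), key=lambda i: scores[i][j])
            for j in range(num_queries)]

# Hypothetical estimates: classifier 0 looks good on query 0 but is noisy.
scores = [[0.75, 0.5],
          [0.5, 1.0]]
sigma = [0.5, 0.0]     # made-up estimator standard deviations
lam = 1.0

print(best_classifier_per_query(scores))
print(best_classifier_per_query(penalised_scores(scores, sigma, lam)))
```

Since the penalty is constant per classifier, it shifts all of that classifier's scores by the same amount and leaves the constraints of Equation 4 unchanged.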

5 Evaluation

5.1 Evaluation Setup

Task. We consider the image classification task: given an image, predict a class label from a set of predefined class categories. We assume that each image has a unique ground-truth class label.

Datasets. We consider 4 widely studied datasets for image classification: CIFAR-10 (10 classes) [Krizhevsky et al., 2009], CIFAR-100 (100 classes) [Krizhevsky et al., 2009], Tiny ImageNet (200 classes) [CS231n], and ImageNet-1K (1000 classes) [Russakovsky et al., 2015]. Both CIFAR-10 and CIFAR-100 contain $50,000$ training images and $10,000$ test images. Tiny ImageNet contains $100,000$ training images and $10,000$ validation images, and ImageNet-1K has $1,281,167$ training images and $50,000$ validation images. We use the test splits of CIFAR-10 and CIFAR-100 as well as the validation splits of Tiny ImageNet and ImageNet-1K for evaluation purposes. Details of these datasets are in the appendix.

Models. We consider a total of 7 classifiers: ResNet-[18, 34, 50, 101] [He et al., 2016] (the numbers in brackets indicate the number of layers) and SwinV2-[T, S, B] [Liu et al., 2022] (the letters in brackets indicate the Swin Transformer V2 size: T/S/B means tiny/small/base). Among these classifiers, ResNet-18 is the smallest (in terms of number of model parameters and training/inference time) and thus has the least capacity, while SwinV2-B is the largest and has the highest accuracy in general (see Figure 1(a)). We take the classifiers pre-trained on the ImageNet-1K dataset [Russakovsky et al., 2015]. On ImageNet-1K we use the pre-trained models directly, while on the other datasets we freeze all layers except the last one, which we train from scratch. The output dimension of the last layer is set to be the same as the number of image classes in the test dataset. We implement the soft classifier (see Section 4.1) by sampling w.r.t. the probability distribution output by the softmax layer, i.e., $Pr[f_{i}(x)=j]=\frac{\exp(z_{j}/\tau)}{\sum_{k}\exp(z_{k}/\tau)}$, where $z_{k}$ is the logit for $k\in[C]$ and $\tau$ is the temperature hyper-parameter controlling the randomness of predictions. We choose a small $\tau$ (1e-3) to reduce the variance in predictions. At test time, to obtain consistent results, all classifiers make predictions by outputting the most likely class labels (i.e., $\arg\max_{j}f_{i}(x)[j]$), which is equivalent to having soft classifiers with $\tau\to 0$. Model training details are in Section C.2. All experiments are conducted with one NVIDIA V100 GPU with 32GB of GPU RAM. Code will be released upon acceptance.

Inference Cost. The absolute costs of running a model may be expressed using a variety of metrics, including FLOPs, latency, dollars, etc. While FLOPs is an important metric that has the advantage of being hardware independent, it has been found to not correlate well with wall-clock latency, energy consumption, and dollar costs, which are of more practical interest to end users [Dao et al., 2022]. In practice, dollar costs usually highly correlate with inference latency on GPUs. In our work, we define the cost of model inference in USD. We approximate the inference cost of computation by taking the cost per hour ($3.06) of the Azure Machine Learning (AML) NC6s v3 instance [AzureML, 2024], as summarized in Table 1. The AML NC6s v3 instance contains a single V100 GPU and is commonly used for deep learning. Since CPU resources are significantly cheaper than GPU (e.g., the D2s v3 instance, equipped with 2 CPUs and no GPU, costs $0.096 per hour [AzureML, 2024]) and all methods studied in this work typically finish in several CPU-seconds, incurring negligible expenses, we ignore the costs incurred by CPU in our comparison. In addition, since larger models typically have higher accuracy as well as higher costs (see Figure 1(a)), a practically interesting setting is to study how to deliver high quality answers with reduced costs in comparison to solely using the largest model (e.g., SwinV2-B). Normalized cost directly indicates the percentage cost saved and has been widely adopted in previous works [Ding et al., 2024, Kag et al., 2022], following which we report all results in terms of the normalized cost of each classifier.

Models	Latency (s)	Prices ($)	Normalized Cost
ResNet-18	88.9	0.076	0.15
ResNet-34	135.9	0.116	0.22
ResNet-50	174.5	0.148	0.29
ResNet-101	317.4	0.270	0.52
SwinV2-T	326.4	0.277	0.53
SwinV2-S	600.7	0.511	0.98
SwinV2-B	610.6	0.519	1

Table 1: Model costs on the image classification task. Latency and prices are measured for 10,000 queries. Normalized cost is the fraction of the price w.r.t. SwinV2-B.
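The price and normalized-cost columns of Table 1 follow from the measured latencies and the hourly instance price quoted in the text. The sketch below reproduces that arithmetic, along with the budget corresponding to a given cost reduction; the latencies and the $3.06 hourly price are taken from the paper, while the variable and function names are introduced here for illustration.

```python
HOURLY_PRICE_USD = 3.06  # NC6s v3 price per hour, as quoted in the text

# Latency in seconds for 10,000 queries, copied from Table 1.
LATENCY_S = {
    "ResNet-18": 88.9,
    "ResNet-34": 135.9,
    "ResNet-50": 174.5,
    "ResNet-101": 317.4,
    "SwinV2-T": 326.4,
    "SwinV2-S": 600.7,
    "SwinV2-B": 610.6,
}

def price_usd(latency_s, hourly_price=HOURLY_PRICE_USD):
    """Dollar cost of occupying the instance for `latency_s` seconds."""
    return latency_s / 3600.0 * hourly_price

prices = {m: price_usd(t) for m, t in LATENCY_S.items()}
normalized = {m: p / prices["SwinV2-B"] for m, p in prices.items()}

def budget_for_reduction(full_cost, reduction):
    """Budget left after cutting `reduction` (e.g. 0.10) of the full cost."""
    return full_cost * (1.0 - reduction)

for model in LATENCY_S:
    print(f"{model:<11} {prices[model]:.3f} {normalized[model]:.2f}")
print(f"{budget_for_reduction(prices['SwinV2-B'], 0.10):.3f}")
```

Rounded to three and two decimals respectively, the computed prices and normalized costs agree with Table 1, and a 10% reduction of the SwinV2-B cost gives the $0.467 budget mentioned in the caption of Table 2.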

ILP Solver. While our approach is agnostic to the choice of the ILP solver, we choose the high-performance ILP solver, HiGHS [Huangfu and Hall, 2018] to solve the problem in Equation 4, given its well-demonstrated efficiency and effectiveness on public benchmarks [Gleixner et al., 2021]. In a nutshell, HiGHS solves ILP problems with branch-and-cut algorithms [Fischetti and Monaci, 2020] and stops whenever the gap between the current solution and the global optimum is small enough (e.g., 1e-6).

Baselines. We compare our approach with three baselines: single best, random, and FrugalMCT [Chen et al., 2022]. Single best always chooses the strongest (i.e., most expensive) model for a given cost budget. Random estimates classifier accuracy with random guesses (i.e., uniform samples from $[0,1]$ ) and solves the problem in Equation 4 with the same ILP solver as ours. FrugalMCT [Chen et al., 2022] is a recent work which selects the best ML models for given user budgets in an online setting, using model-based accuracy estimation. Following the same setting in [Chen et al., 2022], we train random forest regressors on top of the model-extracted features (e.g., ResNet-18 features), as the accuracy predictor. The predicted accuracy is used in Equation 4, which is solved by the same ILP solver as ours.
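As an illustration of the simplest baseline, the sketch below implements one reading of single best: pick the most expensive model whose normalized per-query cost fits within the normalized budget, and use it for all queries. The normalized costs come from Table 1; the function name and the interpretation of "for a given cost budget" as a per-query normalized budget are assumptions made here.

```python
# Normalized per-query costs from Table 1.
NORMALIZED_COST = {
    "ResNet-18": 0.15,
    "ResNet-34": 0.22,
    "ResNet-50": 0.29,
    "ResNet-101": 0.52,
    "SwinV2-T": 0.53,
    "SwinV2-S": 0.98,
    "SwinV2-B": 1.0,
}

def single_best(normalized_budget, costs=NORMALIZED_COST):
    """Most expensive model that is affordable for every query.

    Returns None when even the cheapest model exceeds the budget."""
    affordable = {m: c for m, c in costs.items() if c <= normalized_budget}
    if not affordable:
        return None
    return max(affordable, key=affordable.get)

for budget in (1.0, 0.9, 0.6, 0.5):
    print(budget, single_best(budget))
```

Under this reading, budgets of 0.9 and 0.6 (10% and 40% cost reduction) both select SwinV2-T, because SwinV2-S at 0.98 is only affordable when the budget is almost the full cost.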

Our Method. We evaluate OCCAM (see Section 4) under various metrics (i.e., $l_{1}$, $l_{2}$, and $l_{\infty}$ norms) and cost budgets. We consider images represented by model-based embeddings. Specifically, we extract the image feature (i.e., the last-layer output of an ML model, e.g., ResNet-18, trained on the target dataset, given an input image) of the query image and all the validation images. The costs incurred by feature extraction are deducted from the user budget $B$ before we compute the optimal model portfolio. We report the test accuracy under different cost budgets for OCCAM and all baselines in Section 5.2 (Figure 3 and Table 2), validate that OCCAM is cost-aware and indeed selects the most profitable ML models to deliver high accuracy solutions in Section 5.3 (Figure 4(a)), demonstrate the effectiveness of OCCAM with limited samples in Section 5.4 (Figure 4(b)), investigate the nearest neighbour distance with different sample sizes in Section A.2, show that the estimation error of our accuracy estimator quickly decreases as the sample size increases in Section A.3, test the generalizability of OCCAM with different feature extractors in Section A.4, and provide more performance results under different metrics in Section A.5.

For simplicity, unless otherwise stated, we report OCCAM performance using ResNet-18 features and the $l_{\infty}$ metric with $K=40$ for all datasets ($s=500$ for CIFAR-10 and CIFAR-100, and $s=1000$ for Tiny ImageNet and ImageNet-1K). We choose $\lambda=100$ for ImageNet-1K and $\lambda=5$ for all other datasets because ImageNet-1K contains a high variety of image classes (1000 classes), which leads to relatively high estimation errors and requires a stronger regularization penalty via large $\lambda$ values.

5.2 Performance Results

Accuracy Drop (%)

| Cost Reduction (%) | CIFAR10: Single Best | CIFAR10: Rand | CIFAR10: FrugalMCT | CIFAR10: OCCAM | CIFAR100: Single Best | CIFAR100: Rand | CIFAR100: FrugalMCT | CIFAR100: OCCAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 2.22 | 2.86 | 0.97 | 0.56 | 3.18 | 3.29 | 0.52 | 0.34 |
| 20 | 2.22 | 2.86 | 1.13 | 0.50 | 3.18 | 3.29 | 0.79 | 0.36 |
| 40 | 2.22 | 2.86 | 1.22 | 0.51 | 3.18 | 3.29 | 1.98 | 0.62 |

| Cost Reduction (%) | Tiny-ImageNet-200: Single Best | Tiny-ImageNet-200: Rand | Tiny-ImageNet-200: FrugalMCT | Tiny-ImageNet-200: OCCAM | ImageNet-1K: Single Best | ImageNet-1K: Rand | ImageNet-1K: FrugalMCT | ImageNet-1K: OCCAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 4.01 | 7.03 | 0.86 | 0.17 | 2.53 | 5.98 | 0.59 | 0.51 |
| 20 | 4.01 | 7.03 | 1.49 | 0.61 | 2.53 | 5.98 | 1.12 | 1.05 |
| 40 | 4.01 | 7.03 | 3.88 | 2.75 | 2.53 | 5.98 | 2.35 | 2.24 |

Table 2: Cost reduction vs. accuracy drop by OCCAM and baselines. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries. For example, on Tiny ImageNet, using SwinV2-B to classify all $10,000$ test images achieves an accuracy of $82.5\%$ and incurs a total cost of $\$0.519$ (we take it as the normalized cost 1, see Table 1). A $10\%$ cost reduction equals a cost budget of $\$0.467$ (i.e., a normalized cost 0.9), under which we evaluate the achieved accuracy of OCCAM and all baselines and report the relative accuracy drops.

We investigate the test accuracy achieved by OCCAM and all baselines under different cost budgets and depict the results in Figure 3. By trading little to no accuracy drop, OCCAM achieves significant cost savings and outperforms all baselines across a majority of experiment settings. Results on cost reduction vs. accuracy drop for all approaches are summarized in Table 2, where cost reduction and accuracy drop are computed w.r.t. using the strongest model (i.e., SwinV2-B) for all queries. On the easy classification task (CIFAR-10, $10$ classes), OCCAM consistently outperforms all baselines, achieving up to $40\%$ cost reduction with at most $0.56\%$ accuracy drop. On the moderate classification task (CIFAR-100, $100$ classes), OCCAM outperforms all baselines by trading up to $0.62\%$ accuracy drop for $40\%$ cost reduction. On the hard classification task (Tiny ImageNet, $200$ classes), OCCAM significantly outperforms all three baselines with at least $0.5\%$ higher accuracy. Notably, in aggressive cost regimes (e.g., $40\%$ cost reduction), the achieved accuracy of OCCAM is $1.1\%$ higher than FrugalMCT, $4.3\%$ higher than random, and $1.3\%$ higher than single best. On the most challenging classification task (ImageNet-1K, $1000$ classes), OCCAM still consistently outperforms all three baselines with higher accuracy at all cost budget levels. We believe that the above results demonstrate the general effectiveness of OCCAM in achieving non-trivial cost reduction for a small accuracy drop on classification tasks of different difficulty levels.
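The arithmetic behind these comparisons can be reproduced with a short sketch. It takes the SwinV2-B cost on Tiny ImageNet from the Table 2 caption and the Tiny ImageNet accuracy drops reported for the $40\%$ cost reduction setting, and it assumes that accuracy gaps are simple differences of the reported drops. The helper name `budget_for_reduction` is illustrative.

```python
def budget_for_reduction(full_cost, reduction_pct):
    """Cost budget left after reducing the full-model cost by reduction_pct percent."""
    return full_cost * (1.0 - reduction_pct / 100.0)


# Cost of classifying all Tiny ImageNet test images with SwinV2-B (Table 2 caption).
full_cost = 0.519
budget_10 = budget_for_reduction(full_cost, 10)

# Tiny ImageNet accuracy drops (%) at 40% cost reduction, as reported in Table 2.
drops_at_40 = {"Single Best": 4.01, "Rand": 7.03, "FrugalMCT": 3.88, "OCCAM": 2.75}

# Accuracy advantage of OCCAM over each baseline, assumed to be the difference in drops.
gains = {
    name: round(drop - drops_at_40["OCCAM"], 1)
    for name, drop in drops_at_40.items()
    if name != "OCCAM"
}
```

Rounded to three decimals, `budget_10` matches the $\$0.467$ budget quoted in the caption, and `gains` recovers the gaps over FrugalMCT, random, and single best stated above.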

5.3 Validation Results

We validate that OCCAM functions as intended, that is, it selects small-yet-profitable classifiers when budgets are limited and gradually switches to large-but-accurate classifiers as cost budgets increase. In Figure 4(a) we plot the model usage for each classifier under different cost budgets on the Tiny ImageNet dataset. When cost budgets are restricted, OCCAM mainly chooses ResNet-18 to resolve queries given its low price and good accuracy (as seen in Table 1 and Figure 1(a)). As budgets increase, OCCAM gradually switches to SwinV2-S and SwinV2-B, given their predominantly high accuracy ( $82\%$ as seen in Figure 1(a)).

5.4 Stability Analysis

OCCAM pre-computes $K$ labelled samples, each of size $s$ , to estimate the test accuracy at inference time. We investigate OCCAM performance with different total sample sizes ( $K\cdot s$ ) by setting $s=1000$ and changing $K$ from $10$ to $40$ (see Figure 4(b)). We report results on the Tiny ImageNet dataset. In Figure 4(b), we plot the achieved accuracy of OCCAM under different total sample sizes ( $K\cdot s$ ) and normalized cost budgets ( $B$ ). We also report FrugalMCT performance using a maximum of $40,000$ sampled images to train its accuracy predictor. With budget $B=0.8$ (i.e., 20% cost reduction), OCCAM achieves comparable performance to FrugalMCT with only $25\%$ of the samples and continues to outperform FrugalMCT as the total sample size increases. With budget $B=0.6$ (i.e., 40% cost reduction), OCCAM achieves $0.7\%$ higher accuracy than FrugalMCT with only $25\%$ of the samples and up to $1.3\%$ higher accuracy as the total sample size increases, which demonstrates the sustained effectiveness of OCCAM even with limited samples.

6 Discussion and Conclusion

Motivated by the need to optimize the classifier assignment to different image classification queries with pre-defined cost budgets, we have formulated the optimal model portfolio problem and proposed a principled approach, Optimization with Cost Constraints for Accuracy Maximization (OCCAM), to effectively deliver high accuracy solutions. We present an unbiased and low-variance estimator for classifier test accuracy with asymptotic guarantees, and compute an optimal classifier assignment with novel regularization techniques mitigating overestimation risks. Our experimental results on a variety of real-world datasets show that we can achieve up to 40% cost reduction with no significant drop in classification accuracy.

While we mainly demonstrate the effectiveness of OCCAM on the image classification task, we argue that OCCAM is a generic approach to solving a wide range of classification problems carried out by various ML classifiers. We identify the following possible extensions: (i) Extension to other classification tasks. At the heart of our approach is the requirement that the classification task is well separated (see Section 4.1), meaning intuitively that instances (e.g., images) of the problem should not sharply change their class labels under minor modification. A wide range of classification problems (e.g., sentiment analysis in NLP) appear to naturally satisfy this precondition. The challenge is how to choose the most suitable numeric representation so that the separation property is preserved. Recent advances in representation learning, such as contrastive learning, are likely to help. (ii) Extension to other ML classifiers/services. In addition to open-sourced models, it would be interesting to see how to apply OCCAM to online classification APIs (e.g., Google Prediction API) and to what extent it can boost accuracy with cost savings in production settings. We will explore these extensions in our future work.

References

Alghofaili et al. [2020] Y. Alghofaili, A. Albattah, and M. A. Rassam. A financial fraud detection model based on lstm deep learning technique. Journal of Applied Security Research, 15(4):498–516, 2020.
Antol et al. [2015] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
AzureML [2024] AzureML. Azure machine learning pricing, Feb. 2024. URL https://azure.microsoft.com/en-ca/pricing/details/machine-learning/.
Bartlett et al. [2017] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017.
Chang et al. [2024] J. Chang, X. Chen, and M. Wu. Central limit theorems for high dimensional dependent data. Bernoulli, 30(1):712–742, 2024.
Chen et al. [2020] L. Chen, M. Zaharia, and J. Y. Zou. Frugalml: How to use ml prediction apis more accurately and cheaply. Advances in neural information processing systems, 33:10685–10696, 2020.
Chen et al. [2022] L. Chen, M. Zaharia, and J. Zou. Efficient online ml api selection for multi-label classification tasks. In International Conference on Machine Learning, pages 3716–3746. PMLR, 2022.
CS231n [n.d.] Stanford CS231n. Tiny imagenet dataset. URL http://cs231n.stanford.edu/tiny-imagenet-200.zip.
Dalal and Triggs [2005] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005.
Dao et al. [2022] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
Ding et al. [2022] D. Ding, S. Amer-Yahia, and L. Lakshmanan. On efficient approximate queries over machine learning models. Proceedings of the VLDB Endowment, 16(4):918–931, 2022.
Ding et al. [2024] D. Ding, A. Mallick, C. Wang, R. Sim, S. Mukherjee, V. Rühle, L. V. S. Lakshmanan, and A. H. Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=02f3mUtqnM.
Dosovitskiy et al. [2020] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Elsken et al. [2019] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997–2017, 2019.
Eriksson et al. [2004] K. Eriksson, D. Estep, and C. Johnson. Lipschitz continuity. Applied Mathematics: Body and Soul: Volume 1: Derivatives and Geometry in IR 3, pages 149–164, 2004.
Fischetti and Monaci [2020] M. Fischetti and M. Monaci. A branch-and-cut algorithm for mixed-integer bilinear programming. European Journal of Operational Research, 282(2):506–514, 2020.
Gleixner et al. [2021] A. Gleixner, G. Hendel, G. Gamrath, T. Achterberg, M. Bastubbe, T. Berthold, P. M. Christophel, K. Jarck, T. Koch, J. Linderoth, M. Lübbecke, H. D. Mittelmann, D. Ozyurt, T. K. Ralphs, D. Salvagnin, and Y. Shinano. MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library. Mathematical Programming Computation, 2021. doi: 10.1007/s12532-020-00194-3. URL https://doi.org/10.1007/s12532-020-00194-3.
Gouk et al. [2021] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393–416, 2021.
Hassibi et al. [1993] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993.
He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hinton et al. [2015] G. Hinton, O. Vinyals, J. Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
Howard et al. [2017] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Huangfu and Hall [2018] Q. Huangfu and J. J. Hall. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 10(1):119–142, 2018.
Jacob et al. [2018] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
Kag et al. [2022] A. Kag, I. Fedorov, A. Gangrade, P. Whatmough, and V. Saligrama. Efficient edge inference by selective query. In The Eleventh International Conference on Learning Representations, 2022.
Kellerer et al. [2004] H. Kellerer, U. Pferschy, and D. Pisinger. The multiple-choice knapsack problem. Knapsack Problems, pages 317–347, 2004.
Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.longhoe.net/abs/1412.6980.
Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
LeCun et al. [1989] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Liu et al. [2021] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
Liu et al. [2022] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
Lowe [2004] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
Miotto et al. [2018] R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236–1246, 2018.
Qi et al. [2023] X. Qi, J. Wang, Y. Chen, Y. Shi, and L. Zhang. Lipsformer: Introducing lipschitz continuity to vision transformers. arXiv preprint arXiv:2304.09856, 2023.
Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Su et al. [2018] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is robustness the cost of accuracy?–a comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European conference on computer vision (ECCV), pages 631–648, 2018.
Szegedy et al. [2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
Tang et al. [2021] J. Tang, S. Li, and P. Liu. A review of lane detection methods based on deep learning. Pattern Recognition, 111:107623, 2021.
Urban et al. [2016] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson. Do deep convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691, 2016.
Vanhoucke et al. [2011] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. 2011.
Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Vinyals et al. [2015] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
Wang et al. [2021] R. Wang, M. B. Alazzam, F. Alassery, A. Almulihi, and M. White. Innovative research of trajectory prediction algorithm based on deep learning in car network collision detection and early warning system. Mobile information systems, 2021:1–8, 2021.
Yang et al. [2020] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, and K. Chaudhuri. A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33:8588–8601, 2020.
Zhu et al. [2020] W. Zhu, L. Xie, J. Han, and X. Guo. The application of deep learning in cancer prognosis prediction. Cancers, 12(3):603, 2020.
Zoph and Le [2016] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Appendix A Additional Experiments

A.1 Real Image Datasets Are Well Separated.

In [Yang et al., 2020], the authors have shown that many real image classification tasks comprise separated classes in RGB-valued space. In this section, we provide further empirical evidence to show that real image datasets (e.g., Tiny ImageNet) are well separated (see Definition 4.1) in different feature spaces under various metrics (Figures 5 and 6).

In Figure 5, we provide an intuitive example to illustrate that images from different classes (e.g., “goldfish” and “bullfrog”) are typically well separated by a non-zero distance. In Figure 6, we investigate the distance distribution for images of different classes from Tiny ImageNet ( $200$ classes). We observe that images of different classes are typically far from each other by a non-zero distance under different metrics (e.g., $l_{1}$ , $l_{2}$ , and $l_{\infty}$ ) in different feature spaces (e.g., image features extracted by ResNet-18, ResNet-50, and SwinV2-T).

In addition, we note that real image datasets are subject to little to no label noise. For example, on Tiny ImageNet, we investigate $40,000$ images from the training split and find only $4$ duplicate images with different class labels. We also consider more standard image datasets (see Section 5.1). It turns out that CIFAR-10 contains no label noise, CIFAR-100 contains $3$ duplicate images with different class labels (out of $20,000$ images), and the noise frequency on ImageNet-1K is $8$ out of $40,000$ images. Our observation suggests that standard image datasets are quite clean (in line with the observation in [Yang et al., 2020]), which justifies the adoption of the well-separation assumption.
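A check of this kind can be sketched as follows. The paper does not state how duplicates were detected, so this example assumes exact byte-level duplicates and groups images by a hash of their raw content; the function name and the toy data are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_label_conflicts(images, labels):
    """Return groups of byte-identical images that carry more than one label.

    images: iterable of bytes-like objects (raw image content).
    labels: iterable of class labels aligned with images.
    """
    labels_by_digest = defaultdict(set)
    for image, label in zip(images, labels):
        digest = hashlib.sha256(bytes(image)).hexdigest()
        labels_by_digest[digest].add(label)
    # Keep only the digests whose copies disagree on the class label.
    return {d: labs for d, labs in labels_by_digest.items() if len(labs) > 1}


# Toy data: the first two "images" are identical but labelled differently.
images = [b"\x00\x01\x02", b"\x00\x01\x02", b"\x03\x04\x05", b"\x03\x04\x05"]
labels = ["goldfish", "bullfrog", "goldfish", "goldfish"]
conflicts = find_label_conflicts(images, labels)
```

Near-duplicates (for example, re-encoded or resized copies) would not be caught by exact hashing and would need a perceptual or feature-space comparison instead.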

A.2 Nearest Neighbour Distance Approaches 0 As Sample Size Increases.

In this section, we conduct experiments to investigate how the nearest neighbour distance ( $dist(x,NN_{S}(x))$ ) changes as the sample size ( $s$ ) increases. We report results using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) as well as different metrics ( $l_{1}$ , $l_{2}$ , and $l_{\infty}$ ) on the validation split of the Tiny ImageNet dataset (Figure 7).

It can be clearly seen in Figure 7 that the distance to the sampled nearest neighbour quickly approaches 0 as sample size increases. This could be attributable to the fact that we are sampling from real images. With properly pre-trained feature extractors, the possible image embeddings could be restricted to a subspace rather than pervade the whole high-dimensional space, which can significantly reduce the required number of samples and give us meaningfully small distances to the sampled nearest neighbours.
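The qualitative trend can be reproduced on synthetic data. The sketch below is only an illustration under strong assumptions: it draws uniform random points in a 4-dimensional unit cube instead of real image embeddings, uses nested samples (each larger sample contains the smaller ones), and measures the mean $l_{\infty}$ distance from a set of queries to their nearest sampled neighbour. All names and sizes are arbitrary choices.

```python
import random


def linf_distance(a, b):
    """l_infinity distance between two feature vectors of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))


def mean_nn_distance(queries, sample):
    """Mean distance from each query to its nearest neighbour in the sample."""
    return sum(min(linf_distance(q, p) for p in sample) for q in queries) / len(queries)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dim = 4
pool = [[rng.random() for _ in range(dim)] for _ in range(1000)]
queries = [[rng.random() for _ in range(dim)] for _ in range(50)]

# Nested samples: pool[:10] is contained in pool[:100], which is contained in pool.
sizes = [10, 100, 1000]
curve = [mean_nn_distance(queries, pool[:s]) for s in sizes]
```

Because the samples are nested, the values in `curve` cannot increase with the sample size; how quickly they shrink on real embeddings depends on the feature extractor, as Figure 7 reports.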

Another interesting observation is that, in all investigated feature spaces, $l_{\infty}$ always provides the smallest nearest neighbour distance with different sample sizes, followed by $l_{2}$ and $l_{1}$ . This ordering is consistent with the standard norm inequality, which holds for any feature vector $x$ and in particular for the normalized image features we use, where each dimension is between 0 and 1 (that is, $0\leq x[i]\leq 1$ for any $x[i]\in x$ ): $l_{\infty}(x)=\max\{|x[i]|\mid x[i]\in x\}\leq l_{2}(x)=\sqrt{\sum_{x[i]\in x}|x[i]|^{2}}\leq l_{1}(x)=\sum_{x[i]\in x}|x[i]|$ . Recall that OCCAM employs a classifier accuracy estimator which is asymptotically unbiased as the nearest neighbour distance approaches 0. The above observation suggests that $l_{\infty}$ is likely to provide a smaller nearest neighbour distance and reduce the estimation error, which leads to higher overall performance, especially in scenarios where sampling is expensive or labelled data is scarce.
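The ordering of the three norms can be checked numerically with a small sketch. The vector below is an arbitrary example standing in for a difference of two feature vectors, and the helper names are illustrative.

```python
import math


def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(t) for t in v)


def l2_norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(t * t for t in v))


def linf_norm(v):
    """Largest absolute value."""
    return max(abs(t) for t in v)


# Arbitrary example vector, e.g. the difference between a query and a neighbour.
x = [0.3, 0.0, 0.4]
norms = (linf_norm(x), l2_norm(x), l1_norm(x))
```

For this vector the three values are approximately 0.4, 0.5, and 0.7, in the order $l_{\infty}\leq l_{2}\leq l_{1}$ .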

A.3 Estimation Error Decreases As Sample Size Increases.

In this section, we investigate the estimation error (the difference between the real classifier accuracy and our estimator results) for different ML classifiers, using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T). For brevity, on Tiny ImageNet, we report the estimation error in the accuracy of all 7 classifiers (ResNet-[18, 34, 50, 101], and SwinV2-[T, S, B]) under the $l_{\infty}$ metric (Figure 8). The patterns are similar for other metrics and feature extractors.

It is clear from Figure 8 that the estimation error of our accuracy estimator continues to decrease for all ML classifiers as the sample size increases, which demonstrates the effectiveness of our accuracy estimator design (see Section 4.1).

Accuracy Drop (%) on Tiny-ImageNet-200

| Cost Reduction (%) | Single Best | Rand | FrugalMCT (ResNet-18) | FrugalMCT (ResNet-50) | FrugalMCT (SwinV2-T) | OCCAM (ResNet-18) | OCCAM (ResNet-50) | OCCAM (SwinV2-T) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 4.01 | 7.03 | 0.86 | 0.84 | 1.18 | 0.48 | 0.40 | 0.29 |
| 20 | 4.01 | 7.03 | 1.49 | 1.45 | 1.60 | 1.02 | 0.74 | 0.58 |
| 40 | 4.01 | 7.03 | 3.88 | 4.12 | 3.22 | 3.24 | 2.56 | 2.81 |

Table 3: Cost reduction vs. accuracy drop by baselines and OCCAM using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) and the $l_{\infty}$ distance metric. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

A.4 Generalizing to Different Feature Extractors

We further report the performance of OCCAM with different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) on Tiny ImageNet. As described in Section 5.1, the costs incurred by feature extraction are “deducted from the user budget before we compute the optimal model portfolio”. Results are summarized in Table 3. It can be seen that OCCAM outperforms all baselines in all experimental settings, which demonstrates the effectiveness and generalizability of OCCAM with different feature extractors.

A.5 More OCCAM Performance Results.

In this section, we provide more OCCAM performance results using $l_{1}$ and $l_{2}$ norm metrics, as shown in Figures 9 and 10. Quantitative comparison results are summarized in Tables 4 and 5, and they are consistent with our analysis in Section 5.2. Typically, by trading little to no performance drop, OCCAM can achieve significant cost reduction and outperform all baselines across a majority of experiment settings.

However, we also note that FrugalMCT can sometimes outperform OCCAM on ImageNet-1K using $l_{1}$ and $l_{2}$ metrics, while OCCAM outperforms FrugalMCT across all experiment settings using the $l_{\infty}$ metric (see Section 5.2). This could be explained by the fact that $l_{1}$ and $l_{2}$ metrics are likely to yield a larger nearest neighbour distance than the $l_{\infty}$ metric (see Section A.2), which implicitly increases the OCCAM estimation error and leads to reduced overall performance, especially when the classification task is challenging and labelled data is scarce. Given this, in practice, we recommend applying OCCAM with $l_{\infty}$ to achieve significant cost reduction with little to no performance drop (see Section 5.2).

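Accuracy Drop (%)

| Cost Reduction (%) | CIFAR10: Single Best | CIFAR10: Rand | CIFAR10: FrugalMCT | CIFAR10: OCCAM | CIFAR100: Single Best | CIFAR100: Rand | CIFAR100: FrugalMCT | CIFAR100: OCCAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 2.22 | 2.86 | 0.97 | 0.38 | 3.18 | 3.29 | 0.52 | 0.50 |
| 20 | 2.22 | 2.86 | 1.13 | 0.38 | 3.18 | 3.29 | 0.79 | 0.50 |
| 40 | 2.22 | 2.86 | 1.22 | 0.37 | 3.18 | 3.29 | 1.98 | 0.99 |

| Cost Reduction (%) | Tiny-ImageNet-200: Single Best | Tiny-ImageNet-200: Rand | Tiny-ImageNet-200: FrugalMCT | Tiny-ImageNet-200: OCCAM | ImageNet-1K: Single Best | ImageNet-1K: Rand | ImageNet-1K: FrugalMCT | ImageNet-1K: OCCAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 4.01 | 7.03 | 0.86 | 0.48 | 2.53 | 5.98 | 0.59 | 0.86 |
| 20 | 4.01 | 7.03 | 1.49 | 1.02 | 2.53 | 5.98 | 1.12 | 1.51 |
| 40 | 4.01 | 7.03 | 3.88 | 3.24 | 2.53 | 5.98 | 2.35 | 3.32 |

Table 4: Cost reduction vs. accuracy drop by OCCAM and baselines using ResNet-18 features and the $l_{1}$ distance metric. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.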

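Accuracy Drop (%)

| Cost Reduction (%) | CIFAR10: Single Best | CIFAR10: Rand | CIFAR10: FrugalMCT | CIFAR10: OCCAM | CIFAR100: Single Best | CIFAR100: Rand | CIFAR100: FrugalMCT | CIFAR100: OCCAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 2.22 | 2.86 | 0.97 | 0.24 | 3.18 | 3.29 | 0.52 | 0.34 |
| 20 | 2.22 | 2.86 | 1.13 | 0.25 | 3.18 | 3.29 | 0.79 | 0.40 |
| 40 | 2.22 | 2.86 | 1.22 | 0.27 | 3.18 | 3.29 | 1.98 | 0.71 |

| Cost Reduction (%) | Tiny-ImageNet-200: Single Best | Tiny-ImageNet-200: Rand | Tiny-ImageNet-200: FrugalMCT | Tiny-ImageNet-200: OCCAM | ImageNet-1K: Single Best | ImageNet-1K: Rand | ImageNet-1K: FrugalMCT | ImageNet-1K: OCCAM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 4.01 | 7.03 | 0.86 | 0.21 | 2.53 | 5.98 | 0.59 | 1.06 |
| 20 | 4.01 | 7.03 | 1.49 | 0.81 | 2.53 | 5.98 | 1.12 | 1.65 |
| 40 | 4.01 | 7.03 | 3.88 | 2.75 | 2.53 | 5.98 | 2.35 | 3.10 |

Table 5: Cost reduction vs. accuracy drop by OCCAM and baselines using ResNet-18 features and the $l_{2}$ distance metric. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.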

Appendix B Proofs

In this section, we provide proofs of Lemmas 4.3, 4.4, 4.5 and 4.6.

Proof of Lemma 4.3

Proof.

The proof is straightforward. Without loss of generality, we consider the $l_{1}$ metric and assume $(\mathcal{X},l_{1})$ is an $r$ -separated metric space. For brevity, we abuse the notation and let $O(x)$ denote the one-hot output distribution over all labels. For any $x,x^{\prime}\in\mathcal{X}$ , if $x$ and $x^{\prime}$ belong to the same class, then $\|O(x)-O(x^{\prime})\|_{1}=0\leq\frac{2}{r}\cdot\|x-x^{\prime}\|_{1}$ ; otherwise, $\|O(x)-O(x^{\prime})\|_{1}=2\leq\frac{2}{r}\cdot\|x-x^{\prime}\|_{1}$ , since $r$ -separation guarantees $\|x-x^{\prime}\|_{1}\geq r$ for instances of different classes. Hence $\frac{2}{r}$ is a Lipschitz constant for $O$ . ∎

Proof of Lemma 4.4

Proof.

Similarly, without loss of generality, we consider the $l_{1}$ metric and let $f_{i}(x)$ and $O(x)$ denote the corresponding output distributions over all labels. Let $L_{i}$ and $L_{O}$ denote the Lipschitz constants for $f_{i}(x)$ and $O(x)$ respectively. For any $x,x^{\prime}\in\mathcal{X}$ , if $x$ and $x^{\prime}$ belong to the same class (so that $O(x)=O(x^{\prime})$ ), then $\|SP_{i}(x)-SP_{i}(x^{\prime})\|_{1}=|f_{i}(x)[O(x)]-f_{i}(x^{\prime})[O(x)]|\leq\|f_{i}(x)-f_{i}(x^{\prime})\|_{1}\leq L_{i}\cdot\|x-x^{\prime}\|_{1}$ ; otherwise, $\|SP_{i}(x)-SP_{i}(x^{\prime})\|_{1}=|f_{i}(x)[O(x)]-f_{i}(x^{\prime})[O(x^{\prime})]|\leq 1=\frac{1}{2}\|O(x)-O(x^{\prime})\|_{1}\leq\frac{L_{O}}{2}\cdot\|x-x^{\prime}\|_{1}$ . Hence $\max\{L_{i},\frac{L_{O}}{2}\}$ is a Lipschitz constant for $SP_{i}(x)$ . ∎

Proof of Lemma 4.5

Proof.

The proof leverages the fact that, as the sample size increases, the expected distance between $x$ and its nearest neighbour monotonically decreases. Letting $L_{i}$ denote the Lipschitz constant of $SP_{i}$ , we bound the estimation error as $|\mathbb{E}[SP_{i}(NN_{S}(x))]-SP_{i}(x)|\leq\mathbb{E}[|SP_{i}(NN_{S}(x))-SP_{i}(x)|]\leq L_{i}\cdot\mathbb{E}[dist(NN_{S}(x),x)]$ , which approaches $0$ as $\mathbb{E}[dist(NN_{S}(x),x)]$ decreases to $0$ . ∎

Proof of Lemma 4.6

Proof.

Lemma 4.5 shows that each $SP_{i}(NN_{S_{k}}(x))$ is an unbiased estimator of $SP_{i}(x)$ , $1\leq k\leq K$ , as $s$ approaches infinity. Let $\sigma_{i}^{\prime 2}$ denote the variance of $SP_{i}(NN_{S_{k}}(x))$ for each $k$ . By the Central Limit Theorem, the distribution of the estimator $\frac{1}{K}\sum_{k=1}^{K}SP_{i}(NN_{S_{k}}(x))$ approaches a normal distribution with variance $\frac{\sigma_{i}^{\prime 2}}{\sqrt{K}}$ [Chang et al., 2024]. ∎
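The qualitative effect of averaging over $K$ samples can be illustrated with a small simulation. This sketch makes simplifying assumptions that the lemma itself does not: each per-sample estimate is modelled as an independent Bernoulli draw with an arbitrary success probability of 0.8, standing in for $SP_{i}(NN_{S_{k}}(x))$ , and only the trend of the empirical variance is examined. The function name and all constants are illustrative.

```python
import random
import statistics


def averaged_estimates(k, trials, p=0.8, seed=0):
    """Simulate `trials` averaged estimates, each the mean of k Bernoulli(p) draws."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        draws = [1.0 if rng.random() < p else 0.0 for _ in range(k)]
        results.append(sum(draws) / k)
    return results


# Empirical variance of the averaged estimate for several values of K.
variances = {
    k: statistics.pvariance(averaged_estimates(k, trials=2000))
    for k in (1, 10, 40)
}
```

In this toy setting the empirical variance shrinks as $K$ grows, which is the behaviour the averaged estimator relies on; dependence between samples, as covered by the cited central limit theorem, is not modelled here.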

Appendix C Experiment Details

C.1 Datasets

CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR-10 [Krizhevsky et al., 2009] contains $60,000$ images of resolution $32\times 32$ , evenly divided into $10$ classes, where $50,000$ images are for training and $10,000$ images are for testing. We randomly sample $20,000$ images from the training set as our validation set, and we use the remaining $30,000$ images to train our models.

CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html). Like CIFAR-10, CIFAR-100 [Krizhevsky et al., 2009] has $50,000$ training and $10,000$ testing images, but they are evenly divided into $100$ classes. We randomly sample $20,000$ training images as our validation set.

Tiny ImageNet (http://cs231n.stanford.edu/tiny-imagenet-200.zip). Tiny ImageNet [CS231n] is a subset of the ImageNet-1K dataset [Russakovsky et al., 2015]. It covers $200$ class labels and all images are in resolution $64\times 64$ . It includes $100,000$ training, $10,000$ validation, and $10,000$ testing images. The given test split does not have ground-truth labels, so we discard this set and use the validation split as our testing data. We randomly sample $40,000$ training images as the validation data and use the remaining $60,000$ to train the models.

ImageNet-1K (https://image-net.org/download.php). We use the image classification dataset in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [Russakovsky et al., 2015]. This dataset contains 1,281,167 training, 50,000 validation, and 100,000 testing images, covering 1,000 classes. Images are of various resolutions. Since the models we use are pre-trained on this dataset, we do not train the last linear layer of the models. The given test split comes without ground-truth labels; thus we use the validation split to evaluate our method and baselines. Among the 50,000 validation images, we randomly select 10,000 as our testing data and treat the remaining ones as the validation data.
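The random splits described above can be sketched with index sampling. The example uses the CIFAR-10 numbers (50,000 training images, 20,000 of which become validation data); the seed and the helper name `split_indices` are assumptions, since the paper does not specify how the random sampling was implemented.

```python
import random


def split_indices(n_total, n_validation, seed=0):
    """Randomly split range(n_total) into training and validation index lists."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    indices = list(range(n_total))
    rng.shuffle(indices)
    validation = sorted(indices[:n_validation])
    training = sorted(indices[n_validation:])
    return training, validation


# CIFAR-10: hold out 20,000 of the 50,000 training images for validation.
train_idx, val_idx = split_indices(50_000, 20_000)
```

The same helper would cover the other datasets by changing the two sizes, for example 100,000 and 40,000 for Tiny ImageNet.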

C.2 Models

We use ResNet [He et al., 2016] and Swin Transformer V2 (SwinV2) [Liu et al., 2022] models on the image classification task because they are popular models for the task and many of their weights pre-trained on the ImageNet-1K dataset [Russakovsky et al., 2015], with which reasonable performance is achieved, are available online. For example, the pre-trained models we use are from https://pytorch.org/vision/stable/models.html. Specifically, the pre-trained weights we use are as follows.

• ResNet-18: ResNet18_Weights.IMAGENET1K_V1
• ResNet-34: ResNet34_Weights.IMAGENET1K_V1
• ResNet-50: ResNet50_Weights.IMAGENET1K_V1
• ResNet-101: ResNet101_Weights.IMAGENET1K_V1
• SwinV2-T: Swin_V2_T_Weights.IMAGENET1K_V1
• SwinV2-S: Swin_V2_S_Weights.IMAGENET1K_V1
• SwinV2-B: Swin_V2_B_Weights.IMAGENET1K_V1

On CIFAR-10, CIFAR-100, and Tiny ImageNet, we freeze all pre-trained parameters and train only the last linear layer of each model from scratch. For all seven models, we use the Adam optimizer [Kingma and Ba, 2015] with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ , a constant learning rate of $0.00001$ , and a batch size of 500 for training. Models are trained until convergence.
