NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines

Jukka I. Ahonen1,3,2, Nam Le1,3,2, Honglei Zhang1, Antti Hallapuro1, Francesco Cricri1,
Hamed Rezazadegan Tavakoli1, Miska M. Hannuksela1, Esa Rahtu3
{jukka.1.ahonen, nam.le, honglei.1.zhang, antti.hallapuro, francesco.cricri,
hamed.rezazadegan_tavakoli, miska.hannuksela}@nokia.com, [email protected]
1Nokia Technologies, 3Tampere University, 2Equally contributed

Abstract

The recent progress in artificial intelligence has led to an ever-increasing usage of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed considering human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs which, while offering high performance on image coding, are not yet on par with state-of-the-art conventional video codecs and lack interoperability. The second direction considers using the Versatile Video Coding (VVC) standard or any other conventional video codec (CVC) together with pre- and post-processing operations targeting machine analysis. While the CVC-based methods benefit from interoperability and broad hardware and software support, the machine task performance is often lower than the desired level, particularly in low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieved up to $-43.20\%$ and $-26.8\%$ Bjøntegaard Delta rate reduction over VVC for image and video data, respectively, when evaluated on multiple different datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks.

Index Terms:

video coding for machines, computer vision, video coding, neural networks, hybrid codec

I Introduction

Image and video data consumed by machines have been increasing rapidly in recent years. In this paper, machines refers to any algorithm that analyzes an input image or video in order to obtain analysis results. Examples include object detection, image segmentation, instance segmentation, object tracking, person tracking, etc. The Cisco Annual Internet Report [1] estimated that by 2023 half of the internet traffic would be solely between machines. It is therefore highly desirable to compress images and videos targeted at machine consumption more efficiently than traditional codecs do, for the sake of bandwidth savings. Accordingly, the Video Coding for Machines (VCM) Ad-hoc group of the Moving Picture Experts Group (MPEG) [2], as well as the JPEG-AI group of JPEG [3], have been actively investigating new technologies for machine-oriented image and video coding standardization. In this regard, the VCM group lists some of the most important use cases in one of their documents [4]: surveillance, intelligent transportation, smart cities, intelligent industry, intelligent content, and consumer electronics, all of which demand efficient image and video codecs specifically tailored for machines. While existing state-of-the-art codecs such as High Efficiency Video Coding (HEVC) [5] or Versatile Video Coding (VVC) [6] may be used for machine vision tasks, they were ultimately developed to optimize the compression gains for humans as the end user and are not the optimal solution when the end user is a machine. Responses to the call for evidence (CfE) and to the call for proposals (CfP) of MPEG VCM have shown new technologies that compress images and videos targeted for machine consumption much more efficiently than traditional video codecs, such as VVC, that are optimized for human viewing.

In this paper, we present a complete system for compressing images and videos for machines. The proposed system was submitted to the MPEG VCM as a CfP response and it is being studied as a prominent candidate. We combine an end-to-end self-supervisedly learned intra-frame codec with a conventional inter-frame codec, in order to leverage the benefits of these two approaches. Thanks to this combination, we are able to achieve substantial coding gains over VVC in all tested datasets for machine consumption.

The paper is organized as follows: Section II reviews prior work on learned codecs and, more generally, on video codecs for machines; Section III describes the details of the proposed codec; Section IV provides information on the experimental setup and results, including ablation studies. The conclusions are announced for a Section V.

II Related work

Since the rise of end-to-end (E2E) learned image codecs [7, 8, 9, 10, 11, 12, 13] that rival or outperform the state-of-the-art traditional codecs HEVC [5] and VVC [14] in terms of rate–distortion trade-off according to many quality metrics and coding conditions, video coding with neural network (NN) based components has been an attractive research topic. End-to-end learned video codecs have been explored for the possibility they would inherit the success of the learned image codecs. Agustsson et al.[15] introduced scale field as an additional flow field dimension for more flexibility in motion compensation. Furthermore, in [16] and [17], motion compensation is handled by conditional autoencoder and Transformer [18]-based components, respectively. In a different aspect, the authors of [19, 20] seek to enhance the traditional codecs with the aid of NN-based modules, in particular for frame prediction. In comparison, our proposed method instead aims to harmonize the procedures of state-of-the-art conventional video codec and E2E learned image codec to achieve consistent gains in a wide range of test cases and input data.

Similar to the achievements of neural networks in the codecs for human vision, superiority against traditional codecs has been observed in the field of Image coding for Machine vision (ICM). Le et al.[21, 22] proposed an E2E learned image codec and domain adaptation techniques that save almost half of the bitstream size compared to the state-of-the-art video codec VVC. The authors of [23, 24] proposed rate-distortion optimization methods for the VVC to achieve further coding gains. On the other hand, for many applications, offloading the computational demands on the cloud is crucial. To achieve that, instead of compressing images and videos directly, the devices on the video-acquiring side may extract intermediate features from the input pictures, compress the features into a bitstream, and send the bitstream to the cloud for further analysis. This compression technique is known as “feature compression” for machines. Yamazaki et al.[25] presented an E2E learned system for feature compression, while [26, 27, 28] proposed scalable image feature coding schemes for human and machine visions. With the advantage of optimizing the codec with the targeting task network, these coding methods achieve significant gains over the VVC codec. The authors of [29, 30] extended the scalable feature coding approach to the video domain. Our work, in contrast to “feature compression”, operates on the picture domain, i.e., it encodes and decodes pictures and task networks take pictures as their input.

Another related topic to enhance the coding efficiency is NN-based filters. Generally, they can be divided into two categories, in-loop filters [31, 32, 33] and post-processing filters [34, 35, 36]. An in-loop filter is located inside the codec and processes an input picture to generate an enhanced picture which is often used as a reference for other pictures. A post-processing filter is located after the main decoder and enhances the reconstructed output picture. In order to make the NN filters adaptive to the input content, authors in [37] first trained an NN-based post-processing filter. At the inference stage, the pretrained filter is finetuned by overfitting the bias terms of the decoder by minimizing the rate-distortion loss given an input content. In addition to the finetuning concept, the authors in [38, 39] proposed to finetune scaling factors that determine the strength of the filtering. Authors of [40, 41] proposed using additional information such as quantization parameters for modulating the input features of the NN-layers during adaptation.

III Proposed method

III-A The NN-VVC system

Figure 1: The NN-VVC coding system. Light blue indicates a neural network component.

End-to-end learned video coding targeting human viewing has been the subject of intense research in recent years [15, 16, 17, 29]. However, despite the impressive progress over just a few years, end-to-end learned video codecs are still not able to outperform the latest traditional codecs, such as VVC [14]. On the other hand, end-to-end learned image codecs have been able to surpass the coding efficiency of these tools by large margins for both human consumption (in some specific settings, such as when using RGB color space and/or the multiscale structural similarity index quality metric) [42, 12, 13, 43, 44, 20, 11] and machine vision consumption [21, 28]. For this reason, we propose to harness the capability of Learned Image Codec (LIC) and the mature, widely adopted techniques of Conventional Video Codec (CVC) tools with a hybrid system that can deliver all-around higher coding performance for machine consumption against the state-of-the-art video codec VVC. We refer to our proposed hybrid codec as NN-VVC.

In NN-VVC, the LIC is used to perform intra-frame coding. For inter-frames, it takes advantage of the well-developed traditional coding tools of VVC, using the LIC-coded frames as the reference pictures. However, VVC may also be used to encode intra-frames in some cases (this is referred to as fallback mode, more information in the next sections). As shown in Fig. 1, at encoding time, the intra-frames are coded by the LIC. The Intra Human Adapter (IHA) then processes the LIC-reconstructed intra-frames to obtain filtered reconstructed intra-frames that are better suited as reference pictures for VVC (subsection III-C). The filtered reference frames are used by a CVC encoder to code the inter frames in a lossy fashion. Finally, the bitstream multiplexer (muxer) merges the intra-bitstream to the CVC bitstream, resulting in the VCM bitstream for transmission.

At decoding time, the VCM bitstream is decomposed to intra and inter-bitstreams using a bitstream demuxer. The intra-bitstreams are decoded by the LIC decoder, thus obtaining reconstructed intra-frames, which are then given as inputs to IHA. The outputs of IHA are used as reference frames to decode inter frames of the inter-bitstream. This can be achieved by modifying a standard CVC decoder to input decoded reference frames, which has the disadvantage that legacy CVC decoder implementations cannot be used as such. Another possibility, which is enabled by the lossless coding capability of state-of-the-art video codecs, such as VVC, is depicted in Fig. 1 and enables the use of a CVC decoder without modifications. The outputs of IHA are losslessly encoded by a CVC encoder to produce the CVC intra-bitstream. From there, a CVC-compliant bitstream is obtained by multiplexing the CVC intra-bitstream and the inter-bitstream using a bitstream muxer. The CVC decoder then decodes the CVC-compliant bitstream. The output inter frames are further enhanced for task performance with an Inter Machine Adapter (IMA - subsection III-D). Finally, the video for machine consumption is formed based on the intra-frames and the enhanced inter-frames.
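The decoding flow described above can be sketched in a few lines. This is a minimal illustration only, assuming the variant in which the CVC decoder accepts reference frames directly; the function name `decode_vcm`, the callable parameters, and the `Frame` placeholder type are all hypothetical and not part of the proposed system. It also assumes that the intra-frames delivered to the task network are the LIC reconstructions before IHA filtering, which the text does not state explicitly.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[float]  # placeholder frame type, used only for this sketch


def decode_vcm(
    intra_bitstreams: Sequence[bytes],
    inter_bitstream: bytes,
    lic_decode: Callable[[bytes], Frame],
    iha: Callable[[Frame], Frame],
    cvc_decode_inter: Callable[[bytes, List[Frame]], List[Frame]],
    ima: Callable[[Frame], Frame],
) -> Dict[str, List[Frame]]:
    """Hypothetical decoder-side flow; every component is injected as a callable."""
    # 1. The LIC decoder reconstructs the intra-frames.
    intra_frames = [lic_decode(b) for b in intra_bitstreams]
    # 2. The IHA filters them into reference pictures for the CVC.
    references = [iha(f) for f in intra_frames]
    # 3. The CVC decodes the inter-frames against those references.
    inter_frames = cvc_decode_inter(inter_bitstream, references)
    # 4. The IMA enhances the decoded inter-frames for machine tasks.
    enhanced = [ima(f) for f in inter_frames]
    # Assumption: the unfiltered LIC output is what the task network receives.
    return {"intra": intra_frames, "inter": enhanced}
```

Passing the components as callables keeps the sketch independent of any particular codec implementation; the lossless re-encoding variant would only change how `cvc_decode_inter` obtains its references.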

III-B Self-supervisedly Learned Image Codec (LIC)

The superiority of the ICM systems over VVC in [21, 22, 28] motivates us to replace the intra coding in VVC with a learned ICM codec in order to get better machine task performance. We use the self-supervised image coding for machines system proposed in [45], where the coding system is trained using a task network without annotations for the training data. More specifically, this system comprises a convolutional neural network (CNN) based encoder, a CNN-based decoder, a CNN-based probability model, and an Asymmetric Numeral Systems (ANS) entropy codec [46]. Fig. 2 shows an overview of the LIC. The input image $\boldsymbol{x}$ is transformed by the encoder $\boldsymbol{E}$ (parametrized by $\boldsymbol{\theta}_{\boldsymbol{E}}$) to a latent tensor $\boldsymbol{y}=\boldsymbol{E}(\boldsymbol{x};\boldsymbol{\theta}_{\boldsymbol{E}})$, which is then quantized and compressed to a bitstream by the entropy encoder. At decoding time, the entropy decoder decompresses the bitstream to the quantized latent tensor $\boldsymbol{\hat{y}}$. Next, $\boldsymbol{\hat{y}}$ is dequantized and restored to the image domain by the decoder $\boldsymbol{D}$ (parametrized by $\boldsymbol{\theta}_{\boldsymbol{D}}$): $\boldsymbol{\hat{x}}=\boldsymbol{D}(\boldsymbol{\hat{y}};\boldsymbol{\theta}_{\boldsymbol{D}})$. The entropy coding process requires prior distributions of $\boldsymbol{\hat{y}}$, which are provided by the progressive probability model proposed by Zhang et al. [47].

Training method: We follow the same training strategy as proposed in [21] to obtain multiple image compression model checkpoints that achieve different qualities and bitrates. The LIC is trained to minimize three quantities: bitrate estimation, task loss, and distortion loss. The bitrate (or rate for simplicity) estimation is defined as the cross-entropy between the true distribution $q_{\boldsymbol{\hat{y}}}$ of $\boldsymbol{\hat{y}}$ and its estimation $p_{\boldsymbol{\hat{y}}}$ made by the probability model:

$$\mathcal{L}_{rate}=\mathbb{E}_{\boldsymbol{\hat{y}}\sim q_{\boldsymbol{\hat{y}}}}\left[-\log_{2}p_{\boldsymbol{\hat{y}}}(\boldsymbol{\hat{y}})\right] \tag{1}$$
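As a small worked example of Equation 1, the expectation can be approximated by averaging $-\log_{2}p_{\boldsymbol{\hat{y}}}$ over the symbols that are actually coded. The sketch below is illustrative only: the function name `estimated_bits` and the probability values are made up, and a real probability model would produce one distribution per latent element.

```python
import math
from typing import Sequence


def estimated_bits(symbol_probs: Sequence[float]) -> float:
    """Sum of -log2(p) over the probabilities the model assigned to the coded symbols."""
    return sum(-math.log2(p) for p in symbol_probs)


# Made-up probabilities assigned by a hypothetical model to four latent symbols.
probs = [0.5, 0.25, 0.25, 0.125]
total_bits = estimated_bits(probs)    # 1 + 2 + 2 + 3 bits
mean_bits = total_bits / len(probs)   # sample estimate of the expectation in Eq. (1)
```

The better the model's probabilities match the symbols that occur, the smaller this sum, which is why minimizing the cross-entropy lowers the bitrate produced by the entropy coder.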

We employ the Mean-Squared Error (MSE) as the distortion loss to make the training more stable. In order to train the LIC to be task-agnostic, we use the feature-domain, multi-layer distortion loss $\mathcal{L}_{proxy}$ proposed in [45] as a proxy for the real task loss. This training technique comes with important advantages that unlock the practicality of our method. Firstly, with a surrogate loss term, the optimization objectives are not tied to any particular vision task or network architecture, therefore the codec can have significantly better generalizability to different downstream tasks. Secondly, the training data is not constrained to the availability of annotations for multiple vision tasks, which is a requirement in training with task losses. This enables self-supervised learning of the codec on a large quantity of data. Lastly, when using the proxy loss it is easier to train the LIC with small patches of images instead of full images, leading to a significant reduction in computational resource consumption. The final training objective is given as a linear combination of the loss terms:

$$\mathcal{L}_{total}=w_{rate}\mathcal{L}_{rate}+w_{mse}\mathcal{L}_{mse}+w_{task}\mathcal{L}_{task} \tag{2}$$

where $\mathcal{L}_{task}$ denotes the task loss term, for which the proxy loss $\mathcal{L}_{proxy}$ is used, and $w_{rate},w_{mse},w_{task}$ are scalar weights whose values are functions of the epoch number, given by the Loss Weighting Strategy (LWS) [21] in Equation 5. By using LWS we are able to obtain model checkpoints that offer a wide range of output bitrates.

III-C Intra human adapter (IHA)

Especially at lower bitrates, the reconstructed intra-frames coded with the LIC may contain different types of artefacts, such as the checkerboard artefacts that can be seen in Fig. 3 and were studied in [48, 45]. While these artefacts do not affect the machine task performance when coding images, they might cause a significant degradation of compression efficiency for the CVC, as the LIC-reconstructed intra-frames are used as reference frames for inter-frame prediction. To remove these artefacts, we use the IHA to enhance the LIC-reconstructed intra-frames in terms of the peak signal-to-noise ratio with respect to the corresponding uncompressed intra-frames. The IHA is formulated as $H$ in $\boldsymbol{\hat{x}}_{H}=H(\boldsymbol{\hat{x}})$, where $\boldsymbol{\hat{x}}$ and $\boldsymbol{\hat{x}}_{H}$ are the LIC-reconstructed image and the Intra Human Adapted image, respectively. The structure of the IHA is based on the enhancement filter structure proposed in [34], which is essentially a convolutional autoencoder with skip connections. The differences with respect to [34] consist of an extra skip connection from the input to the output tensor and combined Quantization Parameter (QP) and resolution injection blocks before every up- and downsampling convolutional layer. Each injection block concatenates the QP and resolution information of the processed frame and feeds the result to a simple linear layer followed by a parametric rectified linear unit (PReLU). The output is then repeated to match the spatial size of the filter's features at the point of injection and concatenated to those features.

III-D Inter machine adapter (IMA)

In order to adapt the CVC-reconstructed inter-frames $\boldsymbol{\hat{x}}_{cvc}$ to perform better on machine tasks, we use the IMA, formulated as $M$ in $\boldsymbol{\hat{x}}_{M}=M(\boldsymbol{\hat{x}}_{cvc})$, where $\boldsymbol{\hat{x}}_{M}$ is the machine-adapted inter-frame. The structure of the IMA is similar to that of the IHA, except that it does not contain the QP and resolution injections which, based on empirical evaluation, did not bring any benefits to the IMA.

III-E Fallback mode and spatial re-sampling

When coding a video with an extremely low LIC quality, even the IHA cannot suppress the LIC artefacts well enough for the CVC compression to remain efficient. To overcome this problem, we introduce the fallback mode, which is activated when a certain threshold is reached for the expected quality of LIC. In fallback mode, the whole LIC branch including the IHA is switched off and only the CVC is used to code the video (including the intra frames). The CVC by itself is able to handle the low bitrate coding efficiently and by adapting both intra- and inter-frames with fallback mode designated IMA (F-IMA), the machine task performance of the reconstructed video will be increased over the plain CVC. The structure of the F-IMA is equivalent to that of the IHA.

Since the LIC, IHA, and IMA are all trained with images having resolutions less than 1920 $\times$ 1080, to efficiently handle data that has a higher resolution, we apply a simple spatial down-sampling to input images/videos that have a resolution higher than 1920 $\times$ 1080, by using a downsampling factor of 3/4. Then, the reconstructed output of the LIC decoder and IMA are upsampled by a factor of 4/3 to restore the original resolution. Another possible option, that might be part of our future work, would be to expand the training data to include images up to 4K and 8K to achieve better performance on higher-resolution images compared to spatial resampling.
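The resampling rule can be illustrated with a short sketch. It is not taken from the implementation: the function name `coded_resolution` is hypothetical, "higher than 1920 $\times$ 1080" is interpreted here as either dimension exceeding the limit, and rounding to the nearest integer is an assumption.

```python
from fractions import Fraction
from typing import Tuple

MAX_WIDTH, MAX_HEIGHT = 1920, 1080
DOWN_FACTOR = Fraction(3, 4)   # downsampling factor stated in the text
UP_FACTOR = 1 / DOWN_FACTOR    # 4/3, applied to the reconstructed output


def coded_resolution(width: int, height: int) -> Tuple[int, int, bool]:
    """Return the resolution the codec operates on and whether resampling was applied."""
    # Assumption: resampling triggers if either dimension exceeds the limit.
    if width > MAX_WIDTH or height > MAX_HEIGHT:
        # Assumption: dimensions are rounded to the nearest integer.
        return round(width * DOWN_FACTOR), round(height * DOWN_FACTOR), True
    return width, height, False


w, h, resampled = coded_resolution(2560, 1440)
restored = (round(w * UP_FACTOR), round(h * UP_FACTOR)) if resampled else (w, h)
```

Using exact fractions avoids accumulating floating-point error, so upsampling by 4/3 recovers the original dimensions whenever they are divisible by 4.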

III-F Adapter training

To train the different types of adapters (namely IHA, IMA, F-IMA), for each adapter type, we use the following training loss with different proxy loss weights $w_{proxy_{A}}$ :

$$\mathcal{L}_{total_{A}}=\mathcal{L}_{mse_{A}}+w_{proxy_{A}}\mathcal{L}_{proxy_{A}} \tag{3}$$

$$\mathcal{L}_{mse_{A}}=\frac{1}{N}\sum_{i=1}^{N}\lVert\boldsymbol{x}_{1}-\boldsymbol{\hat{x}}_{2}\rVert^{2} \tag{4}$$

where $\boldsymbol{x}_{1}$ is the uncompressed frame, and $\boldsymbol{\hat{x}}_{2}$ is the adapted output frame. For the IHA, we set $w_{proxy_{A}}$ to 0, resulting in the use of only $\mathcal{L}_{mse_{A}}$. For both the IMA and the F-IMA, we set $w_{proxy_{A}}$ to a positive scalar, which enables the use of $\mathcal{L}_{proxy_{A}}$ in training. $\mathcal{L}_{proxy_{A}}$ is defined as a proxy loss similar to the one used in the LIC training, but the backbone part of the maskrcnn_resnet50_fpn model by torchvision [49] is used to extract the features from the adapted and uncompressed frames. The pre-trained models can be found at https://pytorch.org/docs/stable/torchvision/models.html.
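The role of the proxy weight in Equation 3 can be seen in a toy computation. The sketch below is a simplification under stated assumptions: frames and feature maps are flat lists of numbers, the proxy loss is taken to be the sum of per-layer feature MSEs (the exact form in [45] is not reproduced here), and the names `mse` and `adapter_loss` are hypothetical. The weights 0, 0.1 and 0.015 are the values reported for the IHA, IMA and F-IMA in the training description.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def adapter_loss(
    frame: Sequence[float],
    adapted: Sequence[float],
    feats: Sequence[Sequence[float]],
    feats_adapted: Sequence[Sequence[float]],
    w_proxy: float,
) -> float:
    """L_total_A = L_mse_A + w_proxy_A * L_proxy_A (Eq. 3), with an assumed proxy form."""
    # Assumption: proxy loss = sum over feature layers of the feature-domain MSE.
    proxy = sum(mse(f, g) for f, g in zip(feats, feats_adapted))
    return mse(frame, adapted) + w_proxy * proxy


# Toy inputs: pixel MSE is 1.0 and the single-layer feature MSE is 4.0.
x, x_hat = [0.0, 0.0], [1.0, 1.0]
f, f_hat = [[0.0]], [[2.0]]
loss_iha = adapter_loss(x, x_hat, f, f_hat, w_proxy=0.0)     # MSE only
loss_ima = adapter_loss(x, x_hat, f, f_hat, w_proxy=0.1)     # MSE + 0.1 * proxy
loss_fima = adapter_loss(x, x_hat, f, f_hat, w_proxy=0.015)  # MSE + 0.015 * proxy
```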

III-G Bitrate control mechanism

One of the most important aspects of any video codec is the ability to control the size of the produced bitstream and the corresponding output quality. We combine two techniques to achieve an efficient bitrate control mechanism. Several LIC models were trained to achieve different intra-frame bitrates as described in subsection III-B. Specifically, six different LIC models were trained to achieve bitrates similar to those of the CVC when configured to operate with six different QPs in the simulation conditions. When an intra-frame is to be coded at a certain target bitrate, we select the LIC model that achieves the closest bitrate to the target bitrate, based on a look-up table. For example, if we want to code a certain intra-frame to produce a bitstream size close to that of a CVC-encoded intra-frame with QP 33, and the closest LIC model corresponds to QP 32, that model is used to perform the encoding. When performing video coding, we set a target $QP_{inter}\in[0,63]$ for the CVC to code the inter-frames, while the intra-frames are coded with a $-5$ offset: $QP_{intra}=QP_{inter}-5$. If more compression is needed, the QP of the CVC can be increased.
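The model selection can be sketched as follows. This is a hedged illustration rather than the actual look-up table: it selects the LIC model by nearest equivalent QP instead of by measured bitrate, it takes the six equivalent QPs from the model training description, and it interprets the fallback condition "LIC QP $>$ 49" from the same description as applying to $QP_{intra}$. The function name `plan_intra_coding` is hypothetical.

```python
from typing import Optional, Tuple

LIC_MODEL_QPS = [22, 27, 32, 37, 42, 47]  # equivalent QPs of the six LIC checkpoints
INTRA_QP_OFFSET = -5                      # QP_intra = QP_inter - 5
FALLBACK_QP_THRESHOLD = 49                # assumption: compared against QP_intra


def plan_intra_coding(qp_inter: int) -> Tuple[int, Optional[int]]:
    """Return (QP_intra, LIC model QP), or (QP_intra, None) when fallback mode applies."""
    if not 0 <= qp_inter <= 63:
        raise ValueError("QP_inter must lie in [0, 63]")
    qp_intra = qp_inter + INTRA_QP_OFFSET
    if qp_intra > FALLBACK_QP_THRESHOLD:
        # Fallback mode: the LIC branch is switched off and the CVC codes intra-frames.
        return qp_intra, None
    # Simplification: nearest equivalent QP stands in for nearest measured bitrate.
    model_qp = min(LIC_MODEL_QPS, key=lambda q: abs(q - qp_intra))
    return qp_intra, model_qp
```

With `qp_inter = 38` the intra target is QP 33 and the sketch picks the QP 32 model, matching the example in the text.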

III-H Bit-exact reconstruction

By default, convolutional layers in common NN-based systems operate in the floating point domain. When executed in different computing environments, the results from these operations may be different. In critical situations, such as for the components of the probability model, the discrepancies lead to a total corruption of the decoded data. Therefore, to make a codec useful in the real world, it is crucial to make sure some operations produce the same results regardless of the processing environments. In order to achieve bit-exactness in different computing environments, we perform the convolutional operations of critical components in the quantized domain, as proposed in [50]. It is required that the quantized convolutions are applied in the decoder and probability model of LIC, and IHA. Quantized convolutions are optional for IMA and F-IMA, but when applied, the decoding results will be deterministic in different environments, with negligible effects on the coding performance. All the experimental results that are shown in subsection IV-B use the quantized convolutions only on LIC and IHA.
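The motivation can be reproduced with ordinary arithmetic. The sketch below is only an illustration of why integer accumulation is order-independent while floating-point accumulation is not; it is not the quantization scheme of [50], and the fixed-point scale and the helper names `quantize` and `int_dot` are assumptions.

```python
from typing import Sequence

# Floating-point addition is not associative, so the accumulation order chosen
# by a particular backend or device can change the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
float_orders_agree = (a == b)  # False with IEEE 754 double precision

SCALE = 1 << 8  # hypothetical fixed-point scale (8 fractional bits)


def quantize(v: float) -> int:
    """Map a real value to an integer; done once, before any accumulation."""
    return round(v * SCALE)


def int_dot(xs: Sequence[float], ws: Sequence[float], order: Sequence[int]) -> int:
    """Integer multiply-accumulate in a given order; integer sums are exact."""
    acc = 0
    for i in order:
        acc += quantize(xs[i]) * quantize(ws[i])
    return acc


xs, ws = [0.1, 0.2, 0.3], [0.7, -0.4, 0.9]
forward = int_dot(xs, ws, [0, 1, 2])
backward = int_dot(xs, ws, [2, 1, 0])  # same value regardless of order
```

Because the probability model drives the entropy decoder, even a one-bit difference in its output desynchronizes decoding, which is why exact agreement rather than approximate agreement is required there.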

IV Experiments

IV-A Model training

LIC training: We applied similar training techniques with the Loss Weighting Strategy (LWS) as proposed in [21]. On top of that, in order to get a better starting point, we trained the LIC model in the first phase with a weighted summation of $\mathcal{L}_{rate}$ and $\mathcal{L}_{mse}$ . The training data preparation for the first phase was the same procedure described in [47], which uses a random subset of 340K images from the training set of Open Images V6 [51]. The training data for the following phases is a random subset of 6K images from the Open Images V6 train set every epoch. Additionally, we reduced the training time by using approximately half the number of epochs for the phases after warming up, which are specified by $p_{2},p_{3},p_{4}$ [26], compared to [21]. The final LWS formulation is specified by functions of the epoch number $n$ :

$$\begin{split}
w_{mse}&=1,\\
w_{task}&=\begin{cases}0,&n<p_{1}\\ 4\boldsymbol{\psi}(n-p_{1},1.01),&n\geq p_{1}\end{cases},\\
w_{rate}&=\begin{cases}0.01,&n<p_{1}\\ 0,&p_{1}\leq n<p_{2}\\ 2\boldsymbol{\psi}(n-p_{2},1.01),&p_{2}\leq n<p_{3}\\ c,&p_{3}\leq n<p_{4}\\ c+2\boldsymbol{\psi}(n-p_{4},1.02),&n\geq p_{4}\end{cases},
\end{split} \tag{5}$$

where $\boldsymbol{\psi}(x,y)=10^{-3}(y^{x}-1),p_{1}=50,p_{2}=62,p_{3}=85,p_{4}=107$ and $c=2\boldsymbol{\psi}(p_{3}-p_{2}-1,1.01)$ . We collected the 6 model checkpoints that offer close average bitrates to QPs [22, 27, 32, 37, 42, 47] of VTM 12.0 [52] (reference software of VVC). The collected checkpoints correspond to epoch numbers $n$ [68, 80, 170, 220, 270, 320], respectively. Models trained for a small number of epochs were not found to be sufficiently optimized, thus we performed an additional finetuning process of the checkpoint corresponding to epoch 68 (QP 22) for another 50 epochs with the same training settings except for fixed loss weights which could be obtained with Equation 5 where $n=68$ .
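Equation 5 translates directly into code. The sketch below transcribes the piecewise schedule using the constants given above; the function name `lws_weights` and the dictionary layout are illustrative choices, and the cases of $w_{rate}$ are evaluated in order, so the second case covers $p_{1}\leq n<p_{2}$.

```python
from typing import Dict

P1, P2, P3, P4 = 50, 62, 85, 107


def psi(x: float, y: float) -> float:
    """psi(x, y) = 10^-3 * (y^x - 1)."""
    return 1e-3 * (y ** x - 1)


C = 2 * psi(P3 - P2 - 1, 1.01)


def lws_weights(n: int) -> Dict[str, float]:
    """Loss weights of Eq. (5) for epoch number n."""
    w_task = 0.0 if n < P1 else 4 * psi(n - P1, 1.01)
    if n < P1:
        w_rate = 0.01                           # first phase: rate + MSE only
    elif n < P2:
        w_rate = 0.0                            # task loss introduced, rate term off
    elif n < P3:
        w_rate = 2 * psi(n - P2, 1.01)          # rate weight ramps up
    elif n < P4:
        w_rate = C                              # rate weight held constant
    else:
        w_rate = C + 2 * psi(n - P4, 1.02)      # faster ramp for low-rate checkpoints
    return {"mse": 1.0, "task": w_task, "rate": w_rate}
```

The constant $c$ equals the value the ramp reaches at epoch $p_{3}-1$, so $w_{rate}$ is continuous across both $p_{3}$ and $p_{4}$.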

IHA training: 30K images were randomly selected from the training split of Open Images, and encoded and decoded by using all the available LIC models. Multiple random patches of size $256\times 256$ were cropped out from each reconstructed image and the IHA was trained for 88 epochs with a batch size of 50.

IMA training: we used the training split of the BVI-DVC dataset [53] to generate the training data, which was obtained by running the NN-VVC system for target QPs [22, 27, 32, 37, 42, 47, 52] with the IMA turned off. The IMA model was trained for 50 epochs with patches of size $240\times 240$ extracted similarly to the IHA training. The training data for the F-IMA was generated by coding and reconstructing the training split of BVI-DVC using only the CVC with QPs [22, 27, 32, 37, 42, 47, 52, 57, 62]. Even though the fallback mode was used only in cases where LIC QP $>$ 49, it was empirically noted that including the training data generated with lower QPs makes the model more robust when used in the fallback mode. The F-IMA was trained for 28 epochs with batches of 96 patches of size $256\times 256$. Following the total loss defined in Equation 3, the proxy loss weight $w_{proxy_{A}}$ was set to 0, 0.1 and 0.015 for the IHA, IMA and F-IMA, respectively. The Adam optimizer [54] with a learning rate of $2\mathrm{e}{-4}$ was used for every adapter.

TABLE I: Codec performance results compared to VVC/H.266 on different vision tasks. The results are evaluated using the Bjøntegaard Delta [55] against rate (BD-rate) and task performance metric (BD-task). The scores are presented in “BD-rate $\|$ BD-task” format, where BD-task represents BD-MOTA for the TVD dataset and BD-mAP for the rest of the datasets.

Type	Dataset	Object detection	Instance segmentation	Object tracking	Encoding runtime ratio to VVC	Decoding runtime ratio to VVC
Image	Open Images [51]	-53.04 % $\|$ 4.64(*)	-51.76 % $\|$ 4.84	–	0.09	26.63
Image	TVD image [56]	-30.04 % $\|$ 3.32	-38.07 % $\|$ 3.60	–	0.09	19.92
Video	SFU AB [57]	-13.94 % $\|$ 1.48	–	–	0.53	16.75
Video	SFU C [57]	-32.76 % $\|$ 3.66	–	–	0.39	21.39
Video	SFU D [57]	-34.55 % $\|$ 3.07(*)	–	–	0.37	37.58
Video	TVD-01 [56]	–	–	-7.84 % $\|$ 0.56	0.27	33.12
Video	TVD-02 [56]	–	–	-14.31 % $\|$ 1.28(*)	0.24	31.76
Video	TVD-03 [56]	–	–	-57.38 % $\|$ 2.67	0.37	27.10

( $\ast$ ) marks the cases where the task performance scores of certain proposed datapoints are lowered to have a monotonic rate-performance curve in order to make BD-rate calculation possible. BD-task is calculated with original values.
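One way such a monotonic adjustment could be carried out is sketched below. The exact procedure is not specified in the text, so this is an assumption: scores are only ever lowered, and each point is clamped to the lowest score found at any higher bitrate. The function name `enforce_monotonic` and the data points are made up for illustration.

```python
from typing import List, Tuple


def enforce_monotonic(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Lower scores so that score is non-decreasing in bitrate.

    points: (bitrate, task score) pairs in any order.
    """
    ordered = sorted(points)          # ascending bitrate
    adjusted = []
    ceiling = float("inf")
    # Walk from the highest bitrate down, never letting a lower-rate point
    # score above any higher-rate point.
    for rate, score in reversed(ordered):
        ceiling = min(ceiling, score)
        adjusted.append((rate, ceiling))
    adjusted.reverse()
    return adjusted


# Made-up rate/score points; the 200 kbps point scores above the 300 kbps point.
curve = [(100.0, 0.30), (200.0, 0.35), (300.0, 0.33), (400.0, 0.40)]
fixed = enforce_monotonic(curve)
```

Only the offending point at 200 is changed (lowered to 0.33); as stated in the note above, BD-task is still computed from the original values.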

IV-B Evaluation setup and results

We used the same environment for all the benchmarks and evaluations in this work. Our testing hardware was an NVIDIA DGX1 machine with 8 Tesla V100 GPUs and an 80-thread Intel Xeon E5-2698 v4 CPU. We evaluated our method based on the Common Test Conditions for evaluating the Call for Proposals (CfP) responses, issued by the MPEG VCM group [58]. Following this evaluation framework, the performance of the codecs was measured for three vision tasks, on two image datasets, one of which was developed for MPEG VCM activities based on Open Images V6 [51], and on 17 video sequences (3 from the TVD dataset [56] and 14 from the SFU dataset [57]). The video sequences were categorized into classes for class-wise performance. Table I reports the performance of our codec in terms of BD-rate and BD-task [55], calculated from 6 different rate points against the VVC anchor. The BD-mAP and BD-MOTA indicate the mean average precision (mAP [59]) gain and the multiple object tracking accuracy (MOTA [60]) gain, respectively, at an equivalent bitrate. We use BD-task to refer collectively to BD-mAP and BD-MOTA. Ideally, a complete comparison would also consider other machine-oriented video codecs in addition to the state-of-the-art conventional video codec VVC. However, the codecs in [23, 30, 29] are “feature compression” methods that partially employ the task networks in their pipeline. Additionally, they were not evaluated following the MPEG common test conditions for video coding for machines (VCM CTC): in these works, the evaluation datasets and QPs differ from each other as well as from the VCM CTC. In our evaluation, we followed the VCM CTC as it represents a structured and extensive approach for evaluating codecs that target machine analysis. In addition, this enables future works that follow the VCM CTC to be easily compared to ours.

Visual quality: Targeting vision task performance over pixel-fidelity, our codec seeks to conserve the semantic features of the input. In high bitrates, both these features and pixel fidelity can be preserved at the same time. In order to demonstrate the differences in bit-allocation priority of our codec, we deliberately coded the input sequence in a low bitrate setting (QP 52), in comparison to VVC. The outputs can be seen in Fig. 3. The intra-frames show the coding artifact patterns due to the low bit budget. At this bitrate, VVC intra codec suffered from traditional coding artifacts, whereas our LIC codec suffered from the “checkerboard” patterns which are commonly found in CNN-based image codecs. The IHA then heavily attenuated the patterns, making the intra-frame a more appropriate reference for inter-coding. The investigated inter-frames in the figure depict how these patterns propagated to inter-frames coded by CVC. Note that the IMA might also introduce these checkerboard artifacts to the inter-frames. Besides the differences in coding artifact patterns, compared to the output inter-frame from VVC, more edges, better-defined shapes, and well-preserved texts in the foreground objects can be observed in the output of our NN-VVC codec. These features are critical information to many vision tasks.

Task performance benchmark: Our codec outperformed VVC by a significant margin. On average, NN-VVC achieved a BD-task gain of 4.1 and 2.12 over the anchor for the tested image and video datasets, respectively. Corresponding average BD-rate reductions were $-43.20\%$ for images and $-26.8\%$ for videos. Fig. 4 shows the rate-distortion curves from where the BD metrics were calculated. The most significant part of the gains over VVC came from the lower half of the bitrate range. Fig. 5 shows two examples of the prediction accuracy gain on TVD-03 object tracking sequence by comparing the bounding boxes detected from the inter frames reconstructed by VVC and NN-VVC. Both examples illustrated that, because of the heavy compression, the task network had difficulties predicting some of the bounding boxes for the VVC reconstructed frames. This is especially noticeable with instances that are harder to predict, such as the person further on the background in the right side of the frame, as well as most of the persons partly occluded by a tree. However, for the NN-VVC reconstructed frames, the task network was able to predict correctly as illustrated with the green bounding boxes.

TABLE II: System complexity measured by the number of multiply–accumulate (MAC) operations per pixel.

Process	Complexity (kMACs/pixel)	Number of parameters
Intra encoding	1631.31	4.3M
Intra decoding	1709.06	6.5M
IHA	163.62	792K
IMA	161.68	782K
IMA - fallback mode	163.22	792K

Complexity and coding runtime: Table II shows the complexity of every NN-based component in our system. Similar to most NN-based image codecs, the design of our intra codec (LIC) distributes the computation between the encoder and decoder fairly evenly, unlike traditional codecs such as VVC, which are designed and heavily optimized for the shortest decoding runtime. As a result, compared to VVC when tested on the same hardware and configurations as previously described for the evaluation environment, our encoder could be 2 to 10 times faster, while the decoder was 17 to 38 times slower in image and video coding, as shown in Table I. On the other hand, the decoding time of our system was 2% to 19% of the encoding time in the aforementioned video coding tests and 79% to 92% in the image coding tests, even though the LIC decoder has a higher complexity than the LIC encoder. This was because of two main reasons: i) the inheritance of the VVC decoder in our codec and ii) the use of the progressive probability model [47] in our LIC, which enabled a parallel decoding process of the intra-frames.
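To put the per-pixel figures of Table II into perspective, they can be scaled to a full frame. The sketch below is a back-of-the-envelope calculation: the $1920\times 1080$ frame size is chosen only as an example, and the per-pixel cost is assumed to scale linearly with the number of pixels. The function name `macs_per_frame` is hypothetical.

```python
# Complexity figures from Table II, in kMACs per pixel.
KMACS_PER_PIXEL = {
    "Intra encoding": 1631.31,
    "Intra decoding": 1709.06,
    "IHA": 163.62,
    "IMA": 161.68,
    "IMA - fallback mode": 163.22,
}


def macs_per_frame(kmacs_per_pixel: float, width: int, height: int) -> float:
    """Total multiply-accumulate operations for one frame, assuming linear scaling."""
    return kmacs_per_pixel * 1e3 * width * height


# Example frame size (an assumption for illustration, not a test condition).
WIDTH, HEIGHT = 1920, 1080
intra_decode_tmacs = macs_per_frame(KMACS_PER_PIXEL["Intra decoding"], WIDTH, HEIGHT) / 1e12
decoder_to_encoder = KMACS_PER_PIXEL["Intra decoding"] / KMACS_PER_PIXEL["Intra encoding"]
```

For this frame size the LIC decoder needs roughly 3.5 tera-MACs per intra-frame, about 5% more than the LIC encoder, while each adapter costs about a tenth of that.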

IV-C Ablation study

In addition to the main results, an ablation study was conducted to show the importance of each main component of the NN-VVC system for machine task performance. Specifically, LIC, IHA and IMA were tested for coding videos. Table III contains BD-rate and BD-mAP/BD-MOTA results for SFU object detection [57] and TVD object tracking [56] with different configurations. Note that the IMA in the table implies that either IMA or F-IMA was applied.

Starting from the base configuration (“No adapters”), where only LIC + CVC was used, we note that the LIC was able to introduce characteristics important to the machine tasks, especially for SFU C and TVD-03. The IMA improved the performance significantly, except for SFU AB. In that class, the LIC-coded intra images had introduced too much distortion into the reference images of the CVC, and consequently into the pictures that the IMA tries to adapt, so the adaptation might introduce even more distortion in some cases. The IHA by itself did not generally improve the machine task performance, as it was optimized only for MSE, but its importance can be seen by comparing the configuration that uses it together with the IMA against the configuration where only the IMA was used. In particular, SFU class AB, which is problematic for the IMA without the IHA, improved significantly. The general conclusion from this ablation study is that, while some of the components worked better than others on their own, they complement each other and should be used together to achieve the best machine task performance gains.
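The Average column of Table III is consistent with an unweighted mean over the six dataset classes. The short sketch below checks this for the BD-rate rows; the dictionary layout and variable names are illustrative, while the values are copied from Table III.

```python
# BD-rate (%) per configuration, in the column order of Table III:
# SFU AB, SFU C, SFU D, TVD-01, TVD-02, TVD-03.
bd_rate_rows = {
    "No adapters": [2.82, -25.50, -4.32, 8.84, 9.35, -14.94],
    "IHA": [2.04, -21.59, -2.21, 8.83, 17.70, -30.45],
    "IMA": [1.22, -31.02, -31.08, -8.95, -16.54, -58.21],
    "IHA + IMA": [-13.94, -32.76, -34.55, -7.84, -14.31, -57.38],
}

# Unweighted mean over the six classes for each configuration.
averages = {cfg: sum(vals) / len(vals) for cfg, vals in bd_rate_rows.items()}

for cfg, avg in averages.items():
    print(f"{cfg}: {avg:.2f} %")
```

The printed means match the reported averages to two decimals, and they make the complementarity argument visible: the IMA alone accounts for most of the average gain, while adding the IHA mainly repairs SFU AB.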

TABLE III: Ablation study of machine task performance with different system configurations. BD-rate is reported for both object detection and object tracking. BD-mAP is reported for object detection and BD-MOTA for object tracking.

Metric	Configuration	Object detection			Object tracking			Average
		SFU AB	SFU C	SFU D	TVD-01	TVD-02	TVD-03	
BD-rate $\downarrow$	No adapters	2.82 %	-25.50 %	-4.32 %	8.84 %	9.35 %	-14.94 %	-3.96 %
	IHA	2.04 %	-21.59 %	-2.21 %	8.83 %	17.70 %	-30.45 %	-4.28 %
	IMA	1.22 %	-31.02 %	-31.08 %	-8.95 %	-16.54 %*	-58.21 %	-24.10 %
	IHA + IMA	-13.94 %	-32.76 %	-34.55 %*	-7.84 %	-14.31 %*	-57.38 %	-26.80 %
BD-mAP $\uparrow$ BD-MOTA $\uparrow$	No adapters	-0.36	2.87	0.41	-0.87	-0.59	-0.05	0.23
	IHA	-0.24	2.44	0.06	-0.86	-1.08	0.37	0.11
	IMA	-0.18	3.52	2.98	0.74	1.40	2.61	1.84
	IHA + IMA	1.48	3.66	3.07	0.56	1.28	2.67	2.12

$\ast$ marks the cases where the task performance scores of some data points of the proposed codec were lowered to obtain a monotonic rate-performance curve, which is required for the BD-rate calculation. BD-task is calculated with the original values.
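One simple way to lower scores so that a rate-performance curve becomes monotonic is a running minimum taken from the highest-rate point downwards. The sketch below shows that idea only; the exact adjustment applied to the starred entries is not specified here, so the function name `lower_to_monotonic` and the sample scores are assumptions.

```python
import numpy as np


def lower_to_monotonic(scores_by_rate):
    """Given task scores ordered by increasing bitrate, lower (never raise)
    scores so that the curve is non-decreasing in bitrate."""
    s = np.asarray(scores_by_rate, dtype=float)
    # Running minimum from the highest-rate point back to the lowest:
    # a lower-rate point may not score above any higher-rate point.
    return np.minimum.accumulate(s[::-1])[::-1]


# Hypothetical scores: the second point scores above the third one.
print(lower_to_monotonic([10.0, 14.0, 13.0, 20.0]))
```

In this example the second score is lowered from 14 to 13 and the other points are unchanged. The result is non-decreasing rather than strictly increasing, so tied scores may still need separate handling before a polynomial fit.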

V Conclusions

In this paper, we proposed a hybrid coding system, NN-VVC, which combines the high performance of a machine-task-optimized learned image codec (LIC) and a state-of-the-art conventional video codec (CVC) conforming to the Versatile Video Coding (VVC) standard. It was shown that the characteristics important for machine tasks in the reconstructed images generated by the LIC can be transferred to the inter-frames when the LIC-encoded intra-frames are used as reference frames in the CVC encoding. Furthermore, the Intra Human Adapter (IHA) is applied to the LIC-encoded intra-frames to reduce the artefacts introduced by the LIC, resulting in more efficient inter-frame coding while keeping the machine-oriented characteristics. The decoded inter-frames are further adapted for machine consumption with a learned Inter Machine Adapter (IMA). NN-VVC showed significant coding gains over the VVC codec in terms of machine task performance at similar bitrates. Future research will focus on optimizing the NN-VVC system for both machine and human consumption.

References

[1] Cisco annual internet report (2018–2023) white paper. Accessed: Feb. 2023. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
[2] “Call for evidence for video coding for machines,” in ISO/IEC JTC 1/SC29/WG 2, m55065, Oct 2020.
[3] J. Ascenso, “JPEG AI use cases and requirements,” in ISO/IEC JTC1/SC29/WG1 M90014, Jan 2021.
[4] “Use cases and requirements for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 2 N190, April 2022.
[5] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[6] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, Aug 2021.
[7] W. Duan, K. Lin, C. Jia, X. Zhang, S. Ma, and W. Gao, “End-to-End Image Compression via Attention-Guided Information-Preserving Module,” in 2022 IEEE International Conference on Multimedia and Expo (ICME), Jul. 2022, pp. 1–6.
[8] N. Zou, H. Zhang, F. Cricri, H. Tavakoli, J. Lainema, M. Hannuksela, E. Aksu, and E. Rahtu, “L$^{2}$C – learning to learn to compress,” in 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), ser. IEEE International Workshop on Multimedia Signal Processing. IEEE, Sep. 2020, pp. 1–6.
[9] B. Li, J. Liang, and J. Han, “Variable-Rate Deep Image Compression With Vision Transformers,” IEEE Access, vol. 10, pp. 50 323–50 334, 2022.
[10] Y.-H. Ho, C.-C. Chan, W.-H. Peng, H.-M. Hang, and M. Domański, “ANFIC: Image compression using augmented normalizing flows,” IEEE Open Journal of Circuits and Systems, vol. 2, pp. 613–626, 2021.
[11] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7939–7948.
[12] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in International Conference on Learning Representations, 2018.
[13] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 10 771–10 780.
[14] Recommendation ITU-T H.266 | ISO/IEC 23090-3, “Versatile video coding,” 2020.
[15] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici, “Scale-space flow for end-to-end optimized video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8503–8512.
[16] T. Ladune and P. Philippe, “Aivc: Artificial intelligence based video codec,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 316–320.
[17] F. Mentzer, G. Toderici, D. Minnen, S.-J. Hwang, S. Caelles, M. Lucic, and E. Agustsson, “Vct: A video compression transformer,” arXiv preprint arXiv:2206.07307, 2022.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
[19] H. Choi and I. V. Bajić, “Affine Transformation-Based Deep Frame Prediction,” IEEE Transactions on Image Processing, vol. 30, pp. 3321–3334, 2021.
[20] N. Zou, H. Zhang, F. Cricri, H. R. Tavakoli, J. Lainema, E. Aksu, M. Hannuksela, and E. Rahtu, “End-to-end learning for video frame compression with self-attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 142–143.
[21] N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, and E. Rahtu, “Image coding for machines: an end-to-end learned approach,” in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 1590–1594.
[22] N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, H. R. Tavakoli, and E. Rahtu, “Learned image coding for machines: A content-adaptive approach,” in 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6.
[23] K. Fischer, F. Brand, C. Herglotz, and A. Kaup, “Video coding for machines with feature-based rate-distortion optimization,” IEEE 22nd International Workshop on Multimedia Signal Processing, p. 6, September 2020.
[24] ——, “Learning frequency-specific quantization scaling in vvc for standard-compliant task-driven image coding,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 476–480.
[25] M. Yamazaki, Y. Kora, T. Nakao, X. Lei, and K. Yokoo, “Deep Feature Compression using Rate-Distortion Optimization Guided Autoencoder,” in 2022 IEEE International Conference on Image Processing (ICIP), Oct. 2022, pp. 1216–1220.
[26] J. Seppälä, H. Zhang, N. Le, R. G. Youvalari, F. Cricri, H. R. Tavakoli, E. Aksu, M. M. Hannuksela, and E. Rahtu, “Enhancing image coding for machines with compressed feature residuals,” in 2021 IEEE International Symposium on Multimedia (ISM). IEEE, 2021, pp. 217–225.
[27] S. Chen, J. Jin, L. Meng, W. Lin, Z. Chen, T.-S. Chang, Z. Li, and H. Zhang, “A New Image Codec Paradigm for Human and Machine Uses,” Dec. 2021.
[28] H. Choi and I. V. Bajić, “Scalable Image Coding for Humans and Machines,” IEEE Transactions on Image Processing, vol. 31, pp. 2739–2754, Jan. 2022.
[29] ——, “Scalable Video Coding for Humans and Machines,” Aug. 2022.
[30] Z. Huang, C. Jia, S. Wang, and S. Ma, “HMFVC: A Human-Machine Friendly Video Compression Scheme,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2022.
[31] Y. Li, L. Zhang, and K. Zhang, “Idam: Iteratively trained deep in-loop filter with adaptive model selection,” ACM Trans. Multimedia Comput. Commun. Appl., Apr. 2022.
[32] Z. Huang, J. Sun, X. Guo, and M. Shang, “Adaptive deep reinforcement learning-based in-loop filter for vvc,” IEEE Transactions on Image Processing, vol. 30, pp. 5439–5451, 2021.
[33] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, “Content-aware convolutional neural network for in-loop filtering in high efficiency video coding,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3343–3356, 2019.
[34] J. I. Ahonen, R. G. Youvalari, N. Le, H. Zhang, F. Cricri, H. R. Tavakoli, M. M. Hannuksela, and E. Rahtu, “Learned enhancement filters for image coding for machines,” in 2021 IEEE International Symposium on Multimedia (ISM). IEEE, 2021, pp. 235–239.
[35] F. Nasiri, W. Hamidouche, L. Morin, N. Dhollande, and G. Cocherel, “Model selection cnn-based vvc quality enhancement,” in 2021 Picture Coding Symposium (PCS), 2021, pp. 1–5.
[36] I. Schiopu and A. Munteanu, “Deep learning post-filtering using multi-head attention and multiresolution feature fusion for image and intra-video quality enhancement,” Sensors, vol. 22, no. 4, 2022.
[37] Y.-H. Lam, A. Zare, F. Cricri, J. Lainema, and M. M. Hannuksela, “Efficient adaptation of neural network filter for video compression,” in Proceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 358–366.
[38] M. Santamaria, Y.-H. Lam, F. Cricri, J. Lainema, R. G. Youvalari, H. Zhang, M. M. Hannuksela, E. Rahtu, and M. Gabbouj, “Content-adaptive convolutional neural network post-processing filter,” in 2021 IEEE International Symposium on Multimedia (ISM), 2021, pp. 99–106.
[39] M. Santamaria, F. Cricri, J. Lainema, R. G. Youvalari, H. Zhang, and M. M. Hannuksela, “Content-adaptive neural network post-processing filter with nnr-coded weight-updates,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 2251–2255.
[40] C. Liu, H. Sun, J. Katto, X. Zeng, and Y. Fan, “A qp-adaptive mechanism for cnn-based filter in video coding,” in 2022 IEEE International Symposium on Circuits and Systems (ISCAS), 2022, pp. 3195–3199.
[41] Z. Huang, X. Guo, M. Shang, J. Gao, and J. Sun, “An efficient qp variable convolutional neural network based in-loop filter for intra coding,” in 2021 Data Compression Conference (DCC), 2021, pp. 33–42.
[42] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Int’l Conf on Learning Representations (ICLR), Toulon, France, April 2017.
[43] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 10 771–10 780.
[44] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 913–11 924, 2020.
[45] N. Le, H. Zhang, F. Cricri, R. G. Youvalari, H. R. Tavakoli, E. Aksu, M. M. Hannuksela, and E. Rahtu, “Bridging the gap between image coding for machines and humans,” in 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 3411–3415.
[46] J. Duda, K. Tahboub, N. J. Gadgil, and E. J. Delp, “The use of asymmetric numeral systems as an accurate replacement for huffman coding,” in 2015 Picture Coding Symposium (PCS), 2015, pp. 65–69.
[47] H. Zhang, F. Cricri, H. R. Tavakoli, E. Aksu, and M. M. Hannuksela, “Leveraging progressive model and overfitting for efficient learned image compression,” 2022.
[48] R. Zhang, “Making convolutional networks shift-invariant again,” in International conference on machine learning. PMLR, 2019, pp. 7324–7334.
[49] S. Marcel and Y. Rodriguez, “Torchvision the machine-vision package of torch,” in Proceedings of the 18th ACM international conference on Multimedia, ser. MM ’10. Association for Computing Machinery, pp. 1485–1488.
[50] H. Zhang, N. Le, F. Cricri, J. Ahonen, and H. Tavakoli, “Stabilizing the convolution operations for neural network-based image and video codecs for machines,” in 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Brisbane, Australia, 2023, pp. 170–175.
[51] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari, “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, 2020.
[52] Versatile video coding (VVC) reference software VTM. [Online]. Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM
[53] D. Ma, F. Zhang, and D. Bull, “Bvi-dvc: a training database for deep video compression,” IEEE Transactions on Multimedia, 2021.
[54] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[55] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” ITU-T Video Coding Experts Group (VCEG), 2001.
[56] W. Gao, X. Xu, M. Qin, and S. Liu, “An Open Dataset for Video Coding for Machines Standardization,” in 2022 IEEE International Conference on Image Processing (ICIP), Oct. 2022, pp. 4008–4012.
[57] H. Choi, E. Hosseini, S. Ranjbar Alvar, R. Cohen, and I. Bajić, “Sfu-hw-objects-v1: Object labelled dataset on raw video sequences,” 2020.
[58] “Common test conditions for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 04, Jan 2022.
[59] Coco evaluation. [Online]. Available: https://cocodataset.org/#detection-eval
[60] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.