License: arXiv.org perpetual non-exclusive license
arXiv:2312.14115v2 [cs.RO] 20 Mar 2024
LingoQA: Video Question Answering for Autonomous Driving

Ana-Maria Marcu*, Long Chen*, Jan Hünermann*,
Alice Karnsund*, Benoit Hanotte*, Prajwal Chidananda, Saurabh Nair,
Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski*
Wayve
[email protected]
*Equal contributions
Abstract

We introduce LingoQA, a novel dataset and benchmark for video question answering in autonomous driving. The dataset contains 28K unique scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 56.67% of the questions compared to 93.4% for humans. For evaluation, in addition to conducting a human study, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark (https://github.com/wayveai/LingoQA) and hope that it will provide a thorough evaluation platform for future vision-language models in autonomous driving.

Figure 1: LingoQA is a comprehensive benchmark for Video Question Answering in autonomous driving. Our baseline vision-language model on this benchmark can answer questions related to driving reasoning, object recognition, action justification, and scene description.

1 Introduction

Communication plays a pivotal role in naturally fostering trust among individuals. However, establishing trust between users and agents remains a significant challenge within the field of artificial intelligence. Recent studies have indicated that articulating explicit reasoning steps can significantly enhance user confidence [1], in addition to improving the capabilities of Large Language Models (LLMs) [52]. The need for textual justifications remains critical, particularly in safety-critical domains where technology adoption hinges upon this factor [29].

Consider the domain of end-to-end autonomous driving [11], where the driving policy is often executed through deep neural networks processing camera inputs to generate control commands. Recent strides in Vision-Language Models (VLMs) have solidified transformers as multimodal learners, showcasing remarkable performance in tasks such as Visual Question Answering (VQA) and underscoring their proficiency in acquiring robust representations for complex tasks [14]. Integrating VLMs into the field of autonomous driving holds the promise of enhancing user trust in these systems.

Our focus is on vision-only end-to-end autonomous driving, aiming to bridge the gap between data-driven decision-making and user trust. We introduce LingoQA, a benchmark designed for autonomous driving video QA, utilizing a novel dataset comprising more than 419k QA pairs. Distinguished by its free-form approach to questions and answers, this dataset broadens the scope of autonomous driving video QA, encompassing reasoning and action justifications. Additionally, we publish a comprehensive evaluation suite consisting of 1,000 examples. At the core of our benchmark lies a novel evaluation metric based on a learned text classifier called Lingo-Judge, inspired by GPT-Judge used in TruthfulQA [34]. We perform rigorous studies correlating automatic metrics to human preferences and find that Lingo-Judge achieves a 0.950 Spearman and 0.993 Pearson correlation coefficient, surpassing existing automated labelling techniques like METEOR [5], BLEU [40], CIDEr [49], and GPT-4 [39] on our benchmark, while being fast enough for frequent runs during training and development. The evaluation code and the weights for the classifier will be released with the paper to support robust benchmarking of video question answering in autonomous driving.

Equipped with this evaluation toolkit, we conducted a comprehensive empirical study on key components and their ablations in VLMs for autonomous driving. Our findings in Section 5 indicate that the most effective approach involves partially fine-tuning the attention layers of our vision-language model equipped with Vicuna-1.5-7B [13], on both Action and Scenery datasets. This process involves using 5 video frames over 4 seconds and a late video fusion technique. Our collective work, spanning the LingoQA benchmark, the visual instruction-tuning dataset, and the innovative evaluation metric, aims to propel the domain of language-augmented autonomous driving, laying a robust foundation for subsequent research and development endeavors. To summarise the main contributions of this paper:

  • LingoQA Benchmark: We introduce a novel benchmark for autonomous driving video QA using a learned text classifier for evaluation. It outperforms existing metrics, including GPT-4, with a Spearman coefficient of 0.950 indicating a strong correlation with human evaluation.

  • LingoQA Dataset: Our 419.9k QA pair dataset stands out with its free-form questions and answers, covering not just perception but also driving reasoning from the drivers directly, broadening the scope of autonomous driving video QA.

  • LingoQA Baseline: Through testing of various video-language components on LingoQA, we find that the most effective approach involves partially fine-tuning the attention layers of our vision-language model equipped with Vicuna-1.5-7B [13] and a late video fusion technique. We establish a new baseline for this field with an identified model combination. Example outputs from the model are shown in Figure 1.

2 Related Work

2.1 Language in Autonomous Driving

Modern autonomous vehicle software relies heavily on artificial intelligence models [6, 18, 19, 21]. This, together with the increased number of such vehicles on the road, poses a fundamental challenge in terms of interpretability in the decision-making process [4]. Understanding why a decision is made is crucial for understanding areas of uncertainty, building trust, enabling effective human-AI collaboration, and ensuring safety [54]. In a survey conducted by Partners for Automated Vehicle Education (PAVE) in 2020 [1], 60% of participants stated that they would trust AVs more if they better understood the underlying process of the system. To establish trust with the general public, the systems must be explained in a human-interpretable way, such as through language and visual explanations [4].

The field of autonomous driving has been embracing the opportunity to make driving models more trustworthy for their users using visual attention methods [25] or textual explanations [29]. The early explorations of GPT3.5 [45, 37] and GPT4-V [53] on autonomous driving scenarios show that LLMs/VLMs demonstrate superior performance in scene understanding and causal reasoning compared to existing autonomous systems. Works such as ADAPT [27] and LLM-Driver [10] propose multi-task learning frameworks for jointly predicting language and control outputs. Inspired by progress in large language models [47, 13, 60, 39], vision-language models [50, 7, 51, 57, 58, 59, 12, 3, 30, 39, 35, 15] and multi-modal transformers for robotics [17, 9, 8], our work incorporates language into autonomous driving. Closely related to our proposed baseline is DriveGPT [56], which proposes a multi-modal vision-language-action model that tokenizes videos, as well as text and control actions.

2.2 Evaluation Metrics

Progress has been relatively slow in developing vision-language models for autonomous driving, with only a few works aiming to quantitatively improve upon prior work [28, 27, 56]. A key challenge consists of automated, reproducible evaluation metrics that are highly correlated with human ratings, particularly due to the inherent complexities in assessing natural language. ADAPT [27] reports human feedback in addition to standard natural language generation metrics, while DriveGPT [56] reports ChatGPT ratings. Automated methods such as BLEU [40], METEOR [5], ROUGE [33] show weak alignment with human feedback [49]. CIDEr [49] is also based on n-gram level similarity, as opposed to capturing the correctness of an answer based on its meaning. Newer evaluation metrics using ChatGPT have shown improvement in the area of sentence understanding, while still having limitations, such as providing high scores to elaborate, eloquent, but incorrect sentences [2]. Evaluation based on human feedback is subjective and difficult to reproduce. In this work, we address this challenge by introducing a novel video QA benchmark for autonomous driving that checks for factual correctness and is highly correlated to human correctness ratings on our proposed evaluation dataset.

2.3 Datasets for Autonomous Driving

Recent advances in generative AI have been underpinned by training with increasingly large and diverse internet-scale datasets [3, 30]. This has brought to light the need for evaluation benchmarks and datasets that focus not only on specific tasks, but on reasoning areas, such as descriptive and predictive reasoning [41]. Prior works, such as the CausalVQA benchmark [32] and the Perception Test [41], a comprehensive benchmark for vision-language foundation models, probe the validity of the model representations through question answering. Autonomous driving datasets have been focused on commentary [29, 55] or constructed around existing object detection datasets [42, 16]. Datasets such as NuScenesQA [42] contain simple language outputs of on average one word per question that do not tackle the more challenging reasoning problem.

Our proposed dataset LingoQA addresses the existing gap in autonomous driving as it contains a diverse set of questions related to driving behaviours and scenery in addition to perception questions related to object presence and positioning. The evaluation dataset probes areas such as description, counting, localisation, anticipation, attention, and action justification. This dataset has the strength of being diverse with respect to the language used while being grounded in human reasoning. Examples of the complex questions and answers existent in the dataset are provided in Appendix A.

3 LingoQA Benchmark

In this section, we introduce LingoQA, a benchmark to evaluate video question-answering models for autonomous driving.

3.1 Evaluation Dataset

We collected a small, low-density dataset from in-house human labelers, creating both the questions and the answers associated with the short videos. We labeled a small portion of held-out data on 500 human-generated questions using 20+ different evaluators to obtain our test set. Since answers are subjective and noisy, we labeled them twice, making sure the same evaluator does not receive the same question twice. After that, we manually reviewed the answers for semantic disagreements and mistakes. We relabeled such samples two more times and fixed the disagreements, preferring the semantics of the majority of responders but preserving maximal variety in the responses. Finally, we condensed this into 1k high-quality answers to 500 questions, with two correct but diverse answers per question. The dataset evaluates a range of competencies, including action and justification, attention, description, localisation, identification, counting and anticipation, as shown in Figure 2.

3.2 Evaluation Metric

Evaluating open-ended textual dialogues is a challenging task. Quite often the correct answers are ambiguous, subjective, or even not attainable. The most common language-based metrics for evaluating question-answering models in autonomous driving [56, 27, 29] are BLEU [40], METEOR [5] and CIDEr [49], despite their known limitations, such as relying heavily on the n-gram frequency as opposed to the underlying meaning of the answer. To address these limitations, we set ourselves the challenge to develop an automated, non-visual evaluation method for free-form language answers from vision-language models which checks correctness independent of phrasing against a ground truth answer and which is highly correlated with human ratings.

| Metric | Pearson | Spearman | Val Acc. [%] | Time [sec] |
|---|---|---|---|---|
| Lingo-Judge | 0.993 | 0.950 | 95.0 | 10.5 |
| GPT-4 with CoT | 0.990 | 0.932 | 91.2 | 3016.0 |
| GPT-4 [39] | 0.988 | 0.941 | 90.6 | 812.4 |
| BLEU [40] | 0.881 | 0.835 | - | 0.1 |
| METEOR [5] | 0.891 | 0.876 | - | 8.0 |
| CIDEr [49] | 0.878 | 0.853 | - | 0.2 |

Table 1: Lingo-Judge Performance. Correlation with human ratings, validation accuracy, and time taken to run of our proposed LingoQA evaluation metric compared to previous language-based metrics. All metrics use textual ground truth and have no access to vision information. Further examples are presented in Appendix B.

GPT-4 based evaluation

Inspired by the G-Eval metric [36], we used GPT-4 to evaluate answers on a larger scale. Given a question and answer pair from the test set and a model’s answer, we ask GPT-4 to evaluate whether the model’s answer corresponds to a human’s answer. Notice that it does not make use of any visual input. We experimented with prompts and methods achieving good quality of judgements. We achieved the highest accuracy by employing chain-of-thought prompting where we ask GPT-4 to first come up with an evaluation strategy before grading a model’s answer. However, as shown in Table 1, this leads to increased inference time. Further details are provided in Appendix C. Unfortunately, we found GPT-4 based evaluation impractical to use as a main development and training metric due to the time required to evaluate answers on our relatively small evaluation dataset (from 13min up to 50min for a single evaluation due to the API rate limit).

Lingo-Judge

Given these limitations and inspired by TruthfulQA GPT-Judge [34], we pursued an alternative approach using a learned text classifier, dubbed Lingo-Judge, which estimates the correctness of model answers. We measure the correctness of model predictions as an accuracy using a small transformer-based text classifier that takes in a question, the human’s answer, and the model’s answer and outputs a probability that the model’s answer is correct. Please note, Lingo-Judge does not receive video input and must rely only on the supporting human’s answers. For every question, we run Lingo-Judge on all combinations of (ground-truth answer, predicted answer) and take the maximum correctness estimate, as shown in Equation 1, where S is the score per sample. We found that this recipe yields the best predictive power provided enough diversity of human answers in our evaluation dataset.

S = \max_{j \in \{0, 1\}} F_{\mathrm{Judge}}(\mathrm{prediction}, \mathrm{ground\_truth}[j])    (1)
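As an illustration, the max-over-references rule of Equation 1 can be sketched in a few lines of Python. This is a minimal sketch, not the released evaluation code: `toy_judge` is a hypothetical word-overlap stand-in for the learned classifier, and the 0.5 decision threshold used to turn scores into an accuracy is an assumption.

```python
from typing import Callable, Sequence, Tuple

# A judge maps (question, reference_answer, predicted_answer) to a
# probability in [0, 1] that the predicted answer is correct.
Judge = Callable[[str, str, str], float]


def sample_score(judge: Judge, question: str, prediction: str,
                 references: Sequence[str]) -> float:
    """Equation 1: take the best correctness estimate over all references."""
    return max(judge(question, ref, prediction) for ref in references)


def benchmark_accuracy(judge: Judge,
                       samples: Sequence[Tuple[str, str, Sequence[str]]],
                       threshold: float = 0.5) -> float:
    """Fraction of samples scored above `threshold` (0.5 is an assumption)."""
    correct = sum(
        sample_score(judge, question, prediction, references) > threshold
        for question, prediction, references in samples
    )
    return correct / len(samples)


def toy_judge(question: str, reference: str, prediction: str) -> float:
    """Hypothetical stand-in for Lingo-Judge: Jaccard word overlap."""
    ref_words = set(reference.lower().split())
    pred_words = set(prediction.lower().split())
    if not ref_words or not pred_words:
        return 0.0
    return len(ref_words & pred_words) / len(ref_words | pred_words)
```

Taking the maximum rather than the mean means a prediction is not penalised for matching only one of the two differently phrased reference answers.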

The architecture of the classifier is a DeBERTa-V3 [20] language model, fine-tuned with LoRA [22]. The classification score is predicted using a linear head on top of the class token output. We fine-tuned the model on a diverse dataset of model predictions from early experiments, where questions and ground truth answers come from our evaluation dataset and the correctness target is labeled by human annotators. On top of this initial dataset, we iteratively improved the classifier using active learning by correcting the wrong predictions of discarded models and adding corrections to the training dataset. On a held-out test set, we find that the binary classification accuracy of the classifier is 95%.

In comparison to metrics such as CIDEr, which provide a system-level performance metric, the classifier provides a probability of correctness for each of the model predictions, meaning that it provides metrics at the sample level. Examples are provided in Appendix B. This makes the aggregate score easy to interpret: a Lingo-Judge score of 100% means that every answer was judged correct. The classifier allows us to compute metrics during training, running over our full evaluation dataset in about 10 seconds using an A100 GPU.

Correlation to human ratings

We studied the empirical correlation of various metrics with human judgments. Several human annotators assigned a scalar score in [0, 1] to the inference outputs of 17 different models, which can be interpreted as the probability that the response correctly addresses the question [34]. Notably, this process takes several days, highlighting the need for an automated evaluation metric that provides faster development feedback. The final human score of each model is the average of all inference output scores. Further details regarding the methodology for the correlation analysis are in Appendix D.

The Spearman rank correlation coefficient of our automated metric, Lingo-Judge, with human scores is 0.95, and the Pearson correlation coefficient is 0.993. These values are considerably higher compared to other natural language evaluation metrics and GPT-4, as detailed in Table 1. Our analysis demonstrates that Lingo-Judge accurately mirrors human judgments, outperforming existing metrics such as BLEU, METEOR, and CIDEr, as well as GPT-4 with and without chain-of-thought prompting. This indicates that Lingo-Judge can effectively serve as a proxy for human labelling, which is particularly significant given the stagnant nature of metrics in autonomous driving since the introduction of the CIDEr metric in 2015. Notably, despite their limitations, prominent models like ADAPT [27] and DriveGPT [56] still use BLEU, METEOR, and CIDEr metrics and report ChatGPT ratings without analyzing their correlation to human preferences. Our work fills this gap by providing a reliable benchmark that better reflects human preferences.
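For readers who want to reproduce this kind of analysis on their own model scores, the two coefficients can be computed as sketched below. This is a minimal standard-library sketch under the assumption that each model contributes one averaged human score and one averaged metric score; the function names are illustrative and not part of the released code.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance normalised by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Spearman only checks whether a metric orders the models the same way humans do, while Pearson also rewards a linear relationship between the raw scores, which is why both are reported.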

| Dataset | Scenarios | QA pairs | QA per scenario |
|---|---|---|---|
| Action | 24.5k | 267.8k | ≈ 10.9 |
| Scenery | 3.5k | 152.5k | ≈ 43.6 |
| Eval. Dataset | 100 | 1000 | 10 |

Table 2: Dataset Split. It consists of three different datasets of varying annotation densities. The Action dataset focuses on questions related to driving behaviours, the Scenery dataset focuses on perception capabilities, while the evaluation dataset is designed to probe a range of competencies.
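The annotation densities in Table 2 follow directly from the first two columns. The short check below recomputes them from the rounded counts reported in the table (the exact counts would give slightly different decimals); the dictionary and function names are illustrative only.

```python
# Rounded counts as reported in Table 2: (scenarios, QA pairs).
TABLE_2 = {
    "Action": (24_500, 267_800),
    "Scenery": (3_500, 152_500),
    "Eval. Dataset": (100, 1_000),
}


def qa_per_scenario(scenarios: int, qa_pairs: int) -> float:
    """Average number of question-answer pairs per scenario."""
    return qa_pairs / scenarios


for name, (scenarios, qa_pairs) in TABLE_2.items():
    density = qa_per_scenario(scenarios, qa_pairs)
    print(f"{name}: {density:.1f} QA pairs per scenario")
```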
| Dataset | Scenarios | Annotations | QA | Captioning | Video length [sec] |
|---|---|---|---|---|---|
| Rank2Tell [44] | 118 | > 118 | | | 20 |
| BDD-OIA [55] | 22.9k | 35k | | | 5 |
| BDD-X [29] | 6.9k | 26k | | | 40 |
| NuScenesQA [42] | 34k | 460k | | | 20 |
| DriveLM [46] | 30k | 443k | | | 20 |
| LingoQA | 28k | 419.9k | | | 4 |

Table 3: Dataset Features. The dataset that we introduce alongside our benchmark consists of questions related to object presence, as well as action, justification, attention, localisation, counting, anticipation, and counterfactuals. In total, it has a similar size to other driving-related datasets such as NuScenesQA, while having a much higher diversity and not being limited to questions related to object positioning.

3.3 Datasets

Figure 2: Dataset Statistics. Dataset split by the number of question-answer pairs for the competencies covered and for the objects referred. One question-answer pair might cover more than one competency or object, hence the total is higher than the size of the datasets. The Action and Scenery datasets have complementary strengths, with one focused more on action-justification competencies and one more on description and localisation.

We created a collection of datasets for bringing language to autonomous driving. The total dataset size is 419.9k question-answer pairs, where a single data sample consists of a 4-second video clip at 1 Hz. The total size of the dataset is about 10x larger than BDD-X [29], as shown in Table 3. Compared to prior datasets such as NuScenesQA [42], our dataset contains reasoning pairs in addition to object presence, description, and localisation. The answers are also more free-form and more complex, with an average answer length of 17.2 words versus 1.0 words in NuScenesQA. Examples of question-answer pairs from LingoQA are shown in Appendix A.

Our labeled autonomous driving training dataset consists of two complementary parts: the action dataset and the scenery dataset.

Action dataset.

This dataset was made from a recorded driving corpus of interesting events where the car’s behavior changes, such as decelerations, accelerations, lane changes, narrow gaps, and turns. Such events were succinctly labeled by driving operators with very short high-level descriptions of the situations and behavioral policies (e.g. “following lane, pedestrian on a zebra crossing, should stop”). Additionally, we added metadata for such events from various perception systems, such as traffic light presence, vehicles and pedestrian visual detectors, weather descriptors, as well as other metadata (speed, steering wheel position, and road type from the map data). Using this data, we developed prompt templates for (1) describing the current action and its justification and (2) a set of example questions and hints about what the answer should mention. Next, we used those prompts with GPT-3.5 to rephrase, answer, and extend the example questions using the provided action description and answer hints. We rebalanced events by bucketing by actions and behavioral policies and sampled up to 500 events from each bucket without replacement, leading to 24,577 video snippets with 267,774 question/answer pairs, consistent with the 267.8k QA pairs reported for the Action dataset in Table 2.
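The rebalancing step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the event records, the field names `action` and `policy`, and the use of the (action, policy) pair as the bucket key are hypothetical, since the text only states that events are bucketed by actions and behavioral policies and capped at 500 per bucket.

```python
import random
from collections import defaultdict
from typing import Dict, List


def rebalance(events: List[Dict[str, str]], max_per_bucket: int = 500,
              seed: int = 0) -> List[Dict[str, str]]:
    """Sample up to `max_per_bucket` events per bucket without replacement."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for event in events:
        # Assumed bucket key: the (action, behavioural policy) pair.
        buckets[(event["action"], event["policy"])].append(event)

    sampled = []
    for key in sorted(buckets):  # sorted for reproducibility
        bucket = buckets[key]
        k = min(max_per_bucket, len(bucket))
        sampled.extend(rng.sample(bucket, k))  # sampling without replacement
    return sampled
```

Capping each bucket keeps frequent events, such as plain lane following, from dominating the training set, while rare events are kept in full.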

Scenery dataset

The scenery dataset was built to complement the action dataset by focusing on perception-related questions in addition to driving behaviours. The dataset was made by densely and thoroughly labelling three 30-minute driving sessions with the ELAN video annotation software [38]. For the entire duration of the driving sessions, we provided short captions in about 15 different categories:

  • Driver’s actions, and their justifications

  • Driver’s attention

  • Observations about relevant vehicles, pedestrians, and other road actors with their visual descriptions

  • Observations about relevant road elements such as traffic lights, traffic islands, lane and intersection structures

  • Miscellaneous observations about the environment, such as weather, tube stations, and buildings.

Then, for every keyframe (one per second, i.e. 1 fps), we collect all annotations around this frame and build a textual description containing the driver’s actions and their justifications, the objects requiring the driver’s attention, and the observations. As opposed to the Action dataset, where recommended questions were provided to GPT-3.5 for rephrasing, for the Scenery dataset, we asked GPT-4 to generate questions and answers using a set of generic prompts, but also using a prompt with the chain of thought specifically targeting perception questions. This forced GPT-4 to generate many diverse questions and answers. This led to a high-quality diverse dataset with about 43 QA-pairs per video.
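The keyframe step can be illustrated with a short sketch. Everything specific here is an assumption for illustration: the `Annotation` record, the 2-second window that defines "around this frame", and the one-line-per-category text layout are not specified in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    start: float    # seconds from the start of the session
    end: float
    category: str   # e.g. "action", "attention", "observation"
    text: str


def keyframe_times(duration_s: float, fps: float = 1.0) -> List[float]:
    """Timestamps of keyframes sampled at `fps` (1 fps in the text)."""
    step = 1.0 / fps
    return [i * step for i in range(int(duration_s * fps) + 1)]


def annotations_near(annotations: List[Annotation], t: float,
                     window: float = 2.0) -> List[Annotation]:
    """Annotations whose interval overlaps [t - window, t + window]."""
    lo, hi = t - window, t + window
    return [a for a in annotations if a.start <= hi and a.end >= lo]


def describe_keyframe(annotations: List[Annotation], t: float,
                      window: float = 2.0) -> str:
    """Build a textual description with one line per annotation category."""
    grouped = {}
    for a in annotations_near(annotations, t, window):
        grouped.setdefault(a.category, []).append(a.text)
    return "\n".join(f"{cat}: {'; '.join(texts)}"
                     for cat, texts in grouped.items())
```

A description built this way would then be placed into the prompt from which the question-answer pairs are generated.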

Our training dataset covers 9 different competencies: action (what the vehicle is doing), justification (why the action is taken), attention (what should be paid attention to in the current situation), identification (identifying an object given its description), localisation, description, counting, anticipation and reasoning given counterfactuals. The questions also cover a diverse set of objects, such as pedestrians, vehicles, cyclists, buildings, road infrastructure, signs, markings. In Figure 2, we present the number of question and answer pairs for each of the 9 competencies above, as well as for the referred objects, for our two datasets, namely Action and Scenery. The complementary strengths of the datasets are apparent, with one focused on driving behaviours and one on perception tasks.

4 Model Methodology

We propose LingoQA Baseline, a vision language model for autonomous driving based on Vicuna v1.5 [13] with 7B parameters that can answer reasoning questions grounded in video inputs. We train a model that consumes a short video segment and produces answers to autonomous driving-related questions.

4.1 Architecture

The LingoQA Baseline model architecture is based on recent VLMs [30, 35, 17] but enhances them by incorporating a video encoding strategy to process multiple frames from a video snippet, as shown in Figure 3.

Figure 3: LingoQA Baseline model architecture. We first encode individual frames using CLIP and Q-Former. The Q-Former outputs tokens and we feed the tokens from all frames along with chat history and questions into the LLM, which then predicts an answer.

Vision encoder.

We use CLIP [43], a Vision Transformer (ViT) pre-trained contrastively on image-language pairs, to encode images into features. The inputs to the vision encoder are RGB images from the front camera. We squash the input images to a size of 224 × 224 as opposed to cropping them in order to keep the full image context. Subsequently, we pass the features through a transformer network, the Querying Transformer (Q-Former), that akin to BLIP-2 [30] acts as a bridge between the vision and language feature spaces. The embeddings are then projected into the large language model (LLM) space using a linear projection layer. We repeat this process for T = 5 frames of the input video and concatenate the tokens from each image.

Large language model.

We leverage pretrained LLMs to give LingoQA Baseline the ability to answer general questions related to both driving scenes, as well as general knowledge. We use Vicuna v1.5 [13] with 7B parameters built on top of Llama-2 [47]. The language model is auto-regressive and hence can be conditioned on textual inputs, as well as image tokens. The training objective is to predict the next language token in a sequence. We mask all tokens from the training loss that belong to the text prompt, including question and chat history.
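The prompt masking described above can be sketched as follows. This is a minimal sketch rather than the training code: the value -100 is an assumption borrowed from the default `ignore_index` of PyTorch's cross-entropy loss, the token ids are placeholders, and the one-position shift between inputs and next-token targets is omitted for brevity.

```python
from typing import List, Sequence, Tuple

# Assumed convention: label value that the loss function skips.
IGNORE_INDEX = -100


def build_labels(prompt_ids: Sequence[int],
                 answer_ids: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and answer; only answer tokens keep a label."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


def masked_mean_loss(per_token_losses: Sequence[float],
                     labels: Sequence[int]) -> float:
    """Average the per-token loss over unmasked (answer) positions only."""
    kept = [loss for loss, label in zip(per_token_losses, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)
```

With this masking, the question and chat history condition the prediction but contribute no gradient of their own, so the model is optimised only on producing the answer.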

4.2 Training Recipe

Our training uses a two-step approach to better utilise video features and improve learning when answering questions based on video data. In the first stage, we train the self-attention layers (QKV) of the LLM and the vision encoder, together with all the Q-Former parameters and the linear language projection layer. In the second stage, we fine-tune the same parameters as in the previous stage, keeping the vision encoder frozen. Further details regarding training parameters are presented in Appendix F.
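One way to express which parameters are updated in each stage is a name-based filter, sketched below. The module prefixes (`llm.`, `vision_encoder.`, `q_former.`, `language_projection.`) and the `q_proj`/`k_proj`/`v_proj` naming are hypothetical and would need to be adapted to the actual model definition.

```python
# Hypothetical substrings identifying query/key/value projection weights.
QKV_TAGS = (".q_proj.", ".k_proj.", ".v_proj.")


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if the named parameter is updated in the given stage."""
    is_qkv = any(tag in param_name for tag in QKV_TAGS)
    if param_name.startswith(("q_former.", "language_projection.")):
        return True                    # fully trained in both stages
    if param_name.startswith("llm."):
        return is_qkv                  # attention QKV only, both stages
    if param_name.startswith("vision_encoder."):
        return stage == 1 and is_qkv   # frozen during stage 2
    return False
```

In a typical training loop such a predicate would be applied to every named parameter to decide whether it receives gradients.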

| Ablation | Variant | Lingo-Judge [%] ↑ | BLEU ↑ | METEOR ↑ | CIDEr ↑ |
|---|---|---|---|---|---|
| LingoQA Baseline | | 60.80 | 15.00 | 18.56 | 65.61 |
| Training recipe (instead of pre-train and fine-tune) | No fine-tuning | 33.60 | 8.33 | 14.33 | 39.16 |
| | No pre-training | 56.60 | 13.53 | 17.91 | 57.98 |
| Fine-tuning dataset (instead of action and scenery) | Action only | 53.80 | 11.65 | 17.68 | 46.50 |
| | Scenery only | 55.40 | 13.00 | 18.38 | 55.88 |
| Frame count (instead of 5 frames) | Single frame | 57.00 | 14.21 | 18.40 | 59.46 |
| | 3 frames | 59.80 | 14.61 | 18.44 | 62.61 |
| | 7 frames | 60.60 | 14.46 | 18.61 | 61.82 |
| Video fusion (instead of late-fusion) | Early-fusion | 48.40 | 13.98 | 17.61 | 61.42 |
| | Mid-fusion | 59.20 | 14.44 | 18.47 | 63.05 |
| Language model (instead of Vicuna-1.5-7B [13]) | OPT-7B [60] | 50.00 | 14.98 | 15.99 | 60.08 |
| | Llama-2-7B-Chat [48] | 59.20 | 13.52 | 18.43 | 59.87 |
| | Mistral-7B-Instruct [26] | 58.00 | 13.80 | 18.33 | 64.21 |

Table 4: Empirical Evaluation on LingoQA. Ablation study highlighting the impact of various modifications in training recipes, dataset composition, frame count, video processing techniques and language model.

Stage 1: Pre-training for feature alignment.

In the first stage, we pre-train the model on the GQA and SVIT datasets to align image features with the embedding space of the pretrained LLM. The GQA dataset [23] contains more than 22M questions over 113k images. The recently introduced SVIT dataset [61] contains 4.2M question-answer pairs over 108.1k images. We leverage initial weights from different models to accelerate the training process. We initialise the vision encoder using publicly available weights of OpenCLIP [24], the Q-Former from BLIP2 weights [31], and language model from Vicuna v1.5 [13].

Stage 2: Fine-tuning for video QA.

In the second stage, we fine-tune the model on our video question-answering Action and Scenery datasets described in Section 3.3. During the fine-tuning phase, each sample is composed of 5 frames taken from a 4-second span of video, accompanied by a QA-pair. To facilitate further exploration of autonomous driving QA, we open-source the dataset used to fine-tune LingoQA Baseline.
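The frame selection can be made concrete with a small helper. Five frames over a 4-second span correspond to one frame per second, which matches the 1 Hz clips described in Section 3.3; the uniform spacing for other frame counts is an assumption used here only for illustration.

```python
from typing import List


def frame_timestamps(num_frames: int = 5, span_s: float = 4.0) -> List[float]:
    """Evenly spaced timestamps covering [0, span_s], both ends included."""
    if num_frames < 2:
        raise ValueError("need at least two frames to cover a time span")
    step = span_s / (num_frames - 1)
    return [i * step for i in range(num_frames)]
```

For example, `frame_timestamps()` yields frames at 0, 1, 2, 3 and 4 seconds, while `frame_timestamps(3)` would place frames at 0, 2 and 4 seconds under the same uniform-spacing assumption.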

5 Empirical Evaluation on LingoQA

With the highly modular architecture of VLMs, the question remains which architectural components of the LingoQA Baseline model and which aspects of the dataset composition contribute the most to its performance. We conduct several ablation studies around the architecture and training paradigm described in Section 4. To this end, we investigate variations to the training strategy, training data composition, frame count, video fusion methods, and the use of different large language models. The results are obtained by having each model generate one answer per question and then comparing the predicted answer to the two ground truth answers. Examples of comparisons between our baseline model’s answers and answers from other models from the ablations are presented in Appendix G.

5.1 Training Strategy

The aim of the training strategy experiments is to understand how much the pre-training and the fine-tuning steps contribute to performance. It becomes apparent that fine-tuning leads to improved answers relevant to autonomous driving. The model fine-tuned on the LingoQA dataset achieves nearly double the Lingo-Judge score of the model that is only pre-trained on generic VQA datasets (60.80% versus 33.60% in Table 4).

5.2 Training Datasets Mixture

Table 4 shows the contributions of our two datasets, the Action dataset and the Scenery dataset. Both datasets proved influential in improving model performance. We show that fine-tuning on the LingoQA dataset that we open source leads to a considerable improvement compared to general pre-training only.

5.3 Impact of Frame Count

We want to investigate the variation in VQA performance with decreasing and increasing the number of video frames fed into the model. The base model contains 5 frames over a 4-second context. The performance declined when shifting from multi-frame video to a single image representation, which can be explained by the model not getting enough information to answer questions where temporal information is crucial. The performance when increasing the number of frames to 3, 5, and 7 remains relatively consistent. We hypothesize that this is due to the lack of effective video fusion, hence we highlight the opportunity for improved video encoding methods.

5.4 Impact of Video Fusion Strategy

Given how crucial temporal context is for scenario understanding, this study explores three methods for integrating video frames into the LLM: early-fusion, mid-fusion, and late-fusion. The early-fusion method employs average pooling to condense features from the vision encoder prior to their incorporation into the Q-Former, producing a unified visual feature vector for language space projection. In contrast, the mid-fusion approach merges video features into fixed-size tokens within the Q-Former using the cross-attention mechanism. The late-fusion method feeds the individual frame embeddings from the Q-Former output into the LLM, allowing the LLM itself to resolve temporal relationships. Our findings demonstrate that both mid-fusion and late-fusion are effective methods for incorporating video content into the model. Mid-fusion leaves room for a greater number of context tokens because it uses a predetermined number of video embeddings. Conversely, late-fusion shows slightly better performance by providing comprehensive frame information to the LLM.
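The three fusion strategies differ mainly in where the frame axis is collapsed and in how many visual tokens reach the LLM. The following is a minimal NumPy sketch of that difference, not the implementation used in the paper. The function `toy_q_former` is a hypothetical single cross-attention layer standing in for the real Q-Former, and the frame, patch, query, and feature sizes are illustrative assumptions.

```python
import numpy as np


def toy_q_former(features, queries):
    """Hypothetical stand-in for a Q-Former: one cross-attention step.

    features: (num_tokens, dim) visual features used as keys and values.
    queries:  (num_queries, dim) fixed query vectors.
    Returns (num_queries, dim) output tokens.
    """
    scores = queries @ features.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ features


def early_fusion(frames, queries):
    # Average-pool over the frame axis before the Q-Former.
    pooled = frames.mean(axis=0)  # (patches, dim)
    return toy_q_former(pooled, queries)


def mid_fusion(frames, queries):
    # Let the queries cross-attend over the features of all frames at once,
    # which yields a fixed number of tokens regardless of the frame count.
    t, p, d = frames.shape
    return toy_q_former(frames.reshape(t * p, d), queries)


def late_fusion(frames, queries):
    # Run the Q-Former per frame and pass every frame's tokens onwards.
    per_frame = [toy_q_former(frame, queries) for frame in frames]
    return np.concatenate(per_frame, axis=0)


# Illustrative sizes (assumptions): 5 frames, 16 patches, 8 dims, 4 queries.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 16, 8))
queries = rng.normal(size=(4, 8))

print(early_fusion(frames, queries).shape)
print(mid_fusion(frames, queries).shape)
print(late_fusion(frames, queries).shape)
```

In this sketch, early-fusion and mid-fusion both hand the language model a number of tokens equal to the number of queries, whereas late-fusion hands it that number multiplied by the frame count. This is the trade-off between context budget and per-frame detail discussed above.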

5.5 Impact of Large Language Model

We investigate the impact that different Large Language Models have on the overall performance of our vision-language model. As shown in Table 4.2, the best score is achieved by Vicuna-1.5-7B [13], which our base model uses. In the same family of models, Llama-2-7B [48] achieves comparable but slightly lower performance. Mistral-7B [26], despite its promise of superior performance over Llama-2, proved less effective in our fine-tuning task. Notably, OPT-7B [60] demonstrates significantly lower performance, despite having a similar number of parameters. This discrepancy underscores the crucial role of the pretraining phase in the base language model’s effectiveness.

5.6 Evaluation of SOTA Vision-Language Models

To demonstrate the relevance of the newly proposed benchmark, we evaluate a series of SOTA vision-language models and compare them to human performance. The models shown in Table 5 are evaluated zero-shot. The best-performing model, GPT-4V, achieves an accuracy of 56.67% as rated by humans, which is well below human performance at 93.4%. This highlights the challenging nature of our proposed benchmark.

Table 5: Evaluating vision-language models zero-shot. The performance of existing vision-language models is far from human capability, highlighting the challenging nature of the benchmark.
Model     Human    Lingo-J    GPT-4    BLEU     METEOR    CIDEr
Human     93.4     96.6       4.68     81.04    52.92     361.77
GPT-4V    56.67    59.6       3.19     6.30     12.35     42.82
LLaVA     38.97    49.4       2.45     4.23     8.38      38.39
FUYU      17.69    45.4       2.28     1.90     13.00     12.04

6 Discussion

Strengths of Lingo-Judge.

The main strength of our contribution is a classifier that is highly correlated with human judgement and efficient to run. In conjunction with the evaluation dataset that we propose, it becomes a useful tool for benchmarking vision-language models for autonomous driving on the video question answering task, which has historically been challenging to evaluate in a consistent fashion. By providing a reliable, efficient, and easy-to-interpret benchmark, this contribution can accelerate autonomous driving research.

Limitations of Lingo-Judge.

This work still has a few limitations, which we discuss below along with guidelines on how their effects may be mitigated. First, the classifier has been trained for autonomous driving evaluation and has been shown effective for this purpose. Hence, the classifier can be used for evaluation when paired with the benchmark dataset that we release with the paper. Second, we optimized the classifier to evaluate responses in the style provided by human annotators in the evaluation dataset. The same response style is adopted in the LingoQA training sets and the models. Generalisation to other response styles is studied in Appendix E. Third, as the classifier is only trained to predict factual correctness, it cannot discern which of two equally correct answers humans prefer.

Dataset and model limitations.

One of the primary constraints is that our model operates on relatively short video segments and few frames, limiting the contextual understanding of scenarios. We also do not test for driving decisions and attention mechanisms, focusing on question-answering abilities only. We did not study how our models scale and focused on the most practical 7B parameter LLMs only. Our dataset and baseline are limited to information from a single front-facing car camera, excluding additional sensory inputs like LiDAR that could enrich the model’s understanding of the driving environment. Expanding the model to address the short video context, as well as adding action prediction and evaluation to the dataset and the benchmark, would result in a more versatile system for autonomous driving.

7 Conclusion

In this paper, we introduced a novel benchmark for Video Question Answering for autonomous driving. The benchmark consists of an evaluation dataset, a learned classifier-based metric, Lingo-Judge, that is highly correlated with human evaluation, and a comprehensive high-quality training dataset for autonomous driving. The fast feedback from employing Lingo-Judge facilitates effective exploration in the video QA field. Additionally, the comprehensive experiments on different model combinations presented in this paper can become a foundation for further improvement of end-to-end autonomous driving systems. The LingoQA benchmark is openly released to spur further community research, providing an evaluation method that is reliable and highly correlated with human ratings.

Acknowledgements

This work was possible through the help of many colleagues across Wayve. We would like to thank Elahe Arani for the valuable discussions and feedback on the report. In particular we would like to acknowledge the support from: Anthony Hu, Miguel Sanchez Lohff, Lorenzo Bertoni, Charlie Lyons-Rothbart, Emma Wang, Harriett-Rose Follas, Kyle Esecson, Ben Foxall, Naama Zahavi, Ruben Diaz, Rudi Rankin, Tilly Pielichaty, Sarah Belghiti, Giulio D’Ippolito, Dan Reisman, Alex Persin, Fergal Cotter, Przemyslaw Mazur, Thomas Sajot, Giacomo Gallino, Alex Garcia Mayans, Tim Geypens, Robin Tweedie, Rebecca Hills.

References

  • [1] Partners for automated vehicle education. pave poll 2020. https://pavecampaign.org/pave-poll-americans-wary-of-avs-but-say-education-and-experience-with-technology-can-build-trust/. Accessed: 2023-10-12.
  • [2] What’s going on with the open llm leaderboard? https://huggingface.co/blog/evaluating-mmlu-leaderboard. Accessed: 2023-10-22.
  • [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
  • [4] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai, 2019.
  • [5] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
  • [6] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
  • [7] Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. In Advances in Neural Information Processing Systems, 2022.
  • [8] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.
  • [9] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. Rt-1: Robotics transformer for real-world control at scale, 2023.
  • [10] Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving, 2023.
  • [11] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers, 2023.
  • [12] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, and Weicheng Kuo. Pali: A jointly-scaled multilingual language-image model. International Conference on Learning Representations, 2023.
  • [13] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  • [14] Pranav Singh Chib and Pravendra Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey, 2023.
  • [15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  • [16] Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2088–2098, 2019.
  • [17] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied multimodal language model, 2023.
  • [18] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11525–11533, 2020.
  • [19] Jeffrey Hawke, Haibo E, Vijay Badrinarayanan, and Alex Kendall. Reimagining an autonomous vehicle, 2021.
  • [20] Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing, 2023.
  • [21] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 20703–20716. Curran Associates, Inc., 2022.
  • [22] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
  • [23] Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019.
  • [24] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021.
  • [25] Sarthak Jain and Byron C Wallace. Attention is not explanation. arXiv preprint arXiv:1902.10186, 2019.
  • [26] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
  • [27] Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, and Jingjing Liu. Adapt: Action-aware driving caption transformer, 2023.
  • [28] Jinkyu Kim and John Canny. Interpretable learning for self-driving cars by visualizing causal attention, 2017.
  • [29] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles, 2018.
  • [30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2022.
  • [31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  • [32] Jiangtong Li, Li Niu, and Liqing Zhang. From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
  • [33] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
  • [34] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022.
  • [35] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
  • [36] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023.
  • [37] Jiageng Mao, Yuxi Qian, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023.
  • [38] The Language Archive Nijmegen: Max Planck Institute for Psycholinguistics. Elan, 2023.
  • [39] OpenAI. Gpt-4 technical report, 2023.
  • [40] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
  • [41] Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, and João Carreira. Perception test: A diagnostic benchmark for multimodal video models. In Advances in Neural Information Processing Systems, 2023.
  • [42] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836, 2023.
  • [43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • [44] Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, and Mykel Kochenderfer. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning, 2023.
  • [45] Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving, 2023.
  • [46] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
  • [47] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
  • [48] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • [49] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation, 2015.
  • [50] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [51] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2022.
  • [52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • [53] Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, and Botian Shi. On the road with gpt-4v(ision): Early explorations of visual-language model on autonomous driving, 2023.
  • [54] Wei Xu. From automation to autonomy and autonomous vehicles: Challenges and opportunities for human-computer interaction. Interactions, 28(1):48–53, dec 2020.
  • [55] Yiran Xu, Xiaoyin Yang, Lihang Gong, Hsuan-Chu Lin, Tz-Ying Wu, Yunsheng Li, and Nuno Vasconcelos. Explainable object-induced action decision for autonomous vehicles, 2020.
  • [56] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model, 2023.
  • [57] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space, 2022.
  • [58] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models, 2022.
  • [59] Haotian* Zhang, Pengchuan* Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 2022.
  • [60] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022.
  • [61] Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning, 2023.

Appendix A LingoQA Dataset Examples

Further examples of the capabilities covered by the training and the evaluation datasets are shown in Figure 4. The scenery dataset contains highly descriptive elements, such as object colours, junction type, construction zones, traffic lights, and the road layout. The action dataset is complementary and focused on driving competencies, such as the impact of traffic lights on driving and interactions with other road agents. The evaluation dataset contains a broad range of questions aimed at testing competencies relevant for autonomous driving. Further examples from the evaluation benchmark are also included in the overview video accompanying the submission. Notably, any personally identifiable information, such as faces and plate IDs, has been anonymised in the dataset.

Figure 4: LingoQA dataset examples. From left to right: scenery dataset, action dataset, and evaluation dataset.

Appendix B Lingo-Judge Examples

We present additional qualitative examples from our evaluation dataset in Table 6, alongside predictions from our base model and corresponding metrics for each individual sample. Metrics based on n-gram matching, such as CIDEr, tend to be error-prone. For example, expressions that have the same meaning but entirely different words, such as “None” and “There are no cars.”, are marked as not similar at all. Conversely, sentences with minor but significant differences, such as “The traffic lights are showing green” and “The traffic lights are showing red”, are graded as highly similar despite having opposite meanings. Lingo-Judge demonstrates robustness against these varied expressions and subtle changes. Lingo-Judge also has limitations, primarily seen when establishing the correctness of the answer would require extra context from the videos. These examples can be seen in Table 7.

We qualitatively compare our classifier to GPT-4 ratings, with examples shown in Figure 5. In the example discussed here, GPT-4 is misled by the fact that the model answer contains partially correct information. The GPT-4 assessment states that “The student correctly identified the presence of a traffic light” and, even though the colour is incorrect, continues with “and accurately stated its colour”. This highlights some challenges faced by GPT-4 when trying to rate the truthfulness of an answer. Lingo-Judge correctly identifies that the statements made by the model are false.

Question: How many pedestrians are crossing the road?
Label: Zero pedestrians
Prediction: There are no pedestrians crossing the road.
CIDEr: 23.18 · GPT4: 5 · GPT4-CoT: 5 · L-J Prob.: 0.96 · L-J Class.: True

Question: What is the road speed limit?
Label: 20 mph - it is written on the road
Prediction: The road speed limit is 20 mph.
CIDEr: 97.91 · GPT4: 4 · GPT4-CoT: 3 · L-J Prob.: 0.95 · L-J Class.: True

Question: How many cars are driving in your direction?
Label: None
Prediction: There are no cars driving in my direction.
CIDEr: 0.11 · GPT4: 4 · GPT4-CoT: 3 · L-J Prob.: 0.96 · L-J Class.: True

Question: Which vehicle should you follow if any?
Label: The motorcyclist.
Prediction: If any, I should follow the motorcycle ahead.
CIDEr: 0.42 · GPT4: 4 · GPT4-CoT: 5 · L-J Prob.: 0.95 · L-J Class.: True

Question: What is the current action and its justification? Answer in the form “action, justification”
Label: Slow down, there is a stationary van infront of us
Prediction: I am decelerating because of the stationary truck ahead.
CIDEr: 34.79 · GPT4: 5 · GPT4-CoT: 5 · L-J Prob.: 0.96 · L-J Class.: True

Question: What is the current action and its justification? Answer in the form “action, justification”
Label: Stop, Red light
Prediction: I am stopping because the traffic lights to go straight are red.
CIDEr: 18.62 · GPT4: 5 · GPT4-CoT: 5 · L-J Prob.: 0.95 · L-J Class.: True

Question: How many cyclists can you see?
Label: I can see 3 cyclists
Prediction: I can see two cyclists.
CIDEr: 150.29 · GPT4: 1 · GPT4-CoT: 2 · L-J Prob.: 0.05 · L-J Class.: False

Question: What color are the traffic lights showing?
Label: The traffic lights are showing green
Prediction: The traffic lights are showing red.
CIDEr: 329.36 · GPT4: 0 · GPT4-CoT: 1 · L-J Prob.: 0.05 · L-J Class.: False

Question: What action are you taking with respect to the cyclist?
Label: Overtaking them on the right and keeping the speed
Prediction: I am overtaking the cyclist on the left.
CIDEr: 349.52 · GPT4: 2 · GPT4-CoT: 2 · L-J Prob.: 0.10 · L-J Class.: False

Question: In which direction is the bus driving?
Label: The bus is driving in the opposite direction
Prediction: The bus is driving in the oncoming direction.
CIDEr: 404.65 · GPT4: 4 · GPT4-CoT: 5 · L-J Prob.: 0.31 · L-J Class.: False

Question: Are there any parked car on the side of the road?
Label: Yes, there are two cars parked on the right of the road
Prediction: No, there are no parked cars on either side of the road.
CIDEr: 142.40 · GPT4: 0 · GPT4-CoT: 0 · L-J Prob.: 0.05 · L-J Class.: False

Question: Is acceleration necessary in this situation? If so, provide the reason.
Label: No. We should decelerate in this situation because there is a vehicle stopping ahead of us.
Prediction: No, acceleration is not necessary in this situation as I am already driving at the speed limit.
CIDEr: 177.96 · GPT4: 3 · GPT4-CoT: 3 · L-J Prob.: 0.31 · L-J Class.: False
Table 6: Qualitative comparison of metrics. Questions and labels from our evaluation dataset along with exemplary predictions and their corresponding sample-level metrics. Notable is that the classifier (here “L-J”) can pick up slight but crucial differences between ground truth and prediction to correctly identify a wrong prediction, such as “The traffic lights are green” and “The traffic lights are red”, in which case CIDEr still shows high agreement. Green color indicates agreement with human judgement, while orange and red show disagreement.

Question: Is any vehicle stopped at the zebra crossing?
Label: There is no zebra crossing.
Prediction: No, there are no vehicles stopped at the zebra crossing.
CIDEr: 106.19 · GPT4: 4 · GPT4-CoT: 2 · L-J Prob.: 0.93 · L-J Class.: True

Question: What elements are capturing your attention and what is the reason behind it?
Label: I am making sure there is no vehicle on the left lane I am making my lane change to. I am also looking at the traffic light to make sure it stays green while I am crossing the intersection.
Prediction: I am paying attention to the bus lane on the left because I need to maintain a safe distance from it.
CIDEr: 0.91 · GPT4: 3 · GPT4-CoT: 1 · L-J Prob.: 0.57 · L-J Class.: True

Question: What is the current action and its justification? Answer in the form “action, justification”
Label: We are taking a left turn, because we are at a t-junction. Then we keep a safe distance from the cyclists in front.
Prediction: I am decelerating to keep a safe distance from the cyclist ahead of me.
CIDEr: 46.45 · GPT4: 2 · GPT4-CoT: 2 · L-J Prob.: 0.32 · L-J Class.: False
Table 7: Failure Cases of Lingo-Judge. Examples where Lingo-Judge makes a wrong judgement about the correctness of the model prediction. Green color indicates agreement with human judgement, while orange and red show disagreement.
Figure 5: Classifier examples. Examples of Lingo-Judge outputs compared to GPT-4.

Appendix C GPT-4 Grading

In this section we provide an overview of the implementation details for the evaluation method using GPT-4 with and without Chain-of-Thought (CoT) [52] prompting.

GPT-4 with CoT. To evaluate a model’s answer with GPT-4 and CoT prompting, we first provide GPT-4 with the question and one or more valid answers for the question, and ask it to come up with a strategy to evaluate new answers to this question. We then provide GPT-4 with the model’s answer and ask it to evaluate the answer using the strategy it proposed in the previous step. Finally, we ask GPT-4 to give the model a grade between 0 and 5, where 5 means the answer is perfect. The prompt used is shown in Figure 6.

GPT-4 without CoT. When evaluating model outputs without CoT prompting, we provide GPT-4 with the question, one or more valid answers for the question, and the model prediction, and we directly ask GPT-4 to give the model a grade between 0 and 5, without the intermediate reasoning steps. The prompt used is shown in Figure 7.

We issue concurrent requests to our Azure GPT-4 deployment in order to saturate the limit of 40k tokens per minute. GPT-4 without CoT prompting required more than 13 minutes to perform the evaluation, and GPT-4 with CoT prompting required more than 50 minutes.
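The two grading protocols described above differ only in the number of conversation turns. The sketch below illustrates that structure without calling any API. The function names and the prompt wording are hypothetical and are not the prompts from Figures 6 and 7, and `parse_grade` assumes that the grader's reply contains the grade as a standalone digit.

```python
import re


def build_cot_turns(question, references, prediction):
    """Three user turns for CoT grading (hypothetical wording)."""
    refs = "\n".join(f"- {r}" for r in references)
    return [
        # Turn 1: question and valid answers, ask for an evaluation strategy.
        f"Question: {question}\nValid answers:\n{refs}\n"
        "Propose a strategy for evaluating new answers to this question.",
        # Turn 2: model answer, evaluated with the strategy from turn 1.
        f"Student answer: {prediction}\n"
        "Evaluate this answer using the strategy you proposed.",
        # Turn 3: final grade.
        "Give the student a grade between 0 and 5, "
        "where 5 means the answer is perfect.",
    ]


def build_direct_turn(question, references, prediction):
    """Single user turn for grading without CoT (hypothetical wording)."""
    refs = "\n".join(f"- {r}" for r in references)
    return (
        f"Question: {question}\nValid answers:\n{refs}\n"
        f"Student answer: {prediction}\n"
        "Give the student a grade between 0 and 5, "
        "where 5 means the answer is perfect."
    )


def parse_grade(reply):
    """Return the first standalone digit in 0..5 found in the reply, or None."""
    match = re.search(r"\b([0-5])\b", reply)
    return int(match.group(1)) if match else None


turns = build_cot_turns(
    "What color are the traffic lights showing?",
    ["The traffic lights are showing green"],
    "The traffic lights are showing red.",
)
print(len(turns))
print(parse_grade("Grade: 1"))
```

Because the CoT variant needs three sequential turns per question, each including the growing conversation history, it consumes more tokens than the single-turn variant. This is consistent with the longer evaluation time reported above under a fixed tokens-per-minute limit.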

Figure 6: GPT-4 with Chain of Thought (CoT) prompting. First, GPT-4 is provided with the question and ground truth answers, and asked to come up with a strategy for testing the answer. Second, GPT-4 is provided with the model answer and is prompted to evaluate the accuracy of the response based on the previously defined strategy. Finally, GPT-4 is asked to provide a grade for the student.
Figure 7: GPT-4 without Chain of Thought (CoT) prompting. GPT-4 is provided with a prompt that contains the question, the ground truth answers, and the model response, and is requested to directly provide a grade for the student.

Appendix D Lingo-Judge Correlation Study

We show that Lingo-Judge exhibits a higher correlation to human judgment than both commonly used language-based metrics and GPT-4. To do so, we computed the scores of 15 different models and 2 groups of human labellers on the questions in our evaluation dataset using Lingo-Judge, GPT-4, BLEU4, METEOR and CIDEr. These scores are reported in Table 9.

We then computed the Pearson and Spearman correlation coefficients between these metrics and the human evaluation. The Pearson correlation coefficient measures the strength of the linear correlation between the human evaluation and a metric score, while the Spearman rank correlation coefficient measures the monotonic relationship between the human evaluation and the metric. The higher the Spearman coefficient, the better a metric is at ranking answers in the same order as our human evaluators. To compute the confidence intervals, we use the Fisher transformation with a 95% confidence level.
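As a rough illustration of this procedure, the sketch below computes both coefficients and a Fisher-transformation confidence interval using only the Python standard library. It is not the authors' evaluation code. The inputs are the Lingo-Judge and human scores of Models A to I from Table 9, which is only a subset of the 17 points used in the study, so the resulting numbers will not match the reported correlations. Using a standard error of 1/sqrt(n - 3) for both coefficients is an assumption, since the text does not state which variant was applied to the Spearman coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fisher_ci(r, n, z_crit=1.959964):
    """95% confidence interval for r via the Fisher z-transformation.

    Assumes a standard error of 1 / sqrt(n - 3) and requires |r| < 1.
    """
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Models A to I from Table 9 (a subset of the points used in the study).
lingo_judge = [59.6, 59.6, 57.4, 58.2, 59.0, 58.0, 54.8, 50.0, 53.0]
human = [0.571, 0.564, 0.563, 0.559, 0.553, 0.552, 0.529, 0.520, 0.510]

r_p = pearson(lingo_judge, human)
r_s = spearman(lingo_judge, human)
print("Pearson", r_p, fisher_ci(r_p, len(human)))
print("Spearman", r_s, fisher_ci(r_s, len(human)))
```

With so few points the intervals are wide, which illustrates why the number of evaluated models matters when comparing metrics by the tightness of their confidence intervals.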

In Figure 8, the metric scores are plotted against the human evaluators’ grades (from 0 to 1). In red is the least-squares regression of the linear relationship between the metric and the human-assigned grades. Figure 9 shows the value of both correlation coefficients for each of the 5 metrics, as well as their confidence interval bounds. We note that not only does Lingo-Judge provide higher correlation, it also provides tighter confidence intervals than the other metrics.

Appendix E Lingo-Judge Generalisation

To investigate the generalisation abilities of the Lingo-Judge, we examine the performance of the model on a range of answer styles. In particular, we evaluate vision-language models with varying architectures, namely GPT-4V, LLaVA and FUYU. Table 8 shows the performance of the Lingo-Judge as measured by the validation accuracy. This highlights that the model performs best on short, concise answers, akin to those of humans.

Table 8: Robustness to response styles. Investigation into the impact of the response style on validation accuracy.
Model     Response style    Lingo-Judge Val. Acc. [%]
LingoQA   -                 89.50
GPT-4V    few-shot (FS)     83.27
GPT-4V    concise (C)       81.63
GPT-4V    unprompted (U)    83.06
GPT-4V    incorrect (I)     89.59
LLaVA     concise (C)       81.43
LLaVA     unprompted (U)    78.12
FUYU      -                 64.89
Group    Model                      Lingo-Judge [%] ↑   BLEU ↑   METEOR ↑   CIDEr ↑   GPT-4 ↑   Human ↑
Models   Model A                    59.6                15.45    18.36      66.32     3.23      0.571
Models   Model B                    59.6                15.16    18.84      65.11     3.16      0.564
Models   Model C                    57.4                14.87    18.52      65.49     3.08      0.563
Models   Model D                    58.2                14.51    18.59      66.02     3.15      0.559
Models   Model E                    59.0                14.42    18.58      66.95     3.14      0.553
Models   Model F                    58.0                14.82    18.89      65.43     3.11      0.552
Models   Model G                    54.8                14.41    17.86      64.67     2.98      0.529
Models   Model H                    50.0                13.29    17.44      59.87     2.88      0.520
Models   Model I                    53.0                14.63    17.98      64.45     2.96      0.510
Models   Model J                    52.6                12.17    17.59      50.45     3.00      0.509
Models   Model K                    53.0                13.20    18.03      54.90     3.04      0.500
Models   Model L                    51.2                14.69    17.83      64.51     2.91      0.485
Models   Model M                    43.2                13.76    17.37      60.36     2.67      0.371
Models   Model N                    35.8                13.18    15.67      56.07     2.41      0.361
Models   Model O                    33.6                8.33     14.33      39.16     2.07      0.279
Humans   Human labellers group A    96.6                81.04    52.92      361.77    4.68      0.934
Humans   Human labellers group B    91.2                61.72    42.57      267.87    4.3       0.894
Table 9: Correlation study metrics. Metrics from different models on our evaluation dataset used in the correlation study in Table 1. For reference, we also present metrics for answers provided by human labellers. “Human” is the average of inference output scores in the range [0, 1], where 0 is worst and 1 is best, as described in Section 3.2.
Figure 8: Correlation trends. Correlation trends of the average grade of models compared to the average human-grades, for different metrics.
Figure 9: Correlation coefficients. Correlation coefficients of the average grade of different models vs. the average human-grades, for different metrics.

Appendix F Training Parameters

In this section we present further details on the training parameters used for the LingoQA Baseline. The training process consists of a pre-training stage and a fine-tuning stage. Table 10 shows the parameters for pre-training and fine-tuning respectively. The datasets are sampled with equal weight for both pre-training and fine-tuning. The overall training time was approximately 20h for pre-training and approximately 5h for fine-tuning on a machine with 8 NVIDIA A100 80GB GPUs.

Parameter Pre-training Fine-tuning
Precision bf16 bf16
Warm-up steps 1000 1000
Maximum steps 100000 10000
Batch size 6 8
Gradient acc. steps 1 1
Learning rate 5*10^{-5} 5*10^{-5}
Learning rate scheduler cosine cosine
Weight decay 0.1 0.1
Table 10: Training parameters. This table shows the training parameters utilised for the pre-training and for the fine-tuning stages respectively.
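
To make the interaction between the warm-up steps, the maximum steps and the cosine scheduler in Table 10 concrete, the sketch below computes the learning rate at a given step. It is an illustration under stated assumptions rather than the training code: the function name is hypothetical, and we assume a linear warm-up from zero followed by a cosine decay to zero, neither of which Table 10 specifies.

```python
import math

def learning_rate(step, peak_lr=5e-5, warmup_steps=1000,
                  max_steps=10_000, min_lr=0.0):
    """Learning rate at `step` for linear warm-up plus cosine decay.

    Defaults follow the fine-tuning column of Table 10. The linear
    warm-up shape and min_lr = 0 are assumptions.
    """
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from peak_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Fine-tuning schedule (10000 steps) at a few checkpoints.
for step in (0, 500, 1000, 5500, 10_000):
    print(step, learning_rate(step))

# Pre-training uses the same peak rate and warm-up with 100000 steps.
print(learning_rate(50_500, max_steps=100_000))
```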

Appendix G LingoQA Baseline Examples

Figure 10: Examples of model outputs on the LingoQA benchmark. We compare the baseline with a model that has not been fine-tuned on the LingoQA dataset, a model fine-tuned on the action dataset only, and a model fine-tuned on the scenery dataset only. This shows qualitatively how the baseline can handle both action justification as well as descriptive tasks by combining the strengths of both datasets.

We qualitatively showcase the impact of our proposed LingoQA dataset. Figure 10 compares four models: a model that is not fine-tuned on any LingoQA dataset, one that is fine-tuned on the action dataset only, one that is fine-tuned on the scenery dataset only, and the baseline that is trained on both. Two questions are asked, one focused on perception only, and one focused on action justification. The action-only model performs well at answering action-related questions, but not at perception. The scenery-only model performs well at perception tasks, but not at action justification. The baseline exhibits good performance on both.