LingoQA: Video Question Answering for Autonomous Driving

Ana-Maria Marcu^*, Long Chen^*, Jan Hünermann^*,
Alice Karnsund^*, Benoit Hanotte^*, Prajwal Chidananda, Saurabh Nair,
Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski^*
Wayve
[email protected]
^*Equal contributions

Abstract

We introduce LingoQA, a novel dataset and benchmark for video question answering in autonomous driving. The dataset contains 28K unique scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 56.67% of the questions compared to 93.4% for humans. For evaluation, in addition to conducting a human study, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark (https://github.com/wayveai/LingoQA) and hope that it will provide a thorough evaluation platform for future vision-language models in autonomous driving.


Figure 1: LingoQA is a comprehensive benchmark for Video Question Answering in autonomous driving. Our baseline vision-language model on this benchmark can answer questions related to driving reasoning, object recognition, action justification, and scene description.

1 Introduction

Communication plays a pivotal role in naturally fostering trust among individuals. However, establishing trust between users and agents remains a significant challenge within the field of artificial intelligence. Recent studies have indicated that articulating explicit reasoning steps can significantly enhance user confidence [1], in addition to improving the capabilities of Large Language Models (LLMs) [52]. The need for textual justifications remains critical, particularly in safety-critical domains where technology adoption hinges upon this factor [29].

Consider the domain of end-to-end autonomous driving [11], where the driving policy is often executed through deep neural networks processing camera inputs to generate control commands. Recent strides in Vision-Language Models (VLMs) have solidified transformers as multimodal learners, showcasing remarkable performance in tasks such as Visual Question Answering (VQA) and underscoring their proficiency in acquiring robust representations for complex tasks [14]. Integrating VLMs into the field of autonomous driving holds the promise of enhancing user trust in these systems.

Our focus is on vision-only end-to-end autonomous driving, aiming to bridge the gap between data-driven decision-making and user trust. We introduce LingoQA, a benchmark designed for autonomous driving video QA, utilizing a novel dataset comprising more than 419k QA pairs. Distinguished by its free-form approach to questions and answers, this dataset broadens the scope of autonomous driving video QA, encompassing reasoning and action justifications. Additionally, we publish a comprehensive evaluation suite consisting of 1,000 examples. At the core of our benchmark lies a novel evaluation metric based on a learned text classifier called Lingo-Judge, inspired by GPT-Judge used in TruthfulQA [34]. We perform rigorous studies correlating automatic metrics to human preferences and find that Lingo-Judge achieves a 0.950 Spearman and 0.993 Pearson correlation coefficient, surpassing existing automated labelling techniques like METEOR [5], BLEU [40], CIDEr [49], and GPT-4 [39] on our benchmark, while being fast enough for frequent runs during training and development. The evaluation code and the weights for the classifier will be released with the paper to support robust benchmarking of video question answering in autonomous driving.

Equipped with this evaluation toolkit, we conducted a comprehensive empirical study on key components and their ablations in VLMs for autonomous driving. Our findings in Section 5 indicate that the most effective approach involves partially fine-tuning the attention layers of our vision-language model equipped with Vicuna-1.5-7B [13], on both Action and Scenery datasets. This process involves using 5 video frames over 4 seconds and a late video fusion technique. Our collective work, spanning the LingoQA benchmark, the visual instruction-tuning dataset, and the innovative evaluation metric, aims to propel the domain of language-augmented autonomous driving, laying a robust foundation for subsequent research and development endeavors. To summarise the main contributions of this paper:

• LingoQA Benchmark: We introduce a novel benchmark for autonomous driving video QA using a learned text classifier for evaluation. It outperforms existing metrics, including GPT-4, with a Spearman coefficient of 0.950 indicating a strong correlation with human evaluation.

• LingoQA Dataset: Our 419.9k QA pair dataset stands out with its free-form questions and answers, covering not just perception but also driving reasoning from the drivers directly, broadening the scope of autonomous driving video QA.

• LingoQA Baseline: Through testing of various video-language components on LingoQA, we find that the most effective approach involves partially fine-tuning the attention layers of our vision-language model equipped with Vicuna-1.5-7B [13] and a late video fusion technique. We establish a new baseline for this field with an identified model combination. Example outputs from the model are shown in Figure 1.

2 Related Work

2.1 Language in Autonomous Driving

Modern autonomous vehicle software relies heavily on artificial intelligence models [6, 18, 19, 21]. This, together with the increased number of such vehicles on the road, poses a fundamental challenge in terms of interpretability in the decision-making process [4]. Understanding why a decision is made is crucial for understanding areas of uncertainty, building trust, enabling effective human-AI collaboration, and ensuring safety [54]. In a survey conducted by Partners for Automated Vehicle Education (PAVE) in 2020 [1], 60% of participants stated that they would trust AVs more if they better understood the underlying process of the system. To establish trust with the general public, the systems must be explained in a human-interpretable way, such as through language and visual explanations [4].

The field of autonomous driving has been embracing the opportunity to make driving models more trustworthy for their users using visual attention methods [25] or textual explanations [29]. The early explorations of GPT-3.5 [45, 37] and GPT-4V [53] on autonomous driving scenarios show that LLMs/VLMs demonstrate superior performance in scene understanding and causal reasoning compared to existing autonomous systems. Works such as ADAPT [27] and LLM-Driver [10] propose multi-task learning frameworks for jointly predicting language and control outputs. Inspired by progress in large language models [47, 13, 60, 39], vision-language models [50, 7, 51, 57, 58, 59, 12, 3, 30, 39, 35, 15] and multi-modal transformers for robotics [17, 9, 8], our work incorporates language into autonomous driving. Closely related to our proposed baseline is DriveGPT [56], which proposes a multi-modal vision-language-action model that tokenizes videos, as well as text and control actions.

2.2 Evaluation Metrics

Progress has been relatively slow in developing vision-language models for autonomous driving, with only a few works aiming to quantitatively improve upon prior work [28, 27, 56]. A key challenge consists of automated, reproducible evaluation metrics that are highly correlated with human ratings, particularly due to the inherent complexities in assessing natural language. ADAPT [27] reports human feedback in addition to standard natural language generation metrics, while DriveGPT [56] reports ChatGPT ratings. Automated methods such as BLEU [40], METEOR [5], ROUGE [33] show weak alignment with human feedback [49]. CIDEr [49] is also based on n-gram level similarity, as opposed to capturing the correctness of an answer based on its meaning. Newer evaluation metrics using ChatGPT have shown improvement in the area of sentence understanding, while still having limitations, such as providing high scores to elaborate, eloquent, but incorrect sentences [2]. Evaluation based on human feedback is subjective and difficult to reproduce. In this work, we address this challenge by introducing a novel video QA benchmark for autonomous driving that checks for factual correctness and is highly correlated to human correctness ratings on our proposed evaluation dataset.

2.3 Datasets for Autonomous Driving

Recent advances in generative AI have been underpinned by training with increasingly large and diverse internet-scale datasets [3, 30]. This has brought to light the need for evaluation benchmarks and datasets that focus not only on specific tasks, but also on reasoning areas, such as descriptive and predictive reasoning [41]. Prior works, such as the CausalVQA benchmark [32] and the Perception Test [41], a comprehensive benchmark for vision-language foundation models, probe the validity of the model representations through question answering. Autonomous driving datasets have been focused on commentary [29, 55] or constructed around existing object detection datasets [42, 16]. Datasets such as NuScenesQA [42] contain simple language outputs of on average one word per question that do not tackle the more challenging reasoning problem.

Our proposed dataset LingoQA addresses the existing gap in autonomous driving as it contains a diverse set of questions related to driving behaviours and scenery in addition to perception questions related to object presence and positioning. The evaluation dataset probes areas such as description, counting, localisation, anticipation, attention, and action justification. This dataset has the strength of being diverse with respect to the language used while being grounded in human reasoning. Examples of the complex questions and answers existent in the dataset are provided in Appendix A.

3 LingoQA Benchmark

In this section, we introduce LingoQA, a benchmark to evaluate video question-answering models for autonomous driving.

3.1 Evaluation Dataset

We collected a small, low-density dataset from in-house human labelers, creating both the questions and the answers associated with the short videos. We labeled a small portion of held-out data on 500 human-generated questions using 20+ different evaluators to obtain our test set. Since answers are subjective and noisy, we labeled them twice, making sure the same evaluator does not receive the same question twice. After that, we manually reviewed the answers for semantic disagreements and mistakes. We relabeled such samples two more times and fixed the disagreements, preferring the semantics of the majority of responders but preserving maximal variety in the responses. Finally, we condensed this into 1k high-quality answers to 500 questions, with two correct but diverse answers per question. The dataset evaluates a range of competencies, including action and justification, attention, description, localisation, identification, counting and anticipation, as shown in Figure 2.

3.2 Evaluation Metric

Evaluating open-ended textual dialogues is a challenging task. Quite often the correct answers are ambiguous, subjective, or even not attainable. The most common language-based metrics for evaluating question-answering models in autonomous driving [56, 27, 29] are BLEU [40], METEOR [5] and CIDEr [49], despite their known limitations, such as relying heavily on the n-gram frequency as opposed to the underlying meaning of the answer. To address these limitations, we set ourselves the challenge to develop an automated, non-visual evaluation method for free-form language answers from vision-language models which checks correctness independent of phrasing against a ground truth answer and which is highly correlated with human ratings.

	Pearson	Spearman	Val Acc. [%]	Time [sec]
Lingo-Judge	0.993	0.950	95.0	10.5
GPT-4 with CoT	0.990	0.932	91.2	3016.0
GPT-4 [39]	0.988	0.941	90.6	812.4
BLEU [40]	0.881	0.835	-	0.1
METEOR [5]	0.891	0.876	-	8.0
CIDEr [49]	0.878	0.853	-	0.2

Table 1: Lingo-Judge Performance. Correlation with human ratings, validation accuracy, and time taken to run of our proposed LingoQA evaluation metric compared to previous language-based metrics. All metrics use textual ground truth and have no access to vision information. Further examples are presented in Appendix B.

GPT-4 based evaluation

Inspired by the G-Eval metric [36], we used GPT-4 to evaluate answers on a larger scale. Given a question and answer pair from the test set and a model’s answer, we ask GPT-4 to evaluate whether the model’s answer corresponds to a human’s answer. Notice that it does not make use of any visual input. We experimented with prompts and methods achieving good quality of judgements. We achieved the highest accuracy by employing chain-of-thought prompting where we ask GPT-4 to first come up with an evaluation strategy before grading a model’s answer. However, as shown in Table 1, this leads to increased inference time. Further details are provided in Appendix C. Unfortunately, we found GPT-4 based evaluation impractical to use as a main development and training metric due to the time required to evaluate answers on our relatively small evaluation dataset (from 13min up to 50min for a single evaluation due to the API rate limit).

Lingo-Judge

Given these limitations and inspired by TruthfulQA GPT-Judge [34], we pursued an alternative approach using a learned text classifier, dubbed Lingo-Judge, which estimates the correctness of model answers. We measure the correctness of model predictions as an accuracy using a small transformer-based text classifier that takes in a question, the human’s answer, and the model’s answer and outputs a probability that the model’s answer is correct. Note that Lingo-Judge does not receive video input and must rely only on the supporting human answers. For every question, we run Lingo-Judge on all combinations of (ground-truth answer, predicted answer) and take the maximum correctness estimate, as shown in Equation 1, where $S$ is the score per sample. We found that this recipe yields the best predictive power provided enough diversity of human answers in our evaluation dataset.

$$S=\max_{j\in\{0,1\}}F_{\mathrm{Judge}}\big(\mathrm{prediction},\,\mathrm{ground\_truth}[j]\big)\qquad(1)$$
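As a minimal sketch, Equation 1 can be written in a few lines of Python. The `judge` callable is a hypothetical stand-in for the trained classifier $F_{\mathrm{Judge}}$; it is replaced here by a toy word-overlap function purely so that the snippet runs, and the reference answers are made up for illustration. None of this reflects how the real Lingo-Judge scores answers.

```python
from typing import Callable, Sequence

# Hypothetical signature for F_Judge: (question, ground_truth, prediction) -> probability in [0, 1].
Judge = Callable[[str, str, str], float]


def sample_score(judge: Judge, question: str, prediction: str,
                 ground_truths: Sequence[str]) -> float:
    """Equation 1: keep the highest correctness estimate over all reference answers."""
    return max(judge(question, gt, prediction) for gt in ground_truths)


def toy_judge(question: str, ground_truth: str, prediction: str) -> float:
    """Placeholder scorer (NOT Lingo-Judge): Jaccard overlap of lower-cased words."""
    a, b = set(ground_truth.lower().split()), set(prediction.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


score = sample_score(
    toy_judge,
    question="What color are the traffic lights showing?",
    prediction="the lights are green",
    ground_truths=["green", "the traffic lights are green"],  # made-up references
)
```

Taking the maximum means a prediction is credited if it matches either of the two reference answers, which is why diversity among the human answers matters.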

The architecture of the classifier is a DeBERTa-V3 [20] language model, fine-tuned with LoRA [22]. The classification score is predicted using a linear head on top of the class token output. We fine-tuned the model on a diverse dataset of model predictions from early experiments, where questions and ground truth answers come from our evaluation dataset and the correctness target is labeled by human annotators. On top of this initial dataset, we iteratively improved the classifier using active learning by correcting the wrong predictions of discarded models and adding corrections to the training dataset. On a held-out test set, we find that the binary classification accuracy of the classifier is 95%.

In comparison to metrics such as CIDEr, which provide a system-level performance metric, the classifier provides a probability of correctness for each model prediction, so it yields metrics at the sample level. Examples are provided in Appendix B. Because each prediction is judged individually, the aggregate score is easy to interpret: 100% classifier accuracy means that every prediction was judged correct. The classifier also allows us to compute metrics during training, running over our full evaluation dataset in about 10 seconds on an A100 GPU.
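To illustrate how sample-level probabilities roll up into a single accuracy figure, the sketch below thresholds each probability and averages the resulting decisions. The probabilities are the L-J Prob. values from the first example table later in this document; the 0.5 threshold is an assumption (it is consistent with the L-J Class. column there, but the text does not state it).

```python
# L-J Prob. values copied from the first example table in this document
# (illustrative only, not a full benchmark run).
probs = [0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.05, 0.05, 0.10, 0.31, 0.05, 0.31]

THRESHOLD = 0.5  # assumption: the source does not state the decision threshold


def benchmark_accuracy(probabilities, threshold=THRESHOLD):
    """Fraction of samples whose correctness probability exceeds the threshold."""
    judged_correct = [p > threshold for p in probabilities]
    return sum(judged_correct) / len(judged_correct)


accuracy = benchmark_accuracy(probs)
```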

Correlation to human ratings

We studied the empirical correlation of various metrics with human judgments. Several human annotators assigned a scalar score in [0, 1] to the inference outputs of 17 different models; each score can be interpreted as the probability that the response correctly addresses the question [34]. Notably, this process takes several days, highlighting the need for an automated evaluation metric that provides faster development feedback. The final human score of each model is the average of all inference output scores. Further details regarding the methodology for the correlation analysis are in Appendix D.

The Spearman rank correlation coefficient of our automated metric, Lingo-Judge, with human scores is 0.95, and the Pearson correlation coefficient is 0.993. These values are considerably higher compared to other natural language evaluation metrics and GPT-4, as detailed in Table 1. Our analysis demonstrates that Lingo-Judge accurately mirrors human judgments, outperforming existing metrics such as BLEU, METEOR, and CIDEr, as well as GPT-4 with and without chain-of-thought prompting. This indicates that Lingo-Judge can effectively serve as a proxy for human labelling, which is particularly significant given the stagnant nature of metrics in autonomous driving since the introduction of the CIDEr metric in 2015. Notably, despite their limitations, prominent models like ADAPT [27] and DriveGPT [56] still use BLEU, METEOR, and CIDEr metrics and report ChatGPT ratings without analyzing their correlation to human preferences. Our work fills this gap by providing a reliable benchmark that better reflects human preferences.
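The two coefficients can be recomputed with the standard library alone. The sketch below uses the per-system Lingo-Judge and Human columns from the correlation table later in this document, assuming that its 15 models and two human labeller groups are the 17 systems of the study. Ties receive average ranks, which is one common convention; the result may differ slightly from the reported figures depending on tie handling and on rounding in the tabulated scores.

```python
from math import sqrt

# Per-system scores copied from the correlation table in this document
# (15 models followed by the two human labeller groups).
lingo_judge = [59.6, 59.6, 57.4, 58.2, 59.0, 58.0, 54.8, 50.0, 53.0,
               52.6, 53.0, 51.2, 43.2, 35.8, 33.6, 96.6, 91.2]
human = [0.571, 0.564, 0.563, 0.559, 0.553, 0.552, 0.529, 0.520, 0.510,
         0.509, 0.500, 0.485, 0.371, 0.361, 0.279, 0.934, 0.894]


def pearson(x, y):
    """Pearson correlation: co-variation normalised by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


r_pearson = pearson(lingo_judge, human)
r_spearman = spearman(lingo_judge, human)
```

Pearson measures linear agreement between the raw scores, while Spearman only asks whether the metric ranks the systems in the same order as the human raters, which is the property that matters when comparing models.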

	Scenarios	QA pairs	QA per scenario
Action	24.5k	267.8k	$\approx 10.9$
Scenery	3.5k	152.5k	$\approx 43.6$
Eval. Dataset	100	1000	$10$

Table 2: Dataset Split. It consists of three different datasets of varying annotation densities. The Action dataset focuses on questions related to driving behaviours, the Scenery dataset focuses on perception capabilities, while the evaluation dataset is designed to probe a range of competencies.

	Scenarios	Annotations	QA	Captioning	Video length [sec]
Rank2Tell [44]	118	$>118$	✗	✓	20
BDD-OIA [55]	22.9k	35k	✗	✓	5
BDD-X [29]	6.9k	26k	✗	✓	40
NuScenesQA [42]	34k	460k	✓	✗	20
DriveLM [46]	30k	443k	✓	✓	20
LingoQA	28k	419.9k	✓	✓	4

Table 3: Dataset Features. The dataset that we introduce alongside our benchmark consists of questions related to object presence, as well as action, justification, attention, localisation, counting, anticipation, and counterfactuals. In total, it has a similar size to other driving-related datasets such as NuScenesQA, while having a much higher diversity and not being limited to questions related to object positioning.

3.3 Datasets

Figure 2: Dataset Statistics. Dataset split by the number of question-answer pairs for the competencies covered and for the objects referred. One question-answer pair might cover more than one competency or object, hence the total is higher than the size of the datasets. The Action and Scenery datasets have complementary strengths, with one focused more on action-justification competencies and one more on description and localisation.

We created a collection of datasets for bringing language to autonomous driving. The total dataset size is 419.9k question-answer pairs, where a single data sample consists of a 4-second video clip at 1 Hz. The total size of the dataset is about 10x larger than BDD-X [29], as shown in Table 3. Compared to prior datasets such as NuScenesQA [42], our dataset contains reasoning pairs in addition to object presence, description, and localisation. The answers are also more free-form and more complex, with an average answer length of 17.2 words versus 1.0 words in NuScenesQA. Examples of question-answer pairs from LingoQA are shown in Appendix A.

Our labeled autonomous driving training dataset consists of two complementary parts: the action dataset and the scenery dataset.

Action dataset.

This dataset was made from a recorded driving corpus of interesting events where the car’s behavior changes, such as decelerations, accelerations, lane changes, narrow gaps, and turns. Such events were succinctly labeled by driving operators with very short high-level descriptions of the situations and behavioral policies (e.g. “following lane, pedestrian on a zebra crossing, should stop”). Additionally, we added metadata for such events from various perception systems, such as traffic light presence, vehicle and pedestrian visual detectors, weather descriptors, as well as other metadata (speed, steering wheel position, and road type from the map data). Using this data, we developed prompt templates for (1) describing the current action and its justification and (2) a set of example questions and hints about what the answer should mention. Next, we used those prompts with GPT-3.5 to rephrase, answer, and extend the example questions using the provided action description and answer hints. We rebalanced events by bucketing by actions and behavioral policies and sampled up to 500 events from each bucket without replacement, leading to 24,577 video snippets with 267,774 question/answer pairs (the 267.8k reported in Table 2).
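The rebalancing step can be sketched as follows. The event fields (`action`, `policy`) and the toy corpus are hypothetical; only the cap of 500 events per bucket and sampling without replacement come from the text.

```python
import random
from collections import defaultdict

MAX_PER_BUCKET = 500  # cap stated in the text


def rebalance(events, max_per_bucket=MAX_PER_BUCKET, seed=0):
    """Group events by (action, policy) and sample each bucket without replacement."""
    buckets = defaultdict(list)
    for event in events:
        buckets[(event["action"], event["policy"])].append(event)

    rng = random.Random(seed)
    sampled = []
    for key in sorted(buckets):
        bucket = buckets[key]
        k = min(max_per_bucket, len(bucket))
        sampled.extend(rng.sample(bucket, k))  # random.sample never repeats an element
    return sampled


# Toy corpus: one over-represented bucket and one rare bucket (hypothetical labels).
events = (
    [{"id": i, "action": "following lane", "policy": "keep speed"} for i in range(2000)]
    + [{"id": 2000 + i, "action": "decelerating", "policy": "should stop"} for i in range(40)]
)
balanced = rebalance(events)
```

Capping each bucket keeps frequent behaviours, such as plain lane following, from dominating the dataset, while rare behaviours are retained in full.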

Scenery dataset

The scenery dataset was built to complement the action dataset by focusing on perception-related questions in addition to driving behaviours. The dataset was made by densely and thoroughly labelling three 30-minute driving sessions with the ELAN video annotation software [38]. For the entire duration of the driving sessions, we provided short captions in about 15 different categories:

• Driver’s actions, and their justifications

• Driver’s attention

• Observations about relevant vehicles, pedestrians, and other road actors with their visual descriptions

• Observations about relevant road elements such as traffic lights, traffic islands, lane and intersection structures

• Miscellaneous observations about the environment, such as weather, tube stations, and buildings.

Then, for every keyframe (one per second, i.e. 1 fps), we collect all annotations around this frame and build a textual description containing the driver’s actions and their justifications, the objects requiring the driver’s attention, and the observations. As opposed to the Action dataset, where recommended questions were provided to GPT-3.5 for rephrasing, for the Scenery dataset, we asked GPT-4 to generate questions and answers using a set of generic prompts, but also using a prompt with the chain of thought specifically targeting perception questions. This forced GPT-4 to generate many diverse questions and answers. This led to a high-quality diverse dataset with about 43 QA-pairs per video.
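A minimal sketch of the keyframe description step is shown below. The `Annotation` fields, the category names, and the one-second window are all assumptions; the text only says that annotations "around" each keyframe are collected.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    category: str   # e.g. "action", "attention", "observation" (hypothetical names)
    start: float    # seconds from the start of the session
    end: float
    text: str


def describe_keyframe(annotations, t, window=1.0):
    """Join every caption whose interval overlaps [t - window, t + window].

    `window` is an assumed value; the source does not specify it.
    """
    active = [a for a in annotations if a.start <= t + window and a.end >= t - window]
    active.sort(key=lambda a: (a.category, a.start))
    return "; ".join(f"{a.category}: {a.text}" for a in active)


session = [
    Annotation("action", 0.0, 6.0, "slowing down, red traffic light ahead"),
    Annotation("attention", 2.0, 5.0, "traffic light"),
    Annotation("observation", 30.0, 40.0, "tube station on the left"),
]

# One description per keyframe at 1 fps; shown here for the first five seconds.
descriptions = {t: describe_keyframe(session, float(t)) for t in range(5)}
```

Each resulting description would then serve as the textual context handed to the language model that writes the questions and answers.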

Our training dataset covers 9 different competencies: action (what the vehicle is doing), justification (why the action is taken), attention (what should be paid attention to in the current situation), identification (identifying an object given its description), localisation, description, counting, anticipation and reasoning given counterfactuals. The questions also cover a diverse set of objects, such as pedestrians, vehicles, cyclists, buildings, road infrastructure, signs, markings. In Figure 2, we present the number of question and answer pairs for each of the 9 competencies above, as well as for the referred objects, for our two datasets, namely Action and Scenery. The complementary strengths of the datasets are apparent, with one focused on driving behaviours and one on perception tasks.

4 Model Methodology

We propose LingoQA Baseline, a vision-language model for autonomous driving based on Vicuna v1.5 [13] with 7B parameters that can answer reasoning questions grounded in video inputs. We train a model that consumes a short video segment and produces answers to autonomous driving-related questions.

4.1 Architecture

The LingoQA Baseline model architecture is based on recent VLMs [30, 35, 17] but enhances them by incorporating a video encoding strategy to process multiple frames from a video snippet, as shown in Figure 3.

Vision encoder.

We use CLIP [43], a Vision Transformer (ViT) pre-trained contrastively on image-language pairs, to encode images into features. The inputs to the vision encoder are RGB images from the front camera. We squash the input images to a size of $224\times 224$, as opposed to cropping them, in order to keep the full image context. Subsequently, we pass the features through a transformer network, the Querying Transformer (Q-Former), which, akin to BLIP-2 [30], acts as a bridge between the vision and language feature spaces. The embeddings are then projected into the large language model (LLM) space using a linear projection layer. We repeat this process for $T=5$ frames of the input video and concatenate the tokens from each image.
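The per-frame encode, project, and concatenate scheme can be illustrated at the level of token shapes. In this sketch the CLIP and Q-Former stages are replaced by a random-feature stand-in, the projection weights are random rather than learned, and all dimensions other than $T=5$ are assumed and kept tiny for readability.

```python
import random

T = 5            # frames per clip, as stated in the text
NUM_QUERIES = 4  # Q-Former output tokens per frame (assumed; kept tiny for readability)
D_QFORMER = 3    # Q-Former feature width (assumed)
D_LLM = 6        # language-model embedding width (assumed)

rng = random.Random(0)


def fake_qformer_tokens():
    """Stand-in for CLIP + Q-Former: NUM_QUERIES random feature vectors for one frame."""
    return [[rng.gauss(0, 1) for _ in range(D_QFORMER)] for _ in range(NUM_QUERIES)]


# Linear projection into the LLM embedding space (random here, learned in practice).
W = [[rng.gauss(0, 1) for _ in range(D_QFORMER)] for _ in range(D_LLM)]


def project(token):
    """Matrix-vector product W @ token."""
    return [sum(w * x for w, x in zip(row, token)) for row in W]


def encode_video(num_frames=T):
    """Encode each frame independently, then concatenate the projected tokens in time order."""
    video_tokens = []
    for _ in range(num_frames):
        video_tokens.extend(project(tok) for tok in fake_qformer_tokens())
    return video_tokens


video_tokens = encode_video()
```

Under these assumptions the language model receives `T * NUM_QUERIES` visual tokens per clip, so the visual sequence length grows linearly with the number of frames.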

Large language model.

We leverage pretrained LLMs to give LingoQA Baseline the ability to answer questions related to both driving scenes and general knowledge. We use Vicuna v1.5 [13] with 7B parameters built on top of Llama-2 [47]. The language model is auto-regressive and hence can be conditioned on textual inputs, as well as image tokens. The training objective is to predict the next language token in a sequence. We mask all tokens from the training loss that belong to the text prompt, including question and chat history.
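The loss masking can be sketched as below. The sentinel value `-100`, the token ids, and the per-position probabilities are all illustrative assumptions, and the one-position shift between inputs and targets used in real next-token training is omitted for brevity.

```python
import math

IGNORE = -100  # assumed sentinel for masked positions (a common convention, not from the source)


def build_labels(prompt_ids, answer_ids):
    """Labels for next-token prediction: prompt positions are masked, answer positions kept."""
    return [IGNORE] * len(prompt_ids) + list(answer_ids)


def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_probs[i] is the probability the model assigned to the correct token at position i.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE]
    return sum(kept) / len(kept)


prompt_ids = [11, 12, 13, 14]   # hypothetical ids for the question and chat history
answer_ids = [21, 22]           # hypothetical ids for the answer
labels = build_labels(prompt_ids, answer_ids)

# Made-up per-position probabilities; the first four do not affect the loss.
probs = [0.01, 0.02, 0.03, 0.04, 0.5, 0.25]
loss = masked_nll(probs, labels)
```

Masking the prompt means the model is only penalised for the answer tokens, so gradient signal is not spent on reproducing the question.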

4.2 Training Recipe

Our training uses a two-step approach to better utilise video features and improve learning when answering questions based on video data. In the first stage, we train the self-attention layers for the LLM and the vision encoder (QKV), together with all the Q-Former parameters and the linear language projection layer. In the second stage, we fine-tune the same parameters as in the previous stage, keeping the vision encoder frozen. Further details regarding training parameters are presented in Appendix F.
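One way to express which parameters are updated in each stage is a small lookup, sketched below. The parameter-group names are hypothetical, since the source does not give module names; only the split between trained and frozen components follows the description above.

```python
# Hypothetical parameter-group names; the source does not give module names.
STAGE_TRAINABLE = {
    "pretrain": {"llm.self_attention", "vision_encoder.qkv", "qformer", "language_projection"},
    "finetune": {"llm.self_attention", "qformer", "language_projection"},  # vision encoder frozen
}

ALL_GROUPS = [
    "vision_encoder.qkv", "vision_encoder.mlp", "qformer",
    "language_projection", "llm.self_attention", "llm.mlp", "llm.embeddings",
]


def trainable_flags(stage):
    """Map every parameter group to True (updated) or False (frozen) for a stage."""
    return {group: group in STAGE_TRAINABLE[stage] for group in ALL_GROUPS}


pretrain_flags = trainable_flags("pretrain")
finetune_flags = trainable_flags("finetune")
```

In a deep-learning framework, flags like these would typically be applied by toggling gradient tracking on the matching parameters before each stage.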

	Ablation	Lingo-Judge [%] $\uparrow$	BLEU $\uparrow$	METEOR $\uparrow$	CIDEr $\uparrow$
	LingoQA Baseline	60.80	15.00	18.56	65.61
Training recipe (instead of pre-train and fine-tune)	No fine-tuning	33.60	8.33	14.33	39.16
Training recipe (instead of pre-train and fine-tune)	No pre-training	56.60	13.53	17.91	57.98
Fine-tuning dataset (instead of action and scenery)	Action only	53.80	11.65	17.68	46.50
Fine-tuning dataset (instead of action and scenery)	Scenery only	55.40	13.00	18.38	55.88
Frame count (instead of 5 frames)	Single frame	57.00	14.21	18.40	59.46
Frame count (instead of 5 frames)	3 frames	59.80	14.61	18.44	62.61
Frame count (instead of 5 frames)	7 frames	60.60	14.46	18.61	61.82
Video fusion (instead of late-fusion)	Early-fusion	48.40	13.98	17.61	61.42
Video fusion (instead of late-fusion)	Mid-fusion	59.20	14.44	18.47	63.05
Language model (instead of Vicuna-1.5-7B [13])	OPT-7B [60]	50.00	14.98	15.99	60.08
Language model (instead of Vicuna-1.5-7B [13])	Llama-2-7B-Chat [48]	59.20	13.52	18.43	59.87
Language model (instead of Vicuna-1.5-7B [13])	Mistral-7B-Instruct [26]	58.00	13.80	18.33	64.21

	Human	Lingo-J	GPT-4	BLEU	METEOR	CIDEr
Human	93.4	96.6	4.68	81.04	52.92	361.77
GPT-4V	56.67	59.6	3.19	6.30	12.35	42.82
LLaVA	38.97	49.4	2.45	4.23	8.38	38.39
FUYU	17.69	45.4	2.28	1.90	13.00	12.04

Question	Label	Prediction	CIDEr	GPT4	GPT4-CoT	L-J Prob.	L-J Class.
How many pedestrians are crossing the road?	Zero pedestrians	There are no pedestrians crossing the road.	23.18	5	5	0.96	True
What is the road speed limit?	20 mph - it is written on the road	The road speed limit is 20 mph.	97.91	4	3	0.95	True
How many cars are driving in your direction?	None	There are no cars driving in my direction.	0.11	4	3	0.96	True
Which vehicle should you follow if any?	The motorcyclist.	If any, I should follow the motorcycle ahead.	0.42	4	5	0.95	True
What is the current action and its justification? Answer in the form “action, justification”	Slow down, there is a stationary van infront of us	I am decelerating because of the stationary truck ahead.	34.79	5	5	0.96	True
What is the current action and its justification? Answer in the form “action, justification”	Stop, Red light	I am stopping because the traffic lights to go straight are red.	18.62	5	5	0.95	True
How many cyclists can you see?	I can see 3 cyclists	I can see two cyclists.	150.29	1	2	0.05	False
What color are the traffic lights showing?	The traffic lights are showing green	The traffic lights are showing red.	329.36	0	1	0.05	False
What action are you taking with respect to the cyclist?	Overtaking them on the right and keeping the speed	I am overtaking the cyclist on the left.	349.52	2	2	0.10	False
In which direction is the bus driving?	The bus is driving in the opposite direction	The bus is driving in the oncoming direction.	404.65	4	5	0.31	False
Are there any parked car on the side of the road?	Yes, there are two cars parked on the right of the road	No, there are no parked cars on either side of the road.	142.40	0	0	0.05	False
Is acceleration necessary in this situation? If so, provide the reason.	No. We should decelerate in this situation because there is a vehicle stopping ahead of us.	No, acceleration is not necessary in this situation as I am already driving at the speed limit.	177.96	3	3	0.31	False

Question	Label	Prediction	CIDEr	GPT4	GPT4-CoT	L-J Prob.	L-J Class.
Is any vehicle stopped at the zebra crossing?	There is no zebra crossing.	No, there are no vehicles stopped at the zebra crossing.	106.19	4	2	0.93	True
What elements are capturing your attention and what is the reason behind it?	I am making sure there is no vehicle on the left lane I am making my lane change to. I am also looking at the traffic light to make sure it stays green while I am crossing the intersection.	I am paying attention to the bus lane on the left because I need to maintain a safe distance from it.	0.91	3	1	0.57	True
What is the current action and its justification? Answer in the form “action, justification”	We are taking a left turn, because we are at a t-junction. Then we keep a safe distance from the cyclists in front.	I am decelerating to keep a safe distance from the cyclist ahead of me.	46.45	2	2	0.32	False

	Lingo-J Val. Acc.
LingoQA	89.50
GPT-4V
few-shot (FS)	83.27
concise (C)	81.63
unprompted (U)	83.06
incorrect (I)	89.59
LLaVA
concise (C)	81.43
unprompted (U)	78.12
FUYU	64.89

Parameter	Pre-training	Fine-tuning
Precision	bf16	bf16
Warm-up steps	1000	1000
Maximum steps	100000	10000
Batch size	6	8
Gradient acc. steps	1	1
Learning rate	$5\times 10^{-5}$	$5\times 10^{-5}$
Learning rate scheduler	cosine	cosine
Weight decay	0.1	0.1
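The learning-rate entries in the table above can be turned into a schedule as sketched below. The peak rate, warm-up length, and step budgets are taken from the table; the linear shape of the warm-up and the decay to exactly zero are assumptions, since the table only names a cosine scheduler.

```python
import math

PEAK_LR = 5e-5        # from the table
WARMUP_STEPS = 1000   # from the table


def learning_rate(step, max_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warm-up to peak_lr, then cosine decay to zero at max_steps (both assumed shapes)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


pretrain_lr_mid_warmup = learning_rate(500, max_steps=100_000)
finetune_lr_at_peak = learning_rate(1000, max_steps=10_000)
finetune_lr_at_end = learning_rate(10_000, max_steps=10_000)
```

Because fine-tuning uses a tenth of the pre-training step budget with the same warm-up length, warm-up occupies a proportionally larger share of the fine-tuning run.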

		Lingo-Judge [%] $\uparrow$	BLEU $\uparrow$	METEOR $\uparrow$	CIDEr $\uparrow$	GPT-4 $\uparrow$	Human $\uparrow$
Models	Model A	$59.6$	$15.45$	$18.36$	$66.32$	$3.23$	$0.571$
	Model B	$59.6$	$15.16$	$18.84$	$65.11$	$3.16$	$0.564$
	Model C	$57.4$	$14.87$	$18.52$	$65.49$	$3.08$	$0.563$
	Model D	$58.2$	$14.51$	$18.59$	$66.02$	$3.15$	$0.559$
	Model E	$59.0$	$14.42$	$18.58$	$66.95$	$3.14$	$0.553$
	Model F	$58.0$	$14.82$	$18.89$	$65.43$	$3.11$	$0.552$
	Model G	$54.8$	$14.41$	$17.86$	$64.67$	$2.98$	$0.529$
	Model H	$50.0$	$13.29$	$17.44$	$59.87$	$2.88$	$0.520$
	Model I	$53.0$	$14.63$	$17.98$	$64.45$	$2.96$	$0.510$
	Model J	$52.6$	$12.17$	$17.59$	$50.45$	$3.00$	$0.509$
	Model K	$53.0$	$13.20$	$18.03$	$54.90$	$3.04$	$0.500$
	Model L	$51.2$	$14.69$	$17.83$	$64.51$	$2.91$	$0.485$
	Model M	$43.2$	$13.76$	$17.37$	$60.36$	$2.67$	$0.371$
	Model N	$35.8$	$13.18$	$15.67$	$56.07$	$2.41$	$0.361$
	Model O	$33.6$	$8.33$	$14.33$	$39.16$	$2.07$	$0.279$
Humans	Human labellers group A	$96.6$	$81.04$	$52.92$	$361.77$	$4.68$	$0.934$
Humans	Human labellers group B	$91.2$	$61.72$	$42.57$	$267.87$	$4.3$	$0.894$
