
Are VLMs Ready for Autonomous Driving?
An Empirical Study from the Reliability, Data, and Metric Perspectives

Shaoyuan Xie1    Lingdong Kong2,3    Yuhao Dong2,4    Chonghao Sima2,5
Wenwei Zhang2    Qi Alfred Chen1    Ziwei Liu4    Liang Pan2

1University of California, Irvine    2Shanghai AI Laboratory    3National University of Singapore    4S-Lab, Nanyang Technological University    5The University of Hong Kong

     

About

We introduce 🚙 DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs.
Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving.
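To make this reliability test concrete, below is a minimal, illustrative sketch of the core experimental idea: ask the same driving question once with a clean camera frame, once with a corrupted frame, and once with no image at all, then compare the answers. The question text, file paths, and use of the OpenAI Python client with GPT-4o are assumptions for illustration only; this is not the DriveBench toolkit's evaluation code.

```python
# Illustrative sketch (not the DriveBench pipeline): probe whether a VLM's
# answer actually depends on the visual input by varying the attached frame.
import base64
from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY is set

client = OpenAI()
QUESTION = "What objects are visible in front of the ego vehicle?"  # placeholder prompt


def encode_image(path: str) -> str:
    """Read an image file and return a base64 data URL for the API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def ask(image_path: str | None) -> str:
    """Query GPT-4o with the question, optionally attaching a camera frame."""
    content = [{"type": "text", "text": QUESTION}]
    if image_path is not None:
        content.append({"type": "image_url", "image_url": {"url": encode_image(image_path)}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


# Compare clean, corrupted, and text-only behavior (paths are hypothetical).
clean_answer = ask("frames/clean/CAM_FRONT.jpg")
corrupted_answer = ask("frames/water_splash/CAM_FRONT.jpg")
text_only_answer = ask(None)
print(clean_answer, corrupted_answer, text_only_answer, sep="\n---\n")
```

If the three answers are near-identical, the model is likely leaning on general knowledge or textual cues rather than on the image, which is exactly the failure mode DriveBench is designed to expose.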

📝 Updates

Table of Contents

📊 Benchmark Comparison

| Benchmark | Frames (Test) | QA (Test) | Logic | Evaluation Metrics |
|---|---|---|---|---|
| BDD-X | - | - | None | Language |
| BDD-OIA | - | - | None | F1 Score |
| nuScenes-QA | 36,114 | 83,337 | None | Acc |
| Talk2Car | ~1.8k | 2,447 | None | - |
| nuPrompt | ~36k | ~6k | None | AMOTA |
| DRAMA | - | ~14k | Chain | Language |
| Rank2Tel | - | - | Chain | Accuracy, Language |
| DriveMLLM | 880 | - | None | Acc |
| DriveVLM | - | - | None | GPTctx |
| DriveLM | 4,794 | 15,480 | Graph | Language, GPT |
| DriveBench (Ours) | 19,200 | 20,498 | Graph | Acc, Language, GPT, GPTctx |
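The "GPT" and "GPTctx" entries in the metrics column refer to LLM-as-judge scoring of free-form answers, with the "ctx" variant additionally conditioning on driving context. The sketch below shows the general LLM-as-judge pattern; the rubric, prompt wording, and 0-100 scale are assumptions for illustration, not DriveBench's exact evaluation templates.

```python
# Illustrative LLM-as-judge scoring sketch; the prompt and scale are assumptions,
# not the exact GPT / GPTctx templates used by DriveBench.
from openai import OpenAI

client = OpenAI()


def gpt_score(question: str, reference: str, prediction: str, context: str = "") -> float:
    """Ask an LLM to rate a predicted answer against the reference on a 0-100 scale."""
    prompt = (
        "You are grading answers for an autonomous-driving QA benchmark.\n"
        f"Driving context (may be empty): {context}\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Return only a number from 0 to 100 indicating answer quality."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())


# Example usage with placeholder strings:
# score = gpt_score("Is the pedestrian crossing?", "Yes, on the left crosswalk.", "Yes.")
```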

⚙️ Installation

For details related to installation and environment setup, kindly refer to INSTALL.md.

♨️ Data Preparation

Kindly refer to DATA_PREPAER.md for details on preparing the datasets.

🚀 Getting Started

To learn more about how to use this codebase, kindly refer to GET_STARTED.md.

🚡 Benchmark Results

Benchmark Configuration

- Commercial VLMs
- Open-Source VLMs
- Specialist VLMs

Benchmark Study

| Model | Size | Type | Perception (Clean) | Perception (Corr.) | Perception (T.O.) | Prediction (Clean) | Prediction (Corr.) | Prediction (T.O.) | Planning (Clean) | Planning (Corr.) | Planning (T.O.) | Behavior (Clean) | Behavior (Corr.) | Behavior (T.O.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | 47.67 | 38.32 | - | - | - | - | - | - | - | 69.51 | 54.09 | - |
| GPT-4o | - | Commercial | 35.37 | 35.25 | 36.48 | 51.30 | 49.94 | 49.05 | 75.75 | 75.36 | 73.21 | 45.40 | 44.33 | 50.03 |
| LLaVA-1.5 | 7B | Open | 23.22 | 22.95 | 22.31 | 22.02 | 17.54 | 14.64 | 29.15 | 31.51 | 32.45 | 13.60 | 13.62 | 14.91 |
| LLaVA-1.5 | 13B | Open | 23.35 | 23.37 | 22.37 | 36.98 | 37.78 | 23.98 | 34.26 | 34.99 | 38.85 | 32.99 | 32.43 | 32.79 |
| LLaVA-NeXT | 7B | Open | 24.15 | 19.62 | 13.86 | 35.07 | 35.89 | 28.36 | 45.27 | 44.36 | 27.58 | 48.16 | 39.44 | 11.92 |
| InternVL2 | 8B | Open | 32.36 | 32.68 | 33.60 | 45.52 | 37.93 | 48.89 | 53.27 | 55.25 | 34.56 | 54.58 | 40.78 | 20.14 |
| Phi-3 | 4.2B | Open | 22.88 | 23.93 | 28.26 | 40.11 | 37.27 | 22.61 | 60.03 | 61.31 | 46.88 | 45.20 | 44.57 | 28.22 |
| Phi-3.5 | 4.2B | Open | 27.52 | 27.51 | 28.26 | 45.13 | 38.21 | 4.92 | 31.91 | 28.36 | 46.30 | 37.89 | 49.13 | 39.16 |
| Oryx | 7B | Open | 17.02 | 15.97 | 18.47 | 48.13 | 46.63 | 12.77 | 53.57 | 55.76 | 48.26 | 33.92 | 33.81 | 23.94 |
| Qwen2-VL | 7B | Open | 28.99 | 27.85 | 35.16 | 37.89 | 39.55 | 37.77 | 57.04 | 54.78 | 41.66 | 49.07 | 47.68 | 54.48 |
| Qwen2-VL | 72B | Open | 30.13 | 26.92 | 17.70 | 49.35 | 43.49 | 5.57 | 61.30 | 63.07 | 53.35 | 51.26 | 49.78 | 39.46 |
| DriveLM | 7B | Specialist | 16.85 | 16.00 | 8.75 | 44.33 | 39.71 | 4.70 | 68.71 | 67.60 | 65.24 | 42.78 | 40.37 | 27.83 |
| Dolphins | 7B | Specialist | 9.59 | 10.84 | 11.01 | 32.66 | 29.88 | 39.98 | 52.91 | 53.77 | 60.98 | 8.81 | 8.25 | 11.92 |

Robustness Analysis

| Model | Size | Type | Weather MCQ | Weather VQA | Weather CAP | External MCQ | External VQA | External CAP | Sensor MCQ | Sensor VQA | Sensor CAP | Motion MCQ | Motion VQA | Motion CAP | Transmission MCQ | Transmission VQA | Transmission CAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | Commercial | 57.20 | 57.28 | 54.90 | 29.25 | 56.60 | 61.98 | 44.25 | 54.95 | 56.53 | 34.25 | 59.20 | 56.25 | 36.83 | 53.95 | 57.57 |
| LLaVA-1.5 | 7B | Open | 69.70 | 35.49 | 35.91 | 26.50 | 29.17 | 34.95 | 18.83 | 30.64 | 33.15 | 71.25 | 33.43 | 35.18 | 10.17 | 27.28 | 34.38 |
| LLaVA-1.5 | 13B | Open | 61.60 | 39.76 | 37.76 | 15.50 | 34.55 | 37.83 | 24.08 | 35.48 | 36.08 | 79.75 | 36.46 | 36.42 | 15.50 | 32.53 | 34.33 |
| LLaVA-NeXT | 7B | Open | 69.70 | 36.96 | 48.52 | 48.50 | 30.32 | 57.18 | 21.83 | 30.40 | 44.37 | 66.00 | 34.20 | 50.44 | 11.83 | 29.43 | 53.50 |
| InternVL2 | 8B | Open | 59.90 | 48.72 | 48.60 | 50.75 | 47.74 | 57.82 | 29.92 | 45.06 | 51.14 | 68.25 | 49.51 | 49.67 | 30.00 | 43.42 | 54.24 |
| Phi-3 | 4.2B | Open | 40.00 | 40.59 | 45.61 | 25.00 | 31.44 | 45.99 | 16.83 | 35.58 | 43.71 | 31.25 | 42.92 | 48.43 | 27.67 | 33.04 | 41.35 |
| Phi-3.5 | 4.2B | Open | 60.60 | 41.82 | 45.97 | 21.25 | 36.89 | 30.95 | 25.58 | 34.66 | 39.30 | 33.00 | 46.03 | 49.33 | 39.67 | 33.47 | 39.67 |
| Oryx | 7B | Open | 53.20 | 40.43 | 48.95 | 45.00 | 40.68 | 56.06 | 50.50 | 36.71 | 48.55 | 72.50 | 40.01 | 48.33 | 39.67 | 36.98 | 49.87 |
| Qwen2-VL | 7B | Open | 76.70 | 49.33 | 45.12 | 37.50 | 47.62 | 51.24 | 22.83 | 39.45 | 47.23 | 57.00 | 47.40 | 47.74 | 35.83 | 42.31 | 48.60 |
| Qwen2-VL | 72B | Open | 59.80 | 51.05 | 48.55 | 45.50 | 50.57 | 57.25 | 52.25 | 45.89 | 48.59 | 58.25 | 50.85 | 47.88 | 44.83 | 46.23 | 50.50 |
| DriveLM | 7B | Specialist | 21.20 | 42.86 | 20.04 | 21.25 | 37.49 | 21.92 | 9.00 | 36.68 | 15.56 | 22.25 | 42.05 | 17.07 | 17.50 | 39.56 | 10.37 |
| Dolphins | 7B | Specialist | 54.30 | 30.21 | 31.08 | 3.00 | 30.42 | 29.38 | 9.42 | 26.83 | 26.30 | 9.25 | 29.82 | 28.05 | 21.50 | 28.86 | 27.65 |
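The corruption groups above (Weather, External, Sensor, Motion, Transmission) are perturbations applied to the camera inputs. The snippet below sketches two representative perturbations, motion blur and additive Gaussian noise, using OpenCV and NumPy; the kernel size and noise level are arbitrary illustrative choices, not the benchmark's calibrated severity settings.

```python
# Illustrative corruption sketch: motion blur and sensor-style Gaussian noise.
# Parameters are arbitrary examples, not DriveBench's calibrated severities.
import cv2
import numpy as np


def motion_blur(img: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Simulate horizontal motion blur with a normalized 1-D averaging kernel."""
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size
    return cv2.filter2D(img, -1, kernel)


def gaussian_noise(img: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Add zero-mean Gaussian noise, emulating a degraded camera sensor."""
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)


# Example usage with a hypothetical front-camera frame:
# frame = cv2.imread("CAM_FRONT.jpg")
# cv2.imwrite("CAM_FRONT_motion.jpg", motion_blur(frame))
# cv2.imwrite("CAM_FRONT_noise.jpg", gaussian_noise(frame))
```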

Qualitative Comparisons

Examples of different VLM responses under the Frame Lost condition. We observe that GPT-4o responds based on visible objects, while LLaVA-NeXT and DriveLM tend to hallucinate objects that cannot be seen in the provided images.

Examples of different VLM responses under the Water Splash condition. We observe that, under severe visual corruptions, VLMs respond with ambiguous and general answers based on their learned knowledge, without referring to the visual information. Most responses mention traffic signals and pedestrians, even though they are not visible in the provided images.

Citation

If you find this work helpful, please kindly consider citing our paper:

@article{xie2025drivebench,
  author  = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
  title   = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
  journal = {arXiv preprint arXiv:2501.04003},
  year    = {2025},
}

License

This work is released under the Apache License 2.0, while some specific implementations in this codebase may be under other licenses. Kindly refer to LICENSE.md for a more careful check if you are using our code for commercial purposes.

Acknowledgments

To be updated.
