
Are VLMs Ready for Autonomous Driving?
An Empirical Study from the Reliability, Data, and Metric Perspectives

Shaoyuan Xie1    Lingdong Kong2,3    Yuhao Dong2,4    Chonghao Sima2,5
Wenwei Zhang2    Qi Alfred Chen1    Ziwei Liu4    Liang Pan2

1University of California, Irvine    2Shanghai AI Laboratory    3National University of Singapore    4S-Lab, Nanyang Technological University    5The University of Hong Kong

     

About

We introduce 🚙 DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs.
Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving.
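To make this reliability test concrete, below is a minimal, illustrative sketch of the core experimental idea: ask the same driving question once with a clean camera frame, once with a corrupted frame, and once with no image at all, then compare the answers. The question text, file paths, and use of the OpenAI Python client with GPT-4o are assumptions for illustration only; this is not the DriveBench toolkit's evaluation code.

```python
# Illustrative sketch (not the DriveBench pipeline): probe whether a VLM's
# answer actually depends on the visual input by varying the attached frame.
import base64
from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY is set

client = OpenAI()
QUESTION = "What objects are visible in front of the ego vehicle?"  # placeholder prompt


def encode_image(path: str) -> str:
    """Read an image file and return a base64 data URL for the API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def ask(image_path: str | None) -> str:
    """Query GPT-4o with the question, optionally attaching a camera frame."""
    content = [{"type": "text", "text": QUESTION}]
    if image_path is not None:
        content.append({"type": "image_url", "image_url": {"url": encode_image(image_path)}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


# Compare clean, corrupted, and text-only behavior (paths are hypothetical).
clean_answer = ask("frames/clean/CAM_FRONT.jpg")
corrupted_answer = ask("frames/water_splash/CAM_FRONT.jpg")
text_only_answer = ask(None)
print(clean_answer, corrupted_answer, text_only_answer, sep="\n---\n")
```

If the three answers are near-identical, the model is likely leaning on general knowledge or textual cues rather than on the image, which is exactly the failure mode DriveBench is designed to expose.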

📝 Updates

Table of Contents

📊 Benchmark Comparison

| Benchmark | Frames (Test) | QA (Test) | Logic | Evaluation Metrics |
|---|---|---|---|---|
| BDD-X | - | - | None | Language |
| BDD-OIA | - | - | None | F1 Score |
| nuScenes-QA | 36,114 | 83,337 | None | Acc |
| Talk2Car | ~1.8k | 2,447 | None | - |
| nuPrompt | ~36k | ~6k | None | AMOTA |
| DRAMA | - | ~14k | Chain | Language |
| Rank2Tel | - | - | Chain | Accuracy, Language |
| DriveMLLM | 880 | - | None | Acc |
| DriveVLM | - | - | None | GPTctx |
| DriveLM | 4,794 | 15,480 | Graph | Language, GPT |
| DriveBench (Ours) | 19,200 | 20,498 | Graph | Acc, Language, GPT, GPTctx |
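The "GPT" and "GPTctx" entries in the metrics column refer to LLM-as-judge scoring of free-form answers, with the "ctx" variant additionally conditioning on driving context. The sketch below shows the general LLM-as-judge pattern; the rubric, prompt wording, and 0-100 scale are assumptions for illustration, not DriveBench's exact evaluation templates.

```python
# Illustrative LLM-as-judge scoring sketch; the prompt and scale are assumptions,
# not the exact GPT / GPTctx templates used by DriveBench.
from openai import OpenAI

client = OpenAI()


def gpt_score(question: str, reference: str, prediction: str, context: str = "") -> float:
    """Ask an LLM to rate a predicted answer against the reference on a 0-100 scale."""
    prompt = (
        "You are grading answers for an autonomous-driving QA benchmark.\n"
        f"Driving context (may be empty): {context}\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Return only a number from 0 to 100 indicating answer quality."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())


# Example usage with placeholder strings:
# score = gpt_score("Is the pedestrian crossing?", "Yes, on the left crosswalk.", "Yes.")
```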

⚙️ Installation

For details related to installation and environment setup, kindly refer to INSTALL.md.

♨️ Data Preparation

Kindly refer to DATA_PREPAER.md for details on preparing the datasets.

🚀 Getting Started

To learn more about how to use this codebase, kindly refer to GET_STARTED.md.

🚡 Benchmark Results

Benchmark Configuration

- Commercial VLMs
- Open-Source VLMs
- Specialist VLMs

Benchmark Study

| Model | Size | Type | Perception (Clean) | Perception (Corr.) | Perception (T.O.) | Prediction (Clean) | Prediction (Corr.) | Prediction (T.O.) | Planning (Clean) | Planning (Corr.) | Planning (T.O.) | Behavior (Clean) | Behavior (Corr.) | Behavior (T.O.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | 47.67 | 38.32 | - | - | - | - | - | - | - | 69.51 | 54.09 | - |
| GPT-4o | - | Commercial | 35.37 | 35.25 | 36.48 | 51.30 | 49.94 | 49.05 | 75.75 | 75.36 | 73.21 | 45.40 | 44.33 | 50.03 |
| LLaVA-1.5 | 7B | Open | 23.22 | 22.95 | 22.31 | 22.02 | 17.54 | 14.64 | 29.15 | 31.51 | 32.45 | 13.60 | 13.62 | 14.91 |
| LLaVA-1.5 | 13B | Open | 23.35 | 23.37 | 22.37 | 36.98 | 37.78 | 23.98 | 34.26 | 34.99 | 38.85 | 32.99 | 32.43 | 32.79 |
| LLaVA-NeXT | 7B | Open | 24.15 | 19.62 | 13.86 | 35.07 | 35.89 | 28.36 | 45.27 | 44.36 | 27.58 | 48.16 | 39.44 | 11.92 |
| InternVL2 | 8B | Open | 32.36 | 32.68 | 33.60 | 45.52 | 37.93 | 48.89 | 53.27 | 55.25 | 34.56 | 54.58 | 40.78 | 20.14 |
| Phi-3 | 4.2B | Open | 22.88 | 23.93 | 28.26 | 40.11 | 37.27 | 22.61 | 60.03 | 61.31 | 46.88 | 45.20 | 44.57 | 28.22 |
| Phi-3.5 | 4.2B | Open | 27.52 | 27.51 | 28.26 | 45.13 | 38.21 | 4.92 | 31.91 | 28.36 | 46.30 | 37.89 | 49.13 | 39.16 |
| Oryx | 7B | Open | 17.02 | 15.97 | 18.47 | 48.13 | 46.63 | 12.77 | 53.57 | 55.76 | 48.26 | 33.92 | 33.81 | 23.94 |
| Qwen2-VL | 7B | Open | 28.99 | 27.85 | 35.16 | 37.89 | 39.55 | 37.77 | 57.04 | 54.78 | 41.66 | 49.07 | 47.68 | 54.48 |
| Qwen2-VL | 72B | Open | 30.13 | 26.92 | 17.70 | 49.35 | 43.49 | 5.57 | 61.30 | 63.07 | 53.35 | 51.26 | 49.78 | 39.46 |
| DriveLM | 7B | Specialist | 16.85 | 16.00 | 8.75 | 44.33 | 39.71 | 4.70 | 68.71 | 67.60 | 65.24 | 42.78 | 40.37 | 27.83 |
| Dolphins | 7B | Specialist | 9.59 | 10.84 | 11.01 | 32.66 | 29.88 | 39.98 | 52.91 | 53.77 | 60.98 | 8.81 | 8.25 | 11.92 |

Robustness Analysis

| Model | Size | Type | Weather MCQ | Weather VQA | Weather CAP | External MCQ | External VQA | External CAP | Sensor MCQ | Sensor VQA | Sensor CAP | Motion MCQ | Motion VQA | Motion CAP | Transmission MCQ | Transmission VQA | Transmission CAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | Commercial | 57.20 | 57.28 | 54.90 | 29.25 | 56.60 | 61.98 | 44.25 | 54.95 | 56.53 | 34.25 | 59.20 | 56.25 | 36.83 | 53.95 | 57.57 |
| LLaVA-1.5 | 7B | Open | 69.70 | 35.49 | 35.91 | 26.50 | 29.17 | 34.95 | 18.83 | 30.64 | 33.15 | 71.25 | 33.43 | 35.18 | 10.17 | 27.28 | 34.38 |
| LLaVA-1.5 | 13B | Open | 61.60 | 39.76 | 37.76 | 15.50 | 34.55 | 37.83 | 24.08 | 35.48 | 36.08 | 79.75 | 36.46 | 36.42 | 15.50 | 32.53 | 34.33 |
| LLaVA-NeXT | 7B | Open | 69.70 | 36.96 | 48.52 | 48.50 | 30.32 | 57.18 | 21.83 | 30.40 | 44.37 | 66.00 | 34.20 | 50.44 | 11.83 | 29.43 | 53.50 |
| InternVL2 | 8B | Open | 59.90 | 48.72 | 48.60 | 50.75 | 47.74 | 57.82 | 29.92 | 45.06 | 51.14 | 68.25 | 49.51 | 49.67 | 30.00 | 43.42 | 54.24 |
| Phi-3 | 4.2B | Open | 40.00 | 40.59 | 45.61 | 25.00 | 31.44 | 45.99 | 16.83 | 35.58 | 43.71 | 31.25 | 42.92 | 48.43 | 27.67 | 33.04 | 41.35 |
| Phi-3.5 | 4.2B | Open | 60.60 | 41.82 | 45.97 | 21.25 | 36.89 | 30.95 | 25.58 | 34.66 | 39.30 | 33.00 | 46.03 | 49.33 | 39.67 | 33.47 | 39.67 |
| Oryx | 7B | Open | 53.20 | 40.43 | 48.95 | 45.00 | 40.68 | 56.06 | 50.50 | 36.71 | 48.55 | 72.50 | 40.01 | 48.33 | 39.67 | 36.98 | 49.87 |
| Qwen2-VL | 7B | Open | 76.70 | 49.33 | 45.12 | 37.50 | 47.62 | 51.24 | 22.83 | 39.45 | 47.23 | 57.00 | 47.40 | 47.74 | 35.83 | 42.31 | 48.60 |
| Qwen2-VL | 72B | Open | 59.80 | 51.05 | 48.55 | 45.50 | 50.57 | 57.25 | 52.25 | 45.89 | 48.59 | 58.25 | 50.85 | 47.88 | 44.83 | 46.23 | 50.50 |
| DriveLM | 7B | Specialist | 21.20 | 42.86 | 20.04 | 21.25 | 37.49 | 21.92 | 9.00 | 36.68 | 15.56 | 22.25 | 42.05 | 17.07 | 17.50 | 39.56 | 10.37 |
| Dolphins | 7B | Specialist | 54.30 | 30.21 | 31.08 | 3.00 | 30.42 | 29.38 | 9.42 | 26.83 | 26.30 | 9.25 | 29.82 | 28.05 | 21.50 | 28.86 | 27.65 |
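The corruption groups above (Weather, External, Sensor, Motion, Transmission) are perturbations applied to the camera inputs. The snippet below sketches two representative perturbations, motion blur and additive Gaussian noise, using OpenCV and NumPy; the kernel size and noise level are arbitrary illustrative choices, not the benchmark's calibrated severity settings.

```python
# Illustrative corruption sketch: motion blur and sensor-style Gaussian noise.
# Parameters are arbitrary examples, not DriveBench's calibrated severities.
import cv2
import numpy as np


def motion_blur(img: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Simulate horizontal motion blur with a normalized 1-D averaging kernel."""
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size
    return cv2.filter2D(img, -1, kernel)


def gaussian_noise(img: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Add zero-mean Gaussian noise, emulating a degraded camera sensor."""
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)


# Example usage with a hypothetical front-camera frame:
# frame = cv2.imread("CAM_FRONT.jpg")
# cv2.imwrite("CAM_FRONT_motion.jpg", motion_blur(frame))
# cv2.imwrite("CAM_FRONT_noise.jpg", gaussian_noise(frame))
```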

Qualitative Comparisons

Examples of different VLM responses under the Frame Lost condition. We observe that GPT-4o responds based on visible objects, while LLaVA-NeXT and DriveLM tend to hallucinate objects that cannot be seen in the provided images.

Examples of different VLM responses under the Water Splash condition. We observe that, under severe visual corruptions, VLMs respond with ambiguous and general answers based on their learned knowledge, without referring to the visual information. Most responses mention traffic signals and pedestrians, even though they are not visible in the provided images.

Citation

If you find this work helpful, please kindly consider citing our paper:

@article{xie2025drivebench,
  author  = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
  title   = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
  journal = {arXiv preprint arXiv:2501.04003},
  year    = {2025},
}

License

This work is released under the Apache License 2.0, while some specific implementations in this codebase may be under other licenses. Kindly refer to LICENSE.md for a more careful check if you are using our code for commercial purposes.

Acknowledgments

To be updated.
