VulDetectBench is a benchmark designed to evaluate the vulnerability detection capabilities of Large Language Models (LLMs).


Overview

While LLMs excel in code comprehension and generation, their ability to detect program vulnerabilities has been less explored. VulDetectBench addresses this by assessing LLMs through five increasingly difficult tasks.

Dataset Curation

Figure: VulDetectBench Curation Pipeline.

Experiment Results

Our test results show that existing LLMs perform well on simple analysis tasks such as vulnerability existence detection and CWE type inference. On tasks that require pinpointing specific vulnerability details, performance varies from LLM to LLM, and overall performance is not yet satisfactory.

Figure: Top 8 LLMs' ability on vulnerability detection.

VulDetectBench

Task 1: Vulnerability Existence Detection

Statistics

Number of Samples | Vulnerability Types | Minimal Token Count | Maximal Token Count
1000 | 48 | 50 | 3493

Format

{    "system":"Assuming you are an experienced code vulnerability analyst and the following code may have vulnerabilities.",    "user":"Is the code vulnerable?(YES/NO)"+{code}+"Your answer should either be 'YES' or 'NO' only.",    "answer":"YES"/"NO"}

Metrics

  • For a single test case, we will show:

    • hit: whether the model correctly classifies the case.
  • For overall performance, we will show:

    • Accuracy on the entire benchmark
    • F1-Score on the entire benchmark (a scoring sketch follows below)
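
As an illustration of how these aggregate scores can be computed, here is a minimal sketch assuming the predictions and gold labels are plain "YES"/"NO" strings, with "YES" treated as the positive class; it is not the library's internal implementation.

```python
from typing import Dict, List

def task1_scores(predictions: List[str], answers: List[str]) -> Dict[str, float]:
    """Aggregate per-sample hits into Accuracy and F1 for Task 1."""
    tp = fp = tn = fn = 0
    for pred, gold in zip(predictions, answers):
        pred, gold = pred.strip().upper(), gold.strip().upper()
        if gold == "YES" and pred == "YES":
            tp += 1
        elif gold == "YES":
            fn += 1
        elif pred == "YES":
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / max(len(answers), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"accuracy": accuracy, "f1": f1}

print(task1_scores(["YES", "NO", "YES"], ["YES", "YES", "NO"]))
# {'accuracy': 0.333..., 'f1': 0.5}
```
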
Task 2: Vulnerability Type Inference

Statistics

Number of Samples | Vulnerability Types | Minimal Token Count | Maximal Token Count
500 | 48 | 265 | 3372

Format

{    "system": "You are an outstanding code vulnerability analyst and expert in single-choice questions.You are only able to pick up 1 answer from given choices.",    "user": "What is the vulnerability type of the code?(A/B/C/D/E)      A.~      B.~      C.~      D.~      E.~", + {code}+"output 'A.' or 'B.' or 'C.' or 'D.' or 'E.' only.",    "answer":"X|Y"//(X is optimal option,Y is sub-optimal option)}

Metrics

  • For a single sample, we will show:

    • Strict Evaluation Score (SE): when the model hits X, SE = 1; when it hits Y, SE = 0.5; otherwise SE = 0.
    • Moderate Evaluation Score (ME): when the model hits X or Y, ME = 1; otherwise ME = 0.
  • For overall performance, we will show:

    • Average SE on the entire benchmark
    • Average ME on the entire benchmark (a scoring sketch follows below)
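
For concreteness, here is a minimal scoring sketch for one Task 2 sample, assuming the model outputs a single option such as "B." and the gold answer uses the "X|Y" format shown above; this is an illustration, not the library's code.

```python
from typing import Tuple

def task2_scores(model_output: str, answer: str) -> Tuple[float, float]:
    """Return (SE, ME) for a single Task 2 sample, given a gold answer 'X|Y'."""
    optimal, suboptimal = answer.split("|")
    # Keep only the leading option letter, e.g. "B." or "b. Buffer Overflow" -> "B".
    choice = model_output.strip().upper()[:1]
    se = 1.0 if choice == optimal else (0.5 if choice == suboptimal else 0.0)
    me = 1.0 if choice in (optimal, suboptimal) else 0.0
    return se, me

print(task2_scores("C.", "B|C"))  # (0.5, 1.0)
```
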
Task 3: Key Objects & Functions Identification

Statistics

Number of Samples | Vulnerability Types | Minimal Token Count | Maximal Token Count
100 | 38 | 1017 | 3269

Format

{    "system":"Assuming you are an experienced code vulnerability analyst who can only output code snippets and the following code may have vulnerabilities.",    "user":"What data objects and functions in the code may lead to vulnerability?"+{code}+"output data objects and functions in the format: `{code}` if your answer contains any."    "answer":"{object1} {object2} ..."}

Metrics

  • For a single sample, we will show:

    • Token Recall: number of correct tokens in the model's output / number of gold tokens in the answer.
  • For overall performance, we will show (an aggregation sketch follows below):

    • Macro Average Recall (MAR):

    $$\mathrm{MAR}=\frac{1}{n}\sum_{i=1}^n\frac{TP_i}{TP_i+FN_i}$$

    • Micro Average Recall (MIR):

    $$\mathrm{MIR}=\frac{\sum_{i=1}^n TP_i}{\sum_{i=1}^n(TP_i+FN_i)}$$
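
The aggregation above can be sketched as follows, assuming each sample's output and gold answer have already been tokenized into sets (the benchmark's own evaluation presumably relies on NLTK, given the punkt_tab requirement in Step 1); this simplified version is only illustrative.

```python
from typing import List, Set, Tuple

def mar_mir(samples: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Aggregate per-sample token recall into MAR (macro) and MIR (micro).

    Each element of `samples` is (output_tokens, gold_tokens) for one case.
    """
    per_sample_recall = []
    total_tp, total_gold = 0, 0
    for output_tokens, gold_tokens in samples:
        tp = len(output_tokens & gold_tokens)             # correctly recovered gold tokens
        per_sample_recall.append(tp / max(len(gold_tokens), 1))
        total_tp += tp
        total_gold += len(gold_tokens)
    mar = sum(per_sample_recall) / max(len(samples), 1)   # macro: mean of per-sample recalls
    mir = total_tp / max(total_gold, 1)                   # micro: pooled recall over all samples
    return mar, mir

print(mar_mir([({"buf", "memcpy"}, {"buf", "memcpy", "len"}),
               ({"ptr"}, {"ptr"})]))  # (0.833..., 0.75)
```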

Task 4: Root Cause Location

Statistics

Number of Samples | Vulnerability Types | Minimal Token Count | Maximal Token Count
100 | 38 | 1010 | 3262

Format

{    "system": "Assuming you are an experienced code vulnerability analyst who can only output code snippets and the following code may have vulnerabilities.",    "user":"Which line of code is the root cause point of the vulnerability?"+{code}"output your answer code in the format: `{code}`",    "answer":`{root cause point}`}

Metrics

  • For a single sample, we will show:

    • Union line-of-code recall score (URS):

    $$\mathrm{URS}=\frac{|\mathrm{Line_{output}}\cap\mathrm{Line_{answer}}|}{|\mathrm{Line_{output}}\cup\mathrm{Line_{answer}}|}$$

    • Original line-of-code recall score (ORS):

    $$\mathrm{ORS}=\frac{|\mathrm{Line_{output}}\cap\mathrm{Line_{answer}}|}{|\mathrm{Line_{answer}}|}$$

  • For overall performance, we will show:

    • Average URS on the entire benchmark
    • Average ORS on the entire benchmark (a scoring sketch follows below)
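
A minimal sketch of these two line-level scores, assuming both the model output and the gold answer have been split into collections of code lines and are matched after whitespace normalisation; this is illustrative, not the library's implementation.

```python
from typing import Iterable, Tuple

def urs_ors(output_lines: Iterable[str], answer_lines: Iterable[str]) -> Tuple[float, float]:
    """Return (URS, ORS) for one sample from two collections of code lines."""
    norm = lambda lines: {" ".join(l.split()) for l in lines if l.strip()}
    out, gold = norm(output_lines), norm(answer_lines)
    inter = out & gold
    urs = len(inter) / max(len(out | gold), 1)  # overlap over the union of both line sets
    ors = len(inter) / max(len(gold), 1)        # overlap over the answer lines only
    return urs, ors

print(urs_ors(["memcpy(buf, src, len);", "free(buf);"],
              ["memcpy(buf, src, len);"]))  # (0.5, 1.0)
```
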
Task 5: Trigger Point Location

Statistics

Number of Samples | Vulnerability Types | Minimal Token Count | Maximal Token Count
100 | 38 | 1011 | 3363

Format

{    "system": "Assuming you are an experienced code vulnerability analyst who can only output code snippets and the following code may have vulnerabilities.",    "user":"Which line of code is the trigger point of the vulnerability?"+{code}"output your answer code in the format: `{code}`",    "answer":`{trigger point}`}

Metrics

  • For a single sample, we will show:

    • Union line-of-code recall score (URS):

    $$\mathrm{URS}=\frac{|\mathrm{Line_{output}}\cap\mathrm{Line_{answer}}|}{|\mathrm{Line_{output}}\cup\mathrm{Line_{answer}}|}$$

    • Original line-of-code recall score (ORS):

    $$\mathrm{ORS}=\frac{|\mathrm{Line_{output}}\cap\mathrm{Line_{answer}}|}{|\mathrm{Line_{answer}}|}$$

  • For overall performance, we will show:

    • Average URS on the entire benchmark
    • Average ORS on the entire benchmark (URS and ORS are computed exactly as in Task 4)

How to use

Step 1

To get the test sets, you can clone this repository:

git clone https://github.com/Sweetaroo/VulDetectBench.git

The dataset is under VulDetectBench/dataset/test.

To install the library, you can use pip:

pip install vuldetectbench

Then, to prepare for computing the metrics, run the following in your terminal:

python
>>> import nltk
>>> nltk.download('punkt_tab')

Step 2

vuldetectbench is used in Python programs.

To run the benchmarks on your model, you first need to create a subclass of the Agent class (referred to below as SubAgent), which is used to generate answers with your model. Typically the subclass consists of two methods, __init__ and __call__.

  • __init__: used for model setup. You should implement everything necessary in this method, such as loading tokenizers, deploying the model, creating an API client, etc.
  • __call__: used only for answer generation. It receives one additional parameter, prompt, which is formatted as follows:
{"system":"Assuming you are an experienced code vulnerability analyst...","user":"Which line of code is ...?"}

__call__ returns a plain str, the model's output. A minimal example is sketched below.
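
As a concrete starting point, here is a minimal SubAgent sketch assuming a Hugging Face causal LM; the model name, generation settings, and prompt assembly are illustrative choices, not requirements of the library.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from vuldetectbench.generation import Agent

class SubAgent(Agent):
    def __init__(self, model_name: str = "codellama/CodeLlama-7b-Instruct-hf"):
        # NOTE: call super().__init__() here if the Agent base class requires it.
        # Model setup: load the tokenizer and weights once.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    def __call__(self, prompt: dict) -> str:
        # `prompt` carries the "system" and "user" fields shown above;
        # return the model's answer as a plain str.
        text = prompt["system"] + "\n" + prompt["user"]
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256)
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
```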

Step 3

Create instances of SubAgent, Tasks, and VulDetectBench_Engine.

from vuldetectbench.generation import Tasks, Agent, VulDetectBench_Engine

model = SubAgent()
tasks = Tasks(data_dir='...', task_no=..., method=...)
engine = VulDetectBench_Engine(model=model, task_and_metrics=tasks, save_path='...', verbose=...)

Here's a simple explanation of the parameters (a concrete example follows the table):

Param | Type | Usage
data_dir | str | Determines where the test data is.
task_no | int, List[int], None | Determines which task(s) to take. None: take all tasks at once. int: take one specific task. List[int]: take two or more specific tasks.
save_path | str, None | Determines the path for saving evaluation reports. None: do not save, print to the terminal. str: save under the given path.
verbose | bool | True: generate detailed reports containing the score on each sample and on the entire benchmark. False: generate simple reports containing only the score on the entire benchmark.
method | str ('cot', 'few-shot'), None | None (default): use the default prompting strategy in all tasks. 'cot': use Chain-of-Thought prompting in tasks 3, 4 and 5. 'few-shot': use few-shot (2-shot) prompting in tasks 3, 4 and 5.
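
For example, an illustrative configuration that runs tasks 3-5 with Chain-of-Thought prompting and saves detailed reports could look like this; the paths and settings are assumptions to adapt to your setup.

```python
from vuldetectbench.generation import Tasks, VulDetectBench_Engine

model = SubAgent()  # the subclass sketched in Step 2
tasks = Tasks(data_dir="VulDetectBench/dataset/test",  # cloned in Step 1 (assumed path)
              task_no=[3, 4, 5],
              method="cot")
engine = VulDetectBench_Engine(model=model,
                               task_and_metrics=tasks,
                               save_path="./reports",
                               verbose=True)
```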

Step 4

Execute engine.run() to perform the tasks.

./demos/gpt.py provides a simple demo that runs VulDetectBench using the GPT-3.5-Turbo API.

Reference

If you use or reference our work, please cite our paper:

@misc{liu2024vuldetectbenchevaluatingdeepcapability,
    title={VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models},
    author={Yu Liu and Lang Gao and Mingxin Yang and Yu Xie and Ping Chen and Xiaojin Zhang and Wei Chen},
    year={2024},
    eprint={2406.07595},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://arxiv.org/abs/2406.07595},
}
