poloclub/llm-landscape
NeurIPS'24 - LLM Safety Landscape
You can visualize the safety and capability landscapes of your own LLM!
- Plot the safety basin of your own model: if you make small, random perturbations to the model's weights, it stays as safe as the original model within a certain range. Once these perturbations grow large enough, there is a tipping point where the model's safety suddenly breaks down (a minimal sketch of this sweep follows this list).
- Harmful finetuning attacks (HFA) compromise safety by dragging the model away from the safety basin.
- This safety landscape also shows that the system prompt plays a huge role in keeping the model safe, and that this protection extends to slightly tweaked versions of the model within the safety basin.
- When we test the model’s safety with jailbreaking prompts, we see that these prompts are very sensitive to even small changes in the model's weights.
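The sweep behind these landscapes can be summarized in a few lines of PyTorch. The sketch below is illustrative only, not the repository's implementation: `model`, `alphas`, and the `eval_asr` judge (which should return the fraction of harmful prompts answered unsafely) are assumptions, and the actual pipeline batches generation across GPUs via the Makefile targets described later.

```python
import torch


def random_direction(model):
    """Sample one random direction with the same shapes as the model's weights."""
    return [torch.randn_like(p) for p in model.parameters()]


@torch.no_grad()
def safety_landscape_1d(model, alphas, eval_asr):
    """Trace attack success rate (ASR) along theta(alpha) = theta_0 + alpha * d."""
    direction = random_direction(model)
    originals = [p.detach().clone() for p in model.parameters()]
    asr_curve = []
    for alpha in alphas:
        # Perturb every weight tensor along the shared random direction.
        for p, p0, d in zip(model.parameters(), originals, direction):
            p.copy_(p0 + alpha * d)
        # `eval_asr` is a hypothetical judge: generate responses to harmful
        # prompts and return the fraction judged unsafe.
        asr_curve.append(eval_asr(model))
    # Restore the original weights.
    for p, p0 in zip(model.parameters(), originals):
        p.copy_(p0)
    return asr_curve
```

A flat, low-ASR region around alpha = 0 is the safety basin; the VISAGE score summarizes how large that region is.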
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
ShengYun Peng¹, Pin-Yu Chen², Matthew Hull¹, Duen Horng Chau¹
¹Georgia Tech, ²IBM Research
In NeurIPS 2024.
You can plot the 1D and 2D LLM landscapes and compute the VISAGE score for your own models. We use Llama2-7b-chat as an example. Please modify the YAML files under /config for customized experiments.
make .done_venv
make direction
It consumes ~27 GB of GPU memory on a single A100. The computed direction is stored at experiments/advbench/1D_random/llama2/dirs1.pt.
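For reference, a direction like this can be generated by sampling random tensors and rescaling them relative to the corresponding weight tensors. The sketch below is an assumed recipe (the checkpoint id, normalization scheme, and output layout are placeholders); check the repository's `make direction` code for the exact procedure.

```python
import os
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint id; the repository's config selects the actual model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16
)

direction = {}
for name, param in model.named_parameters():
    d = torch.randn_like(param)
    # Rescale the random tensor so its norm matches the weight tensor's norm,
    # keeping the perturbation comparable across layers (assumed scheme).
    direction[name] = d * (param.norm() / (d.norm() + 1e-12))

out_path = "experiments/advbench/1D_random/llama2/dirs1.pt"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
torch.save(direction, out_path)
```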
make landscape
Change NGPU in the Makefile to the number of GPUs on your hardware.
Change batch_size in config/dataset/default.yaml to avoid CUDA OOM.
Model generations are saved at experiments/advbench/1D_random/llama2/output.jsonl.
The landscape visualization is saved at experiments/advbench/1D_random/llama2/1D_random_llama2_landscape.png.
@article{peng2024navigating,
  title={Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models},
  author={Peng, ShengYun and Chen, Pin-Yu and Hull, Matthew and Chau, Duen Horng},
  journal={arXiv preprint arXiv:2405.17374},
  year={2024}
}
If you have any questions, feel free to open an issue or contact Anthony Peng.