poloclub/llm-landscape
NeurIPS'24 - LLM Safety Landscape
You can visualize the safety and capability landscapes of your own LLM!
- Plot the safety basin of your own model: if you make small, random perturbations to the model's weights, it stays as safe as the original model within a certain range. Once these perturbations grow large enough, there is a tipping point where the model's safety suddenly breaks down (a minimal sketch of this sweep follows this list).
- Harmful finetuning attacks (HFA) compromise safety by dragging the model away from the safety basin.
- This safety landscape also shows that the system prompt plays a huge role in keeping the model safe, and that this protection extends to slightly tweaked versions of the model within the safety basin.
- When we test the model’s safety with jailbreaking prompts, we see that these prompts are very sensitive to even small changes in the model's weights.
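The sweep behind these landscapes can be summarized in a few lines of PyTorch. The sketch below is illustrative only, not the repository's implementation: `model`, `alphas`, and the `eval_asr` judge (which should return the fraction of harmful prompts answered unsafely) are assumptions, and the actual pipeline batches generation across GPUs via the Makefile targets described later.

```python
import torch


def random_direction(model):
    """Sample one random direction with the same shapes as the model's weights."""
    return [torch.randn_like(p) for p in model.parameters()]


@torch.no_grad()
def safety_landscape_1d(model, alphas, eval_asr):
    """Trace attack success rate (ASR) along theta(alpha) = theta_0 + alpha * d."""
    direction = random_direction(model)
    originals = [p.detach().clone() for p in model.parameters()]
    asr_curve = []
    for alpha in alphas:
        # Perturb every weight tensor along the shared random direction.
        for p, p0, d in zip(model.parameters(), originals, direction):
            p.copy_(p0 + alpha * d)
        # `eval_asr` is a hypothetical judge: generate responses to harmful
        # prompts and return the fraction judged unsafe.
        asr_curve.append(eval_asr(model))
    # Restore the original weights.
    for p, p0 in zip(model.parameters(), originals):
        p.copy_(p0)
    return asr_curve
```

A flat, low-ASR region around alpha = 0 is the safety basin; the VISAGE score summarizes how large that region is.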
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
ShengYun Peng¹, Pin-Yu Chen², Matthew Hull¹, Duen Horng Chau¹
¹Georgia Tech, ²IBM Research
In NeurIPS 2024.
You can plot the 1D and 2D LLM landscapes and compute the VISAGE score for your own models. We use Llama2-7b-chat as an example. Please modify the YAML files under /config for customized experiments.
make .done_venv
make direction
It consumes ~27 GB of GPU memory on a single A100. The computed direction is stored at experiments/advbench/1D_random/llama2/dirs1.pt.
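For reference, a direction like this can be generated by sampling random tensors and rescaling them relative to the corresponding weight tensors. The sketch below is an assumed recipe (the checkpoint id, normalization scheme, and output layout are placeholders); check the repository's `make direction` code for the exact procedure.

```python
import os
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint id; the repository's config selects the actual model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16
)

direction = {}
for name, param in model.named_parameters():
    d = torch.randn_like(param)
    # Rescale the random tensor so its norm matches the weight tensor's norm,
    # keeping the perturbation comparable across layers (assumed scheme).
    direction[name] = d * (param.norm() / (d.norm() + 1e-12))

out_path = "experiments/advbench/1D_random/llama2/dirs1.pt"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
torch.save(direction, out_path)
```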
make landscape
Change NGPU in the Makefile to the number of GPUs on your hardware.
Change batch_size in config/dataset/default.yaml to avoid CUDA OOM.
Model generations are saved at experiments/advbench/1D_random/llama2/output.jsonl.
The landscape visualization is saved at experiments/advbench/1D_random/llama2/1D_random_llama2_landscape.png.
@article{peng2024navigating,
  title={Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models},
  author={Peng, ShengYun and Chen, Pin-Yu and Hull, Matthew and Chau, Duen Horng},
  journal={arXiv preprint arXiv:2405.17374},
  year={2024}
}
If you have any questions, feel free to open an issue or contact Anthony Peng.