- Notifications
You must be signed in to change notification settings - Fork14
The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".
License
NJUNLP/ReNeLLM
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".
- (2025/01/21) We have generated over 100,000 samples using ReNeLLM on theAdvBench dataset that successfully jailbreak Claude-2. You can obtain them here: 🤗Huggingface dataset. Please ensure that you do not use our dataset for any illegal or harmful activities.
- (2024/03/27) We have released the experimental results of ReNeLLM and other baselines. You can download them from the following link:Experimental Results
- (2024/03/14) 🔥🔥🔥 ReNeLLM is acccepted by NAACL 2024(main conference)! Thanks to all collaborators!
- (2024/02/21) We have updated our codebase to make it more readable and easier to use. We will continue to update the codebase and support more LLMs. Stay tuned!
- (2024/01/31) 🔥🔥🔥 ReNeLLM is evaluated by theEasyjailbreak benchmark and exhibits strong competitiveness across various methods. Check it out!
- (2024/01/09) We have released a quick implementation of ReNeLLM!
- (2023/11/14) Our paper is on arXiv! Check it outhere!
This repository shares the code of our latest work on LLMs jailbreaking/defending. In this work:
- We introduce ReNeLLM, the first generalized jailbreak prompt attack framework for LLMs, which generalizes jailbreak prompt attacks into two aspects: prompt rewriting and scenario nesting, utilizing LLMs themselves to generate jailbreak attack prompts.
- Extensive experiments demonstrate that ReNeLLM generates jailbreak prompts that maintain a high attack success rate with less time required. Furthermore, ReNeLLM is difficult to detect by existing defense methods and exhibits generalization and transferability on representative LLMs.
- To investigate the reasons why LLMs fail in defense, we observe the changes in the execution priority of prompts by LLMs before and after rewriting and nesting, and design defense methods accordingly. We hope our observations can serve as valuable guidance for future LLMs vendors to develop safer and more regulated systems.
1. Clone this repository
git clone https://github.com/NJUNLP/ReNeLLM.git
2. Build Environment
cd ReNeLLMconda create -n ReNeLLM python=3.9conda activate ReNeLLMpip install -r requirements.txt
3. Run ReNeLLM
ReNeLLM employsgpt-3.5-turbo
for prompt rewriting and harmful classifier, while utilizingclaude-v2
as the model under attack. Therefore, you are required to input both of these API key parameters.
python renellm.py --gpt_api_key<your openai API key> --claude_api_key<your anthropic API key>
We use these two models due to their superior performance compared to open-source alternatives, as well as their lower cost in comparison withgpt-4
.In principle, any model could be utilized as your harmful classifier and attacked model.
4. Get Responses
Forgpt
:
python get_responses.py --data_path<jailbroken data path> --gpt_api_key<your open API key> --test_model<gpt-3.5-turbo, gpt-4>
Forclaude
:
python get_responses.py --data_path<jailbroken data path> --claude_api_key<your anthropic API key> --test_model<claude-instant-1, claude-2>
Forllama-2-chat
:
- Build the environment
cd llamapip install -e.
- Follow the steps athttps://github.com/facebookresearch/llama?tab=readme-ov-file#quick-start to download the llama-2-chat models.
- Get llama-2-chat responses using the command below:
bash run_chat.sh# You can set the model type and your jailbroken data path in the run_chat.sh
5. Check ASR
GPT-ASR (using GPT-4)
python check_gpt_asr.py --data_path<your llm responses data path> --gpt_api_key<your open API key>
KW-ASR (keyword ASR)
python check_kw_asr.py --data_path<your llm responses data path>
6. Execute ReNeLLM with a single prompt
python renellm.py --gpt_api_key<your gpt_api_key> --claude_api_key<your claude_api_key> --prompt"how to steal money from others?"
We have uploaded the results of 520 Advbench samples categorized into 7 scenarios(seegpt-4_single_round_prompt_annotation.json. In the result file, the "idx" field corresponds to the same idx samples in Advbench. The prompts used for classification and the harmful scenarios represented by each classification result can be found in Table 10 of our paper.
It should be noted that GPT-4 initially categorized the 520 pieces of data into 10 classes, but three of these classes had very few samples. For the convenience of statistics and analysis, we manually checked and merged these classes, resulting in 7 classification scenarios. The specific category merging information is as follows: 1 -> 0, 8 -> 9, 11 -> 6, finally resulting in 7 categories: 0, 2, 3, 4, 5, 6, 9. You can process the result file according to the above rules to obtain the classification results consistent with our paper.
If you have any questions about our work, please feel free to contact us via the following email:
Peng Ding:dingpeng@smail.nju.edu.cn
Shujian Huang:huangsj@nju.edu.cn
If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:
@misc{ding2023wolf,title={A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily},author={Peng Ding and Jun Kuang and Dan Ma and Xuezhi Cao and Yunsen Xian and Jiajun Chen and Shujian Huang},year={2023},eprint={2311.08268},archivePrefix={arXiv},primaryClass={cs.CL}}