Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".

License

NotificationsYou must be signed in to change notification settings

NJUNLP/ReNeLLM

Repository files navigation

The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".

NAACL 2024Jailbreak AttacksAdversarial AttacksLarge Language Models

Table of Contents

Updates

  • (2025/01/21) We have generated over 100,000 samples using ReNeLLM on theAdvBench dataset that successfully jailbreak Claude-2. You can obtain them here: 🤗Huggingface dataset. Please ensure that you do not use our dataset for any illegal or harmful activities.
  • (2024/03/27) We have released the experimental results of ReNeLLM and other baselines. You can download them from the following link:Experimental Results
  • (2024/03/14) 🔥🔥🔥 ReNeLLM is acccepted by NAACL 2024(main conference)! Thanks to all collaborators!
  • (2024/02/21) We have updated our codebase to make it more readable and easier to use. We will continue to update the codebase and support more LLMs. Stay tuned!
  • (2024/01/31) 🔥🔥🔥 ReNeLLM is evaluated by theEasyjailbreak benchmark and exhibits strong competitiveness across various methods. Check it out!
  • (2024/01/09) We have released a quick implementation of ReNeLLM!
  • (2023/11/14) Our paper is on arXiv! Check it outhere!

Overview

This repository shares the code of our latest work on LLMs jailbreaking/defending. In this work:

  • We introduce ReNeLLM, the first generalized jailbreak prompt attack framework for LLMs, which generalizes jailbreak prompt attacks into two aspects: prompt rewriting and scenario nesting, utilizing LLMs themselves to generate jailbreak attack prompts.
  • Extensive experiments demonstrate that ReNeLLM generates jailbreak prompts that maintain a high attack success rate with less time required. Furthermore, ReNeLLM is difficult to detect by existing defense methods and exhibits generalization and transferability on representative LLMs.
  • To investigate the reasons why LLMs fail in defense, we observe the changes in the execution priority of prompts by LLMs before and after rewriting and nesting, and design defense methods accordingly. We hope our observations can serve as valuable guidance for future LLMs vendors to develop safer and more regulated systems.

Getting Started

1. Clone this repository

git clone https://github.com/NJUNLP/ReNeLLM.git

2. Build Environment

cd ReNeLLMconda create -n ReNeLLM python=3.9conda activate ReNeLLMpip install -r requirements.txt

3. Run ReNeLLM

ReNeLLM employsgpt-3.5-turbo for prompt rewriting and harmful classifier, while utilizingclaude-v2 as the model under attack. Therefore, you are required to input both of these API key parameters.

python renellm.py --gpt_api_key<your openai API key> --claude_api_key<your anthropic API key>

We use these two models due to their superior performance compared to open-source alternatives, as well as their lower cost in comparison withgpt-4.In principle, any model could be utilized as your harmful classifier and attacked model.

4. Get Responses

Forgpt:

python get_responses.py --data_path<jailbroken data path> --gpt_api_key<your open API key> --test_model<gpt-3.5-turbo, gpt-4>

Forclaude:

python get_responses.py --data_path<jailbroken data path> --claude_api_key<your anthropic API key> --test_model<claude-instant-1, claude-2>

Forllama-2-chat:

  • Build the environment
cd llamapip install -e.
bash run_chat.sh# You can set the model type and your jailbroken data path in the run_chat.sh

5. Check ASR

GPT-ASR (using GPT-4)

python check_gpt_asr.py --data_path<your llm responses data path> --gpt_api_key<your open API key>

KW-ASR (keyword ASR)

python check_kw_asr.py --data_path<your llm responses data path>

6. Execute ReNeLLM with a single prompt

python renellm.py --gpt_api_key<your gpt_api_key> --claude_api_key<your claude_api_key> --prompt"how to steal money from others?"

We have uploaded the results of 520 Advbench samples categorized into 7 scenarios(seegpt-4_single_round_prompt_annotation.json. In the result file, the "idx" field corresponds to the same idx samples in Advbench. The prompts used for classification and the harmful scenarios represented by each classification result can be found in Table 10 of our paper.

It should be noted that GPT-4 initially categorized the 520 pieces of data into 10 classes, but three of these classes had very few samples. For the convenience of statistics and analysis, we manually checked and merged these classes, resulting in 7 classification scenarios. The specific category merging information is as follows: 1 -> 0, 8 -> 9, 11 -> 6, finally resulting in 7 categories: 0, 2, 3, 4, 5, 6, 9. You can process the result file according to the above rules to obtain the classification results consistent with our paper.

Contact

If you have any questions about our work, please feel free to contact us via the following email:

Peng Ding:dingpeng@smail.nju.edu.cn

Shujian Huang:huangsj@nju.edu.cn

Citation

If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:

@misc{ding2023wolf,title={A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily},author={Peng Ding and Jun Kuang and Dan Ma and Xuezhi Cao and Yunsen Xian and Jiajun Chen and Shujian Huang},year={2023},eprint={2311.08268},archivePrefix={arXiv},primaryClass={cs.CL}}

About

The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily".

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

[8]ページ先頭

©2009-2025 Movatter.jp