qroa/QROA

QROA: A Black-Box Query-Response Optimization Attack on LLMs
QROA, or Query-Response Optimization Attack, is an innovative and robust strategy designed to explore and exploit vulnerabilities in Large Language Models (LLMs) through black-box interactions. This method leverages optimized triggers embedded within benign-looking instructions to manipulate LLMs into generating harmful content. Developed without the need for direct model access or internal data insights, QROA operates solely via the standard input-output interface provided by LLMs. The attack's underlying techniques are inspired by advances in deep Q-learning, allowing dynamic token adjustments to maximize a reward function that aligns with the attacker's goals.

This repository is the official implementation of the paper "QROA: A Black-Box Query-Response Optimization Attack on LLMs".

Paper: https://arxiv.org/abs/2406.02044

📄 Citation

@article{jawad2024qroa,
  title={QROA: A Black-Box Query-Response Optimization Attack on LLMs},
  author={Jawad, Hussein and BRUNEL, Nicolas J-B},
  journal={arXiv preprint arXiv:2406.02044},
  year={2024}
}

📜 Abstract

Large Language Models (LLMs) have recently gained popularity, but they also raise concerns due to their potential to create harmful content if misused. This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction. QROA adds an optimized trigger to a malicious instruction to compel the LLM to generate harmful content. Unlike previous approaches, QROA does not require access to the model's logit information or any other internal data and operates solely through the standard query-response interface of LLMs. Inspired by deep Q-learning and greedy coordinate descent, the method iteratively updates tokens to maximize a designed reward function. We tested our method on various LLMs such as Vicuna, Falcon, and Mistral, achieving an Attack Success Rate (ASR) over 80%. We also tested the method against Llama2-chat, the fine-tuned version of Llama2 designed to resist jailbreak attacks, achieving good ASR with a suboptimal initial trigger seed. This study demonstrates the feasibility of generating jailbreak attacks against deployed LLMs in the public domain using black-box optimization methods, enabling more comprehensive safety testing of LLMs.
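The core loop described above (greedy coordinate descent over trigger tokens, driven only by a scalar reward from query-response access) can be illustrated on a toy problem. Everything below is a sketch: `black_box_reward` stands in for the real scoring pipeline, the vocabulary is just lowercase letters, and the repository's actual implementation additionally uses a surrogate embedding model, a memory buffer, and UCB-based selection.

```python
import random

# Toy stand-in for the target LLM's query-response interface: the attacker
# only observes a scalar reward for each trigger it queries. Here the hidden
# score simply counts character matches against a secret string -- purely
# illustrative, not the repository's reward function.
SECRET = "xyzzy"
VOCAB = "abcdefghijklmnopqrstuvwxyz"

def black_box_reward(trigger: str) -> float:
    return sum(t == s for t, s in zip(trigger, SECRET)) / len(SECRET)

def greedy_coordinate_attack(length: int = 5, epochs: int = 50, seed: int = 0) -> str:
    """Greedy coordinate descent: mutate one token slot at a time and keep
    the mutation only if the black-box reward does not decrease."""
    rng = random.Random(seed)
    trigger = [rng.choice(VOCAB) for _ in range(length)]
    best_score = black_box_reward("".join(trigger))
    for _ in range(epochs):
        pos = rng.randrange(length)          # pick one coordinate (token slot)
        candidate = trigger.copy()
        candidate[pos] = rng.choice(VOCAB)   # propose a token substitution
        score = black_box_reward("".join(candidate))
        if score >= best_score:              # accept if reward did not drop
            trigger, best_score = candidate, score
    return "".join(trigger)

print(greedy_coordinate_attack(epochs=2000))
```

With enough query budget, the optimizer recovers the hidden target even though it never sees anything but the reward, which is the property QROA relies on against real LLM endpoints.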


⚙️ Installation

  1. Clone the repository:

    git clone https://github.com/qroa/qroa.git
    cd qroa
  2. Install the required packages:

    pip install -r requirements.txt

🚀 Usage

⚔️ Running the Attack

To run the script, provide the path to the input file containing the instructions; the input file should be in CSV format. Run the script from the command line, specifying the path to the instruction file and the authentication token:

python main.py data/instructions.csv [API_AUTH_TOKEN]

Replace instructions.csv with the path to your CSV file containing the instructions, and API_AUTH_TOKEN with your actual authentication token.
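The exact column layout expected by main.py is not documented here; as an assumption, a minimal instructions file with one instruction per row would look like this (check main.py for the column name it actually reads):

```
Your first test instruction here
Your second test instruction here
```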

🧠 Supported Models

You can test the following models with QROA:

  • Llama2-chat (llama2_chat_hf)
  • Llama2 (llama2_hf)
  • Vicuna (vicuna_hf)
  • Mistral (mistral_hf)
  • Falcon (falcon_hf)
  • OpenAI GPT (openai-0613)
  • Mistral Next (mistral)

Simply change the model parameter in the main function to the desired model.

🧪 Demo and Testing Model Generation

  • Notebook Demo: Run demo.ipynb to see a demonstration of the process.
  • Notebook Analysis Experiment: Run analysis.ipynb to analyse the results and compute metrics such as the Attack Success Rate (ASR).
  • Testing Model Generation: Execute generate.py to test the generation process on custom instructions and triggers.

This script can be run from the command line as follows:

python generate.py -auth_token [API_AUTH_TOKEN] -instruction [THE INSTRUCTION HERE] -suffix [THE SUFFIX HERE]

Where:

  • auth_token: Authentication token required for accessing the model.
  • instruction: The specific instruction you want the model to follow.
  • suffix: The adversarial trigger that, when appended to the instruction, causes the LLM to obey the instruction.

📁 Output Files

The following output files are generated during the execution of the script:

Generated and validated triggers are saved in JSON format:

  • Generated Triggers: ./results/[MODEL_NAME]/triggers.json: Contains the triggers generated by the model.
  • Validated Triggers: ./results/[MODEL_NAME]/triggers_validate.json: Contains the triggers validated after applying the z test.

Logs for generation and validation processes are also available:

  • Trigger Generation Logs: ./logs/[MODEL_NAME]/logging_generator.csv: Logs the process of trigger generation.
  • Trigger Validation Logs: ./logs/[MODEL_NAME]/logging_validator.csv: Logs the process of validating the triggers with the z test.
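The validation step checks, via a z test, whether a trigger's empirical success rate significantly exceeds a threshold. The repository's exact statistic may differ, so the sketch below is only an illustration of the idea, using a one-sided normal-approximation test on a success proportion:

```python
import math
import statistics

def z_test_success_rate(successes: int, n: int, threshold: float = 0.2,
                        p_value: float = 0.05) -> bool:
    """One-sided z test: is the trigger's true success rate above `threshold`?

    successes -- number of harmful generations observed for this trigger
    n         -- number of sampled generations (cf. nb_samples_per_trigger)
    """
    p_hat = successes / n
    # Standard error under the null hypothesis H0: p = threshold.
    se = math.sqrt(threshold * (1 - threshold) / n)
    z = (p_hat - threshold) / se
    # One-sided critical value from the standard normal distribution.
    z_crit = statistics.NormalDist().inv_cdf(1 - p_value)
    return z > z_crit

# A trigger succeeding 15/20 times clearly beats a 0.2 threshold:
print(z_test_success_rate(15, 20))
```

Triggers that pass this test would be the ones written to triggers_validate.json; the threshold and p-value correspond to the `threshold` and `p_value` configuration parameters described below.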

🔧 Configuration Settings

The following table outlines the configuration settings for the JailBreak process. Each parameter plays a role in the setup and execution of the process:

| Parameter | Description |
| --- | --- |
| model | Specifies the Large Language Model (LLM) to be used for the attack, such as 'vicuna_hf', 'falcon_hf', etc. |
| apply_defense_methods | A boolean parameter that determines whether defense methods are activated to protect the model during the JailBreak process. |
| auth_token | Authentication token required for accessing the model. This token could be from Hugging Face for accessing their models, or from other providers such as OpenAI for closed-source models. |
| system_prompt | The initial message or command that initiates interaction with the LLM. |
| embedding_model_path | Path to the surrogate model's embedding layer. |
| len_coordinates | Specifies the number of tokens in the generated trigger, defining the length of the attack vector. |
| learning_rate | Learning rate for the optimizer. |
| weight_decay | Weight decay (L2 penalty) for the optimizer. |
| nb_epochs | The total number of training cycles through the dataset, during which the model learns by adjusting internal parameters. |
| batch_size | Number of training examples used per iteration to calculate the gradient and update internal model parameters. |
| scoring_type | Method used to evaluate the effectiveness of triggers. For example, 'hm' refers to a scoring model that uses a fine-tuned RoBERTa model for detecting harmful content. |
| max_generations_tokens | Maximum number of tokens that the LLM is allowed to generate in response to a query during the attack. |
| topk | The number of top-K triggers selected in each epoch, equivalent to the number of queries sent to the target LLM per epoch. |
| max_d | The maximum size of the memory buffer. |
| ucb_c | The exploration-exploitation parameter for the Upper Confidence Bound (UCB) algorithm. A higher value encourages exploration of less certain actions. |
| triggers_init | Initial triggers used as a starting point for the algorithm; these triggers pre-fill the memory buffer to avoid starting from scratch. |
| threshold | The statistical significance threshold used when validating triggers. |
| nb_samples_per_trigger | Number of samples per trigger used to statistically validate the trigger's efficiency. |
| logging_path | Path to the logging directory. |
| results_path | Path to the results directory. |
| temperature | Sampling temperature used by the LLM. |
| top_p | Top-P value for nucleus sampling used by the LLM. |
| p_value | P-value for statistical testing. |
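Taken together, a run configuration might be expressed as a Python dict like the one below. All values shown are illustrative placeholders, not the repository's recommended defaults:

```python
# Illustrative configuration sketch; every value is a placeholder.
config = {
    "model": "vicuna_hf",            # target LLM (see Supported Models)
    "apply_defense_methods": False,  # activate defenses during the attack?
    "auth_token": "YOUR_TOKEN",      # Hugging Face / OpenAI token (placeholder)
    "system_prompt": "",             # initial system message for the LLM
    "embedding_model_path": "",      # surrogate embedding layer (fill in)
    "len_coordinates": 10,           # trigger length in tokens
    "learning_rate": 0.01,
    "weight_decay": 0.0001,
    "nb_epochs": 100,
    "batch_size": 64,
    "scoring_type": "hm",            # e.g. fine-tuned RoBERTa harmfulness scorer
    "max_generations_tokens": 60,
    "topk": 10,                      # queries sent to the target per epoch
    "max_d": 1000,                   # memory buffer size
    "ucb_c": 0.1,                    # UCB exploration-exploitation constant
    "triggers_init": [],             # seed triggers to pre-fill the buffer
    "threshold": 0.2,                # success-rate threshold for validation
    "nb_samples_per_trigger": 50,
    "logging_path": "./logs/",
    "results_path": "./results/",
    "temperature": 0.9,
    "top_p": 0.6,
    "p_value": 0.05,
}
```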
