Updated prompt injection check #27
Conversation
Pull Request Overview
This PR enhances the Prompt Injection Detection guardrail to focus exclusively on analyzing tool calls and tool outputs, while improving the evidence gathering and evaluation framework. The changes refine the security model to only flag content with direct evidence of malicious instructions, rather than inferring injection from behavioral symptoms.
Key changes:
- Updated prompt injection detection to skip assistant content messages and only analyze tool calls/outputs
- Added an `evidence` field to `PromptInjectionDetectionOutput` for capturing specific injection indicators
- Enhanced conversation history parsing to gracefully handle non-JSON data
- Refactored error handling with a shared `create_error_result` helper function
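The new `evidence` field described above can be pictured with a minimal sketch. The field names here are assumptions based on the PR description; the real model lives in `src/guardrails/checks/text/prompt_injection_detection.py` and is likely a pydantic model rather than a plain dataclass:

```python
from dataclasses import dataclass
from typing import Optional

# Hedged sketch of the detection output model; field names other than
# `evidence` are assumptions, not the library's actual API.
@dataclass
class PromptInjectionDetectionOutput:
    flagged: bool
    confidence: float
    # New field from this PR: the verbatim injection indicator, if any.
    evidence: Optional[str] = None

result = PromptInjectionDetectionOutput(
    flagged=True,
    confidence=0.92,
    evidence="Tool output contained: 'ignore previous instructions'",
)
print(result.evidence is not None)  # → True
```

The point of the field is that a flag must cite concrete evidence of a malicious instruction, rather than being inferred from behavioral symptoms alone.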
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/guardrails/checks/text/prompt_injection_detection.py` | Core logic updates: skip assistant messages, add `evidence` field, enhance system prompt with detailed injection detection criteria |
| `tests/unit/checks/test_prompt_injection_detection.py` | Comprehensive test coverage for new skip behavior, assistant message handling, and tool output injection scenarios |
| `src/guardrails/evals/core/async_engine.py` | Enhanced conversation parsing to handle plain strings and non-conversation JSON; support for the Jailbreak guardrail |
| `src/guardrails/evals/core/types.py` | Added `conversation_history` field and getter method to the `Context` class |
| `src/guardrails/checks/text/llm_base.py` | Extracted `create_error_result` helper function for consistent error handling |
| `src/guardrails/checks/text/hallucination_detection.py` | Updated to use the shared `create_error_result` helper |
| `tests/integration/test_suite.py` | Commented out multiple test cases, removed config fields |
| `src/guardrails/evals/.gitignore` | Added `PI_eval/` directory to ignore list |
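The shared error handler extracted into `llm_base.py` can be sketched roughly as follows. This is a hedged illustration; the actual signature and return type of `create_error_result` in the repository may differ:

```python
# Hedged sketch of a shared error-result helper like the create_error_result
# described above. Returning a uniform, non-flagging result lets every
# LLM-based check fail the same way instead of duplicating try/except logic.
def create_error_result(guardrail_name: str, error: Exception) -> dict:
    """Return a uniform non-flagging result when an LLM check fails."""
    return {
        "guardrail": guardrail_name,
        "flagged": False,  # fail open: an internal error is not a detection
        "error": f"{type(error).__name__}: {error}",
    }

res = create_error_result("Prompt Injection Detection", ValueError("bad JSON"))
print(res["error"])  # → ValueError: bad JSON
```

Centralizing this in `llm_base` means `hallucination_detection.py` and the other LLM checks no longer each define their own error shape.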
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
```diff
     "guardrails": [
         {
             "name": guardrail.definition.name,
             "config": (guardrail.config.__dict__ if hasattr(guardrail.config, "__dict__") else guardrail.config),
         }
         for guardrail in self.guardrails
-        if guardrail.definition.name == "Prompt Injection Detection"
+        if guardrail.definition.name in conversation_aware_names
     ],
```
Copilot AI commented on Oct 28, 2025
The configuration creation logic filters guardrails by name match with `conversation_aware_names`, but this creates a minimal config with only conversation-aware guardrails. If `self.guardrails` doesn't contain a guardrail matching the expected trigger name from `sample.expected_triggers`, the minimal config will have an empty guardrails list, which could cause the evaluation to fail silently or produce incorrect results. The filtering should ensure at least one matching guardrail exists or handle the empty case.
This is inside an `if` statement that handles that case.
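The reply above points to an existing guard around the empty case. A minimal sketch of such a guard might look like the following; the function name `build_minimal_config` and the surrounding control flow are hypothetical, only `conversation_aware_names` comes from the diff:

```python
from typing import Optional

# Names taken from the diff; the set's exact contents are an assumption.
conversation_aware_names = {"Prompt Injection Detection", "Jailbreak"}

def build_minimal_config(guardrails: list) -> Optional[dict]:
    """Build a config containing only conversation-aware guardrails."""
    selected = [
        {"name": g["name"], "config": g.get("config", {})}
        for g in guardrails
        if g["name"] in conversation_aware_names
    ]
    if not selected:
        # The empty case the review comment worries about: signal it
        # explicitly instead of evaluating against an empty guardrail list.
        return None
    return {"guardrails": selected}

print(build_minimal_config([{"name": "Hallucination Detection"}]))  # → None
```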
```diff
     """
-    normalized_messages = normalize_conversation(messages)
-    user_texts = [entry["content"] for entry in normalized_messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
+    user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
```
Copilot AI commented on Oct 28, 2025
This list comprehension will raise a `TypeError` if `entry["content"]` is not a string but is a truthy non-string type (e.g., a list or dict). The `isinstance` check happens after the value is already accessed with `entry["content"]`, but the value could be any type. Consider using `.get("content")` instead of direct access, or handle the case where content might be None before the `isinstance` check.
```diff
-user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
+user_texts = [entry.get("content") for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
```
We are receiving a normalized message list, so this is not an issue
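Beyond the normalization point made in the reply, the `TypeError` claim in the bot's comment doesn't hold even for unnormalized input: in a list comprehension the `if` clause is evaluated before the output expression, so the `isinstance` filter prevents `entry["content"]` from ever being accessed for a non-string value. A quick check (the sample messages are illustrative):

```python
# Demonstrates comprehension evaluation order: the filter runs first,
# so non-string and missing content never reaches entry["content"].
messages = [
    {"role": "user", "content": "hello"},
    {"role": "user", "content": ["not", "a", "string"]},  # filtered, no TypeError
    {"role": "assistant", "content": "hi"},               # wrong role, filtered
    {"role": "user"},  # no content key; entry.get("content") is None, filtered
]
user_texts = [
    entry["content"]
    for entry in messages
    if entry.get("role") == "user" and isinstance(entry.get("content"), str)
]
print(user_texts)  # → ['hello']
```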
gabor-openai left a comment
LGTM, thank you.
Merged ab3f458 into main
`llm_base` so all LLM-based checks use a shared error reporter, and updated other LLM checks to use it