Updated prompt injection check #27


Merged
gabor-openai merged 5 commits into main from dev/steve/PI_eval on Oct 29, 2025

Conversation

@steven10a
Collaborator

steven10a commented Oct 28, 2025 (edited)

  • Updated the system prompt of the prompt injection guardrail for better performance
  • Small change to llm_base so all LLM-based checks use a shared error reporter, and updated the other LLM checks to use it (a rough sketch follows below)
  • Updated the eval tool to properly parse multi-turn input data
  • Updated evals with the results of V2
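The shared error reporter is only described, not shown, here; a minimal sketch of what such a helper could look like, assuming a result type with a tripwire flag and an info dict (the real signature and fields in llm_base.py may differ):

from dataclasses import dataclass, field
from typing import Any


@dataclass
class GuardrailResult:
    tripwire_triggered: bool
    info: dict[str, Any] = field(default_factory=dict)


def create_error_result(guardrail_name: str, error: Exception) -> GuardrailResult:
    """Return a non-triggering result that records why an LLM-based check failed."""
    return GuardrailResult(
        tripwire_triggered=False,
        info={
            "guardrail_name": guardrail_name,
            "error": f"{type(error).__name__}: {error}",
        },
    )


# Each LLM-based check could then report failures uniformly, e.g.:
# except Exception as exc:
#     return create_error_result("Prompt Injection Detection", exc)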

Copilot AI review requested due to automatic review settings on October 28, 2025 18:43

Copilot AI left a comment


Pull Request Overview

This PR enhances the Prompt Injection Detection guardrail to focus exclusively on analyzing tool calls and tool outputs, while improving the evidence gathering and evaluation framework. The changes refine the security model to only flag content with direct evidence of malicious instructions, rather than inferring injection from behavioral symptoms.

Key changes:

  • Updated prompt injection detection to skip assistant content messages and only analyze tool calls/outputs
  • Added an evidence field to PromptInjectionDetectionOutput for capturing specific injection indicators (see the sketch below)
  • Enhanced conversation history parsing to gracefully handle non-JSON data
  • Refactored error handling with a shared create_error_result helper function
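A rough sketch of how the new evidence field might sit on the output model, assuming a pydantic schema; field names other than evidence are illustrative, not the repository's exact definition:

from typing import Optional

from pydantic import BaseModel, Field


class PromptInjectionDetectionOutput(BaseModel):
    """Illustrative output schema; only the evidence field is new in this PR."""

    flagged: bool = Field(description="True only when there is direct evidence of injected instructions.")
    confidence: float = Field(ge=0.0, le=1.0, description="Model confidence in the judgment.")
    evidence: Optional[str] = Field(default=None, description="Verbatim snippet of the suspicious instruction, if any.")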

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

  • src/guardrails/checks/text/prompt_injection_detection.py: core logic updates (skip assistant messages, add evidence field, enhance the system prompt with detailed injection detection criteria)
  • tests/unit/checks/test_prompt_injection_detection.py: comprehensive test coverage for the new skip behavior, assistant message handling, and tool output injection scenarios
  • src/guardrails/evals/core/async_engine.py: enhanced conversation parsing to handle plain strings and non-conversation JSON (sketched below), plus support for the Jailbreak guardrail
  • src/guardrails/evals/core/types.py: added conversation_history field and getter method to the Context class
  • src/guardrails/checks/text/llm_base.py: extracted the create_error_result helper function for consistent error handling
  • src/guardrails/checks/text/hallucination_detection.py: updated to use the shared create_error_result helper
  • tests/integration/test_suite.py: commented out multiple test cases, removed config fields
  • src/guardrails/evals/.gitignore: added PI_eval/ directory to the ignore list
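The "plain strings and non-conversation JSON" handling could look roughly like the following sketch; the helper name and exact fallback behavior are assumptions, and async_engine.py's actual code may be structured differently:

import json
from typing import Any


def parse_conversation_history(raw: str) -> list[dict[str, Any]]:
    """Best-effort parse of eval input into a conversation list."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Not JSON at all: treat the raw text as a single user turn.
        return [{"role": "user", "content": raw}]
    if isinstance(data, list) and all(isinstance(turn, dict) and "role" in turn for turn in data):
        return data  # Already a multi-turn conversation.
    # Valid JSON, but not a conversation (e.g. a bare string or object).
    return [{"role": "user", "content": raw}]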



Copilot AI left a comment


Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.



Comment on lines 237 to 244
"guardrails": [
{
"name":guardrail.definition.name,
"config": (guardrail.config.__dict__ifhasattr(guardrail.config,"__dict__")elseguardrail.config),
}
forguardrailinself.guardrails
ifguardrail.definition.name=="Prompt Injection Detection"
ifguardrail.definition.nameinconversation_aware_names
],


The configuration creation logic filters guardrails by name match with conversation_aware_names, but this creates a minimal config with only conversation-aware guardrails. If self.guardrails doesn't contain a guardrail matching the expected trigger name from sample.expected_triggers, the minimal_config will have an empty guardrails list, which could cause the evaluation to fail silently or produce incorrect results. The filtering should ensure at least one matching guardrail exists or handle the empty case.
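A guard for the empty case described here might look like the following; this is a sketch with assumed names, not the evaluation engine's actual code:

from typing import Any


def build_minimal_config(guardrails: list[Any], conversation_aware_names: set[str]) -> dict[str, Any]:
    """Build a config limited to conversation-aware guardrails, refusing an empty result."""
    selected = [
        {
            "name": g.definition.name,
            "config": getattr(g.config, "__dict__", g.config),
        }
        for g in guardrails
        if g.definition.name in conversation_aware_names
    ]
    if not selected:
        raise ValueError("No conversation-aware guardrails configured; cannot run this evaluation sample.")
    return {"guardrails": selected}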

Collaborator (Author)


This is in an if statement that handles that case

"""
normalized_messages=normalize_conversation(messages)
user_texts= [entry["content"]forentryinnormalized_messagesifentry.get("role")=="user"andisinstance(entry.get("content"),str)]
user_texts= [entry["content"]forentryinmessagesifentry.get("role")=="user"andisinstance(entry.get("content"),str)]


This list comprehension will raise a TypeError if entry["content"] is not a string but is a truthy non-string type (e.g., a list or dict). The isinstance check happens after the value is already accessed with entry["content"], but the value could be any type. Consider using .get("content") instead of direct access, or handle the case where content might be None before the isinstance check.

Suggested change
user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
user_texts = [entry.get("content") for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]

Collaborator (Author)


We are receiving a normalized message list, so this is not an issue

Collaborator

@gabor-openai gabor-openai left a comment


LGTM thank you

@gabor-openai gabor-openai merged commit ab3f458 into main on Oct 29, 2025
3 checks passed
@gabor-openai gabor-openai deleted the dev/steve/PI_eval branch on October 29, 2025 16:54

Reviewers

Copilot code review: Copilot left review comments

@gabor-openai gabor-openai approved these changes

3 participants

@steven10a, @gabor-openai
