openai/openai-guardrails-pythonPublic

NotificationsYou must be signed in to change notification settings
Fork17
Star126

Dev/steven/nsfw docs#30

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

gabor-openai merged 4 commits intomainfromdev/steven/nsfw_docs

Oct 29, 2025

Merged

Dev/steven/nsfw docs#30

gabor-openai merged 4 commits intomainfromdev/steven/nsfw_docs

Oct 29, 2025

Conversation

Copy link

Collaborator

steven10a commentedOct 29, 2025

Adding nsfw docs and results

steven10a added4 commits

October 28, 2025 14:37

Updated prompt injection check

fae454b

Formatting changes

98ab91e

Removed legacy code

be2ced6

add nsfw docs

10b868c

CopilotAI review requested due to automatic review settings

October 29, 2025 02:02

CopilotAI reviewed

Oct 29, 2025

View reviewed changes

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR enhances the Prompt Injection Detection guardrail with improved analysis capabilities, better test coverage, and broader conversation-aware guardrail support. The changes focus on detecting malicious instructions in tool calls and tool outputs that deviate from user intent.

Key changes:

Enhanced prompt injection detection to analyze tool outputs for embedded injection directives (fake conversations, response manipulation)
Extended evaluation framework to support multiple conversation-aware guardrails beyond just prompt injection detection
Added comprehensive test coverage for various injection attack patterns and edge cases

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/guardrails/checks/text/prompt_injection_detection.py`	Enhanced detection logic with evidence field, improved prompts for tool output analysis, and updated docstrings to focus on tool calls/outputs
`src/guardrails/checks/text/llm_base.py`	Extracted`create_error_result` helper function for standardized error handling
`src/guardrails/checks/text/hallucination_detection.py`	Refactored to use new`create_error_result` helper for consistent error handling
`src/guardrails/evals/core/async_engine.py`	Extended conversation-aware support to multiple guardrails (Jailbreak, Prompt Injection), improved payload parsing to handle non-JSON strings
`src/guardrails/evals/core/types.py`	Added`conversation_history` field and`get_conversation_history` method to Context class
`tests/unit/checks/test_prompt_injection_detection.py`	Added comprehensive tests for injection patterns, assistant message handling, and edge cases
`tests/unit/evals/test_async_engine.py`	Updated test to reflect new behavior of wrapping non-JSON strings as user messages
`tests/integration/test_suite.py`	Removed redundant config fields from pipeline configuration
`tests/unit/test_resources_responses.py`	Added blank line for formatting
`src/guardrails/evals/.gitignore`	Added`PI_eval/` directory to gitignore
`mkdocs.yml`	Reorganized checks documentation alphabetically
`docs/ref/checks/nsfw.md`	Updated benchmark results with new model performance metrics

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

src/guardrails/evals/core/async_engine.py

Comment on lines +233 to 235

		# Create a minimal guardrails config for conversation-aware checks
		minimal_config= {
		"version":1,

Copy link

CopilotAIOct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The config dictionary is missing thestage_name key that was previously present. While this may be intentional cleanup, the code should ensure the minimal config structure is valid and matches whatGuardrailsAsyncOpenAI expects. Consider adding a comment explaining the minimal required structure.

Suggested change

	# Create a minimal guardrails config for conversation-aware checks
	minimal_config= {
	"version":1,
	# Create a minimal guardrails config for conversation-aware checks.
	# The minimal required structure for GuardrailsAsyncOpenAI includes:
	# - "version": config version
	# - "stage_name": name of the stage (e.g., "output")
	# - "output": { "guardrails": [ ... ] }
	minimal_config= {
	"version":1,
	"stage_name":"output",