Making evals multi-turn. Updating PI sys prompt #20
Conversation
Pull Request Overview
This PR transitions the prompt injection detection system from incremental index-based tracking to a multi-turn evaluation approach that analyzes the full conversation context since the last user message. The system now evaluates all actions following each user turn, providing richer context for detecting injected instructions. The PR also updates the system prompt to better define prompt injection attacks and incorporates 1,046 real-world examples from the AgentDojo benchmark.
Key Changes:
- Removed `_injection_last_checked_index` tracking from all client classes
- Introduced `_run_incremental_prompt_injection` to evaluate conversations turn-by-turn
- Refactored `_parse_conversation_history` into `_slice_conversation_since_latest_user` for clearer intent
- Updated the system prompt with clearer definitions and evaluation criteria for prompt injection
- Integrated AgentDojo dataset examples alongside existing synthetic data
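The slicing behavior described above can be sketched as follows. This is a minimal illustration assuming a list-of-dicts message format; the actual `_slice_conversation_since_latest_user` in the PR may differ in shape and edge-case handling.

```python
from typing import Any


def slice_conversation_since_latest_user(
    conversation: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return the latest user message and every action that follows it.

    Walks the conversation backwards to find the most recent user turn,
    so the injection check sees all assistant/tool actions taken since
    the user last spoke.
    """
    for index in range(len(conversation) - 1, -1, -1):
        if conversation[index].get("role") == "user":
            return conversation[index:]
    # No user turn found: fall back to evaluating the full history.
    return conversation
```

With this shape, a check invoked after each assistant or tool step always receives the latest user intent plus the full chain of actions that followed it, rather than a single incremental message.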
Reviewed Changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/guardrails/client.py | Removed index tracking fields and methods from all client classes |
| src/guardrails/types.py | Removed injection index protocol methods from GuardrailLLMContextProto |
| src/guardrails/checks/text/prompt_injection_detection.py | Refactored to slice conversations from latest user message; updated system prompt with injection-focused criteria |
| src/guardrails/evals/core/async_engine.py | Added incremental multi-turn evaluation logic and conversation payload parsing |
| src/guardrails/evals/guardrail_evals.py | Updated import handling with fallback for direct script execution |
| src/guardrails/agents.py | Removed index tracking methods from ToolConversationContext |
| tests/unit/test_client_sync.py | Updated tests to verify absence of index tracking methods |
| tests/unit/test_client_async.py | Updated tests to verify absence of index tracking methods |
| tests/unit/test_agents.py | Updated tests to verify absence of index tracking methods |
| tests/unit/checks/test_prompt_injection_detection.py | Removed index tracking from fake context and assertions |
| tests/unit/evals/test_async_engine.py | New test file for incremental evaluation and payload parsing logic |
| docs/ref/checks/prompt_injection_detection.md | Updated benchmark description and performance metrics with AgentDojo dataset |
Pull Request Overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.
```python
if isinstance(content, str):
    return content
if isinstance(content, list):
    parts: list[str] = []
    for item in content:
        if isinstance(item, dict):
            text = item.get("text")
            if text:
                parts.append(text)
                continue
            fallback = item.get("content")
            if isinstance(fallback, str):
                parts.append(fallback)
        elif isinstance(item, str):
            parts.append(item)
        else:
            parts.append(str(item))
    return " ".join(filter(None, parts))
if content is None:
    return ""
return str(content)
```
Copilot AI commented on Oct 16, 2025
The content coercion logic does not filter out falsy non-None values (e.g., empty strings) before calling `str(item)` on line 302. This means an empty string in the list will become `''`, then pass through `filter(None, ...)` and still contribute whitespace to the joined result if there are subsequent non-empty parts. Consider checking `if item:` before appending `str(item)` to ensure consistent behavior.
I disagree with this. We already guard against that edge case with `filter(None, parts)`. Implementing this may actually cause us to hide legit values like 0 or "0".
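The reviewer's point can be checked directly. A small illustration (the `0` element is hypothetical, standing in for a non-string payload in a content list): `filter(None, ...)` already drops the falsy empty string, while `str(0) == "0"` is truthy and survives, which is exactly what an `if item:` pre-check would lose.

```python
# Hypothetical content items: an empty string and a non-string value.
items = ["", "keep", 0]

# Mirror the coercion above: pass strings through, stringify the rest.
parts = [item if isinstance(item, str) else str(item) for item in items]

# filter(None, ...) removes the falsy "" so it adds no whitespace,
# but str(0) == "0" is truthy and is kept in the joined result.
result = " ".join(filter(None, parts))
print(result)  # keep 0
```

Adding `if item:` before `parts.append(str(item))` would have discarded the `0` before stringification, silently dropping a legitimate value.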
```diff
 return _create_skip_result(
     "No LLM actions or user intent to evaluate",
     config.confidence_threshold,
     user_goal=user_intent_dict.get("most_recent_message", "N/A"),
-    action=llm_actions,
+    action=recent_messages,
     data=str(data),
 )
```
Copilot AI commented on Oct 16, 2025
The skip message 'No LLM actions or user intent to evaluate' is ambiguous because this condition only checks `user_intent_dict['most_recent_message']` being empty, not `actionable_messages`. Consider splitting this into two separate conditions or clarifying the message to indicate specifically that no user intent was found.
steven10a commented on Oct 16, 2025 (edited)
[nit] "No user intent to evaluate" would be clearer. But this is very much an edge case and should never happen. We should always have at least one user message.
```python
def _parse_conversation_payload(data: str) -> list[Any] | None:
    """Attempt to parse sample data into a conversation history list."""
    try:
        payload = json.loads(data)
    except json.JSONDecodeError:
        return None
```
Copilot AI commented on Oct 16, 2025
The docstring does not document the return type behavior. It should explicitly state that the function returns a list of conversation messages if successful, or `None` if parsing fails or the payload structure is invalid.
Also think this is a [nit]
```python
def __getattr__(name: str) -> Any:
    if name == "GuardrailEval":
        from guardrails.evals.guardrail_evals import GuardrailEval as _GuardrailEval
        return _GuardrailEval
```
Copilot AI commented on Oct 16, 2025
The lazy import in `__getattr__` does not cache the imported class, causing repeated imports on every attribute access. Consider caching `_GuardrailEval` in a module-level variable or using `sys.modules` to avoid redundant imports.
Suggested change:

```diff
-def __getattr__(name: str) -> Any:
-    if name == "GuardrailEval":
-        from guardrails.evals.guardrail_evals import GuardrailEval as _GuardrailEval
-        return _GuardrailEval
+_cached_GuardrailEval = None
+
+def __getattr__(name: str) -> Any:
+    global _cached_GuardrailEval
+    if name == "GuardrailEval":
+        if _cached_GuardrailEval is None:
+            from guardrails.evals.guardrail_evals import GuardrailEval as _GuardrailEval
+            _cached_GuardrailEval = _GuardrailEval
+        return _cached_GuardrailEval
```
I think the evals tool could use a refresh and restructure. This is okay for now; I will follow up with an eval-specific PR.
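An alternative to the module-level cache variable is to let module `__getattr__` (PEP 562) write the resolved attribute into the module's own namespace, after which Python's normal attribute lookup finds it and `__getattr__` is never called again for that name. A self-contained sketch using a dynamically built throwaway module (names like `lazy_demo` and the counter are illustration only):

```python
import sys
import types

# Build a throwaway module whose __getattr__ lazily creates an attribute
# and caches it by storing it in the module namespace (PEP 562 fallback).
mod = types.ModuleType("lazy_demo")
mod.import_count = 0


def _module_getattr(name, _mod=mod):
    if name == "GuardrailEval":
        _mod.import_count += 1  # stands in for the expensive import
        value = type("GuardrailEval", (), {})
        setattr(_mod, name, value)  # cached: __getattr__ won't fire again
        return value
    raise AttributeError(name)


mod.__getattr__ = _module_getattr
sys.modules["lazy_demo"] = mod

import lazy_demo

first = lazy_demo.GuardrailEval   # triggers __getattr__, runs the "import"
second = lazy_demo.GuardrailEval  # served from the module dict directly
```

Writing into the module dict means no `global` bookkeeping and zero overhead on every access after the first, which is the same effect the suggested `_cached_GuardrailEval` variable achieves.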
- **Data type**: Internal synthetic dataset simulating realistic agent traces
- **Test scenarios**: Multi-turn conversations with function calls and tool outputs
- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suites combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
In a follow-up PR, could you include a link to this dataset?
Merged a251298 into main.