Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Making evals multi-turn. Updating PI sys prompt#20

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
gabor-openai merged 6 commits intomainfromdev/steven/eval_update
Oct 16, 2025

Conversation

@steven10a
Copy link
Collaborator

@steven10asteven10a commentedOct 15, 2025
edited
Loading

  • Updated evals to run in a multi-turn fashion
  • Updated the Prompt Injection check system prompt
  • Eliminated the use of last checked index. Instead it incrementally looks at all the actions since the last user message. This provides more context when making the judgement
  • Sourced 1046 agent trace examples formAgentDojo (a common PI benchmark).
  • Combined with our original synthetic data and re-running evals

@steven10asteven10a marked this pull request as ready for reviewOctober 16, 2025 00:48
CopilotAI review requested due to automatic review settingsOctober 16, 2025 00:48
Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR transitions the prompt injection detection system from incremental index-based tracking to a multi-turn evaluation approach that analyzes the full conversation context since the last user message. The system now evaluates all actions following each user turn, providing richer context for detecting injected instructions. The PR also updates the system prompt to better define prompt injection attacks and incorporates 1,046 real-world examples from the AgentDojo benchmark.

Key Changes:

  • Removed_injection_last_checked_index tracking from all client classes
  • Introduced_run_incremental_prompt_injection to evaluate conversations turn-by-turn
  • Refactored_parse_conversation_history into_slice_conversation_since_latest_user for clearer intent
  • Updated the system prompt with clearer definitions and evaluation criteria for prompt injection
  • Integrated AgentDojo dataset examples alongside existing synthetic data

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
FileDescription
src/guardrails/client.pyRemoved index tracking fields and methods from all client classes
src/guardrails/types.pyRemoved injection index protocol methods from GuardrailLLMContextProto
src/guardrails/checks/text/prompt_injection_detection.pyRefactored to slice conversations from latest user message; updated system prompt with injection-focused criteria
src/guardrails/evals/core/async_engine.pyAdded incremental multi-turn evaluation logic and conversation payload parsing
src/guardrails/evals/guardrail_evals.pyUpdated import handling with fallback for direct script execution
src/guardrails/agents.pyRemoved index tracking methods from ToolConversationContext
tests/unit/test_client_sync.pyUpdated tests to verify absence of index tracking methods
tests/unit/test_client_async.pyUpdated tests to verify absence of index tracking methods
tests/unit/test_agents.pyUpdated tests to verify absence of index tracking methods
tests/unit/checks/test_prompt_injection_detection.pyRemoved index tracking from fake context and assertions
tests/unit/evals/test_async_engine.pyNew test file for incremental evaluation and payload parsing logic
docs/ref/checks/prompt_injection_detection.mdUpdated benchmark description and performance metrics with AgentDojo dataset

Tip: Customize your code reviews with copilot-instructions.md.Create the file orlearn how to get started.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md.Create the file orlearn how to get started.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.


Tip: Customize your code reviews with copilot-instructions.md.Create the file orlearn how to get started.

Comment on lines +286 to +309
ifisinstance(content,str):
returncontent

ifisinstance(content,list):
parts:list[str]= []
foritemincontent:
ifisinstance(item,dict):
text=item.get("text")
iftext:
parts.append(text)
continue
fallback=item.get("content")
ifisinstance(fallback,str):
parts.append(fallback)
elifisinstance(item,str):
parts.append(item)
else:
parts.append(str(item))
return" ".join(filter(None,parts))

ifcontentisNone:
return""

returnstr(content)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The content coercion logic does not filter out falsy non-None values (e.g., empty strings) before callingstr(item) on line 302. This means an empty string in the list will become'', then pass throughfilter(None, ...) and still contribute whitespace to the joined result if there are subsequent non-empty parts. Consider checkingif item: before appendingstr(item) to ensure consistent behavior.

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I disagree with this. We already guard against that edge case withfilter(None, part). Implementing this may actually cause us to hide legit values like 0 or "0".

Comment on lines 187 to +193
return_create_skip_result(
"No LLM actions or user intent to evaluate",
config.confidence_threshold,
user_goal=user_intent_dict.get("most_recent_message","N/A"),
action=llm_actions,
action=recent_messages,
data=str(data),
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The skip message 'No LLM actions or user intent to evaluate' is ambiguous because this condition only checksuser_intent_dict['most_recent_message'] being empty, notactionable_messages. Consider splitting this into two separate conditions or clarifying the message to indicate specifically that no user intent was found.

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

@steven10asteven10aOct 16, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nit]. "No user intent to evaluate" would be clearer. But this is every much an edge case and should never happen. We should always have at least 1 user message.

Comment on lines +43 to +48
def_parse_conversation_payload(data:str)->list[Any]|None:
"""Attempt to parse sample data into a conversation history list."""
try:
payload=json.loads(data)
exceptjson.JSONDecodeError:
returnNone

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The docstring does not document the return type behavior. It should explicitly state that the function returns a list of conversation messages if successful, orNone if parsing fails or the payload structure is invalid.

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Also think this is a [nit]

Comment on lines +39 to +43
def__getattr__(name:str)->Any:
ifname=="GuardrailEval":
fromguardrails.evals.guardrail_evalsimportGuardrailEvalas_GuardrailEval

return_GuardrailEval

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The lazy import in__getattr__ does not cache the imported class, causing repeated imports on every attribute access. Consider caching_GuardrailEval in a module-level variable or usingsys.modules to avoid redundant imports.

Suggested change
def__getattr__(name:str)->Any:
ifname=="GuardrailEval":
fromguardrails.evals.guardrail_evalsimportGuardrailEvalas_GuardrailEval
return_GuardrailEval
_cached_GuardrailEval=None
def__getattr__(name:str)->Any:
global_cached_GuardrailEval
ifname=="GuardrailEval":
if_cached_GuardrailEvalisNone:
fromguardrails.evals.guardrail_evalsimportGuardrailEvalas_GuardrailEval
_cached_GuardrailEval=_GuardrailEval
return_cached_GuardrailEval

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think the evals tool could use a refresh and restructure. This is okay for now and I will follow up with an eval specific PR

-**Data type**:Internal synthetic dataset simulating realistic agent traces
-**Test scenarios**: Multi-turn conversations with function calls and tool outputs
-**Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
-**AgentDojo dataset**:1,046 samples from AgentDojo's workspace, travel, banking, and Slack suite combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

In a follow-up PR, could you include a link to this dataset?

@gabor-openaigabor-openai merged commita251298 intomainOct 16, 2025
3 checks passed
@gabor-openaigabor-openai deleted the dev/steven/eval_update branchOctober 16, 2025 17:40
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

Copilot code reviewCopilotCopilot left review comments

@gabor-openaigabor-openaigabor-openai approved these changes

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

3 participants

@steven10a@gabor-openai

[8]ページ先頭

©2009-2025 Movatter.jp