Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Use Presidio for masking#40

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
gabor-openai merged 7 commits intomainfromdev/steven/pii_updates
Nov 6, 2025
Merged

Conversation

@steven10a
Copy link
Collaborator

@steven10asteven10a commentedNov 3, 2025
edited
Loading

Using Presidio's built in AnonymizerEngine to handle the PII masking

  • Provides a more robust solution that handles overlapping entities
  • Removechecked_text field from checks that don't use them

Made PII detection more robust against obfuscation attempts

  • Implemented Unicode normalization
  • Added the ability to decodeBase64,URL, andHex encoded content for entity detection
  • Accept an optional parameterdetect_encoded_pii that can be set via the config to turn this on

Added custom identifiers to Presidio

  • CVV/BIC detection and email-in-URL contexts

Updated docs and added more tests

@steven10asteven10a marked this pull request as ready for reviewNovember 4, 2025 22:20
CopilotAI review requested due to automatic review settingsNovember 4, 2025 22:20
Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR refactors thechecked_text field handling across guardrails and significantly enhances the PII detection guardrail with advanced security features.

Key changes:

  • Removeschecked_text from guardrails that don't modify text (moderation, keywords, URLs, secret keys, etc.)
  • Keepschecked_text only in PII guardrail where text is actively masked
  • Adds advanced PII detection: Unicode normalization, encoded PII detection (Base64/URL/hex), custom CVV/BIC_SWIFT recognizers
  • Refactors_apply_preflight_modifications to use Presidio's anonymizer engine directly for better performance and correctness

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.

Show a summary per file
FileDescription
src/guardrails/checks/text/pii.pyMajor enhancement with Unicode normalization, encoded PII detection, custom recognizers (CVV, BIC_SWIFT), and Presidio anonymizer integration
src/guardrails/_base_client.pyRefactored PII masking to use guardrail'schecked_text directly and Presidio for structured content; removedsummary_text type
src/guardrails/checks/text/moderation.pyRemovedchecked_text field (no text modification)
src/guardrails/checks/text/keywords.pyRemovedchecked_text field (no text modification)
src/guardrails/checks/text/urls.pyRemovedchecked_text field (no text modification)
src/guardrails/checks/text/secret_keys.pyRemovedchecked_text field (no text modification)
src/guardrails/checks/text/prompt_injection_detection.pyRemovedchecked_text field (no text modification)
src/guardrails/checks/text/hallucination_detection.pyRemovedchecked_text parameter from error result creation
src/guardrails/checks/text/llm_base.pyRemovedchecked_text parameter fromcreate_error_result function
tests/unit/test_base_client.pyUpdated tests to reflect new PII guardrail result structure and removedsummary_text type
tests/unit/checks/test_pii.pyAdded comprehensive tests for Unicode normalization and encoded PII detection
tests/unit/checks/test_keywords.pyRemoved assertion onchecked_text field
tests/unit/test_safety_identifier.pyRemoved trailing blank line
src/guardrails/utils/safety_identifier.pyRemoved trailing blank line
pyproject.tomlAddedpresidio-anonymizer>=2.2.360 dependency
examples/basic/pii_mask_example.pyAdded CVV, BIC_SWIFT entities and enabled encoded PII detection
docs/ref/checks/*.mdUpdated documentation to removechecked_text field references where appropriate

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

len(candidate_lower)>=3
andany(# Any 3-char chunk overlaps
candidate_lower[i :i+3]indetected_lower
foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency
Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The overlap detection logic performs substring matching in a loop that could be expensive for large texts. For a candidate of length N, this generates N/2 substring checks. Consider using a more efficient algorithm like rolling hash or KMP for substring matching, or at least increase the step size from 2 to a larger value like 4 or 5 to reduce iterations while still catching meaningful overlaps.

Suggested change
foriinrange(0,len(candidate_lower)-2,2)# Step by2 for efficiency
foriinrange(0,len(candidate_lower)-2,4)# Step by4 for better efficiency

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. We are only looking at detected entities for overlap (emails, ssn, location, etc...) so the strings will be relatively small and this is not an issue


# Patterns for detecting encoded content
# Note: Hex must be checked BEFORE Base64 since hex strings can match Base64 pattern
_HEX_PATTERN=re.compile(r"\b[0-9a-fA-F]{24,}\b")# Reduced from 32 to 24 (12 bytes min)
Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The hex pattern with a minimum of 24 characters will match many non-PII hex strings (UUIDs without dashes are 32 chars, but legitimate hex data like color codes or short hashes could be 24+ chars). This could lead to false positives where legitimate hex data is incorrectly flagged as encoded PII. Consider adding context-aware matching (e.g., preceded by 'hex:', 'encoded:', etc.) or increasing the minimum length to reduce false positives.

Suggested change
_HEX_PATTERN=re.compile(r"\b[0-9a-fA-F]{24,}\b")# Reduced from 32 to 24 (12 bytes min)
_HEX_PATTERN=re.compile(
r"(?:\b(?:hex:|encoded:|0x)[ ]*)[0-9a-fA-F]{24,}\b"
)# Only match hex strings with context prefix to reduce false positives

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This feature is opt-in for heightened security so higher false positives is acceptable. It's purpose is to detect intentional obfuscation, so checking context wouldn't work.

Comment on lines +537 to +559
# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
url_decoded=urllib.parse.unquote(decoded_text)

# If URL decoding changed the text, track encoded spans for masking
ifurl_decoded!=decoded_text:
# Find URL-encoded spans in the ORIGINAL text for masking purposes
formatchin_URL_ENCODED_PATTERN.finditer(text):
ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans):
continue

decoded_chunk=urllib.parse.unquote(match.group())
ifdecoded_chunk!=match.group():
candidates.append(
EncodedCandidate(
encoded_text=match.group(),
decoded_text=decoded_chunk,
encoding_type="url",
start=match.start(),
end=match.end(),
)
)
decoded_text=url_decoded
Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The function callsurllib.parse.unquote on the entiredecoded_text after Base64/hex replacements have been made. This means if decoded_text contains a legitimate%XX sequence that was part of a Base64 decode (not URL encoding), it will be incorrectly decoded again. For example, if a Base64 string decodes to 'test%20value', the %20 will be converted to a space, which may not be the intended behavior. Consider tracking which parts are URL-encoded separately before applying unquote.

Suggested change
# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
url_decoded=urllib.parse.unquote(decoded_text)
# If URL decoding changed the text, track encoded spans for masking
ifurl_decoded!=decoded_text:
# Find URL-encoded spans in the ORIGINAL text for masking purposes
formatchin_URL_ENCODED_PATTERN.finditer(text):
ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans):
continue
decoded_chunk=urllib.parse.unquote(match.group())
ifdecoded_chunk!=match.group():
candidates.append(
EncodedCandidate(
encoded_text=match.group(),
decoded_text=decoded_chunk,
encoding_type="url",
start=match.start(),
end=match.end(),
)
)
decoded_text=url_decoded
# URL decode only actual URL-encoded spans (handles partial encodings like %6a%61%6e%65%40securemail.net)
# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
url_candidates= []
# Find URL-encoded spans in the ORIGINAL text for masking purposes
formatchin_URL_ENCODED_PATTERN.finditer(text):
ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans):
continue
decoded_chunk=urllib.parse.unquote(match.group())
ifdecoded_chunk!=match.group():
url_candidates.append(
EncodedCandidate(
encoded_text=match.group(),
decoded_text=decoded_chunk,
encoding_type="url",
start=match.start(),
end=match.end(),
)
)
# Replace URL-encoded spans in decoded_text (from last to first to avoid offset issues)
url_candidates.sort(key=lambdac:c.start,reverse=True)
forcandidateinurl_candidates:
decoded_text=decoded_text[:candidate.start]+candidate.decoded_text+decoded_text[candidate.end :]
candidates.append(candidate)

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This is acceptable


iffound_entities:
# Mask the encoded version in original text
entity_marker=f"<{next(iter(found_entities))}_ENCODED>"
Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

When multiple entity types overlap in the same encoded chunk, only the first entity type fromfound_entities is used for the marker. This could be misleading if an encoded string contains both an email and a phone number - only one will be indicated in the mask. Consider using all found entity types (e.g.,<EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation.

Suggested change
entity_marker=f"<{next(iter(found_entities))}_ENCODED>"
entity_marker=f"<{'|'.join(sorted(found_entities))}_ENCODED>"

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. The entire encoded span is still being masked before being sent to the LLM. All of the detected entities are still being listed in the response object if a developer wants to see them.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.


💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.


💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.


💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

Comment on lines +702 to +712
has_overlap= (
candidate_lowerindetected_lower# Candidate is substring of detection
ordetected_lowerincandidate_lower# Detection is substring of candidate
or (
len(candidate_lower)>=3
andany(# Any 3-char chunk overlaps
candidate_lower[i :i+3]indetected_lower
foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency
)
)
)
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nitpick] The overlap detection logic in_mask_encoded_pii uses a nested loop with substring checks that could be O(n*m) in the worst case. For each candidate, it checks against all analyzer results, and within that, performs chunked substring matching. Consider using more efficient string matching algorithms (e.g., KMP or Boyer-Moore) for the chunked overlap detection, or pre-computing a set of 3-character chunks from detections for O(1) lookup.

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. Already addressed this comment. Checking small entities so this won't be an issue


iffound_entities:
# Mask the encoded version in original text
entity_marker=f"<{next(iter(found_entities))}_ENCODED>"
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

When multiple entity types are found in a single encoded candidate (e.g., an encoded string containing both EMAIL_ADDRESS and PHONE_NUMBER), only the first entity type from the set is used for the marker. This could lead to misleading masking labels. Consider either using all entity types in the marker (e.g.,<EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation that only one entity type is shown when multiple are detected.

Suggested change
entity_marker=f"<{next(iter(found_entities))}_ENCODED>"
entity_types_str="|".join(sorted(found_entities))
entity_marker=f"<{entity_types_str}_ENCODED>"

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. Already addressed this comment

Comment on lines +322 to +326
ifdetected_valueincandidate.decoded_text:
# Mask the encoded version
entity_marker=f"<{entity_type}_ENCODED>"
masked=masked[:candidate.start]+entity_marker+masked[candidate.end :]
break
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The conditionif detected_value in candidate.decoded_text: may fail to find the correct candidate when multiple candidates exist or when the detected value spans multiple candidates. This could result in encoded PII not being masked. Consider checking overlap ranges more precisely using start/end positions or ensuring that the candidate list is properly sorted and all overlaps are checked.

Suggested change
ifdetected_valueincandidate.decoded_text:
# Mask the encoded version
entity_marker=f"<{entity_type}_ENCODED>"
masked=masked[:candidate.start]+entity_marker+masked[candidate.end :]
break
# Check if detection overlaps candidate's decoded text range
candidate_decoded_start=getattr(candidate,"decoded_start",None)
candidate_decoded_end=getattr(candidate,"decoded_end",None)
ifcandidate_decoded_startisnotNoneandcandidate_decoded_endisnotNone:
if (
result.start<candidate_decoded_end
andresult.end>candidate_decoded_start
):
# Mask the encoded version
entity_marker=f"<{entity_type}_ENCODED>"
masked=masked[:candidate.start]+entity_marker+masked[candidate.end :]
break

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This is only with structured content which is checking each part individually. For unstructured content we have more sophisticated overlap checking

# CVV/CVC recognizer (3-4 digits, often near credit card context)
cvv_pattern=Pattern(
name="cvv_pattern",
regex=r"\b(?:cvv|cvc|security\s*code|card\s*code)[:\s=]*(\d{3,4})\b",
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nitpick] The CVV pattern requires a label prefix ('cvv', 'cvc', 'security code', 'card code') before the digits. This means standalone 3-4 digit numbers won't be detected as CVV codes, which could miss obfuscated CVV values. While this reduces false positives, it creates a security gap. Consider documenting this limitation or adding a configurable option for stricter CVV detection that catches standalone 3-4 digit sequences in financial contexts.

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. I am okay with this simple approach and we don't want to over flag 3 digit numbers

OpenAIResponseType=Union[Completion,ChatCompletion,ChatCompletionChunk,Response]# noqa: UP007

# Text content types recognized in message content parts
_TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text"}
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The removal of 'summary_text' from_TEXT_CONTENT_TYPES means that content parts with type 'summary_text' will no longer have PII masking applied or text extracted. If 'summary_text' is still a valid content type in the system, this change could lead to PII leaks. Ensure this removal is intentional and that 'summary_text' is truly deprecated, or document why it should not be processed.

Suggested change
_TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text"}
_TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text","summary_text"}

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI.summary_text is not a valid input


# Should detect and mask the obfuscated email
assert"<EMAIL_ADDRESS>"inmodified[0]["content"][0]["text"]# noqa: S101
assert"@"notinmodified[0]["content"][0]["text"]or"@"notinmodified[0]["content"][0]["text"]# noqa: S101
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The assertionassert '@' not in modified[0]['content'][0]['text'] or '@' not in modified[0]['content'][0]['text'] will pass if EITHER character is absent, which means the test passes even if the fullwidth '@' (@) is still present. This should useand instead ofor to ensure both the ASCII and fullwidth '@' characters are properly masked.

Suggested change
assert"@"notinmodified[0]["content"][0]["text"]or"@"notinmodified[0]["content"][0]["text"]# noqa: S101
assert"@"notinmodified[0]["content"][0]["text"]and"@"notinmodified[0]["content"][0]["text"]# noqa: S101

Copilot uses AI. Check for mistakes.
Comment on lines +459 to +466
try:
decoded_bytes=base64.b64decode(text,validate=True)
# Security: Fail closed - reject content > 10KB to prevent memory DoS and PII bypass
iflen(decoded_bytes)>10_000:
msg= (
f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
)
raiseValueError(msg)
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The 10KB size limit for decoded content is enforced by raising a ValueError, but this exception is not caught in the calling code (_build_decoded_text), which could cause the entire PII check to fail rather than gracefully skipping oversized encoded content. Consider catching this ValueError and logging a warning instead, or document that requests with large encoded payloads will be rejected entirely.

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

@steven10asteven10aNov 5, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. We want to fail if there is a massive encoded file. Checking for encoded content is opt-in so users can not default to not checking encoded content if they wish

Comment on lines 290 to 291
decoded_text=normalized

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This assignment to 'decoded_text' is unnecessary as it isredefined before this value is used.

Suggested change
decoded_text = normalized

Copilot uses AI. Check for mistakes.
Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.


💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.


# Zero-width and invisible Unicode characters that can be used to bypass detection
_ZERO_WIDTH_CHARS=re.compile(
r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner, BOM
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The comment mentions 'BOM' but the actual character \ufeff is the Zero Width No-Break Space, not the Byte Order Mark. While \ufeff can function as a BOM when at the beginning of a file, in the middle of text it's a zero-width no-break space. The comment should clarify this to avoid confusion: 'Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space'.

Suggested change
r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner, BOM
r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner,zero-width no-break space (\ufeff;BOM only at start of file)

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nit]

Comment on lines +463 to +466
msg= (
f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
)
raiseValueError(msg)
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The error message leaks the size of decoded content, which could help attackers fingerprint the system or understand size limits for crafted payloads. Consider using a generic error message without the specific size:'Base64 decoded content exceeds maximum allowed size of 10KB.'

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That is okay. Returning to dev not user

Comment on lines +496 to +497
msg=f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
raiseValueError(msg)
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Similar to the Base64 case, the error message leaks the size of decoded content. Use a generic error message:'Hex decoded content exceeds maximum allowed size of 10KB.'

Suggested change
msg=f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
raiseValueError(msg)
raiseValueError("Hex decoded content exceeds maximum allowed size of 10KB.")

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That is okay. Returning to dev not user

len(candidate_lower)>=3
andany(# Any 3-char chunk overlaps
candidate_lower[i :i+3]indetected_lower
foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The 3-character chunk overlap check at line 707-710 iterates with step=2 for efficiency, but this could miss valid overlaps. For example, ifcandidate_lower='abcdef' anddetected_lower contains 'bcd', the check at i=0 ('abc') and i=2 ('cde') would both miss it. Consider using step=1 or documenting why step=2 is acceptable for this security-critical matching.

Suggested change
foriinrange(0,len(candidate_lower)-2,2)# Step by2 for efficiency
foriinrange(0,len(candidate_lower)-2)# Step by1 to avoid missing overlaps

Copilot uses AI. Check for mistakes.
Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This is fine. This is the third check and stepping by 2 will only miss PII that is exactly 3 characters long and isn't caught by all of the other checks. Very rare and unrealistic case

asyncdeftest_pii_detects_base64_encoded_email()->None:
"""Base64-encoded email should be detected when flag is enabled."""
config=PIIConfig(entities=[PIIEntity.EMAIL_ADDRESS],block=False,detect_encoded_pii=True)
# am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com
Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nitpick] The comment states the Base64 string decodes to 'john@example.com', but the actual decoded value is 'john@example.com' which is correct. However, for consistency with other test comments that specify the exact encoding (e.g., line 383 shows the URL-encoded version), consider verifying this comment is accurate by decoding the Base64 string.

Suggested change
# am9obkBleGFtcGxlLmNvbQ==isbase64 for john@example.com
# am9obkBleGFtcGxlLmNvbQ==(base64 for'john@example.com')

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@gabor-openaigabor-openai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM

@gabor-openaigabor-openai merged commitbf0cb52 intomainNov 6, 2025
9 checks passed
@gabor-openaigabor-openai deleted the dev/steven/pii_updates branchNovember 6, 2025 00:52
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

Copilot code reviewCopilotCopilot left review comments

@gabor-openaigabor-openaigabor-openai approved these changes

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

3 participants

@steven10a@gabor-openai

[8]ページ先頭

©2009-2025 Movatter.jp