- Notifications
You must be signed in to change notification settings - Fork17
Use Presidio for masking#40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull Request Overview
This PR refactors thechecked_text field handling across guardrails and significantly enhances the PII detection guardrail with advanced security features.
Key changes:
- Removes
checked_textfrom guardrails that don't modify text (moderation, keywords, URLs, secret keys, etc.) - Keeps
checked_textonly in PII guardrail where text is actively masked - Adds advanced PII detection: Unicode normalization, encoded PII detection (Base64/URL/hex), custom CVV/BIC_SWIFT recognizers
- Refactors
_apply_preflight_modificationsto use Presidio's anonymizer engine directly for better performance and correctness
Reviewed Changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/guardrails/checks/text/pii.py | Major enhancement with Unicode normalization, encoded PII detection, custom recognizers (CVV, BIC_SWIFT), and Presidio anonymizer integration |
| src/guardrails/_base_client.py | Refactored PII masking to use guardrail'schecked_text directly and Presidio for structured content; removedsummary_text type |
| src/guardrails/checks/text/moderation.py | Removedchecked_text field (no text modification) |
| src/guardrails/checks/text/keywords.py | Removedchecked_text field (no text modification) |
| src/guardrails/checks/text/urls.py | Removedchecked_text field (no text modification) |
| src/guardrails/checks/text/secret_keys.py | Removedchecked_text field (no text modification) |
| src/guardrails/checks/text/prompt_injection_detection.py | Removedchecked_text field (no text modification) |
| src/guardrails/checks/text/hallucination_detection.py | Removedchecked_text parameter from error result creation |
| src/guardrails/checks/text/llm_base.py | Removedchecked_text parameter fromcreate_error_result function |
| tests/unit/test_base_client.py | Updated tests to reflect new PII guardrail result structure and removedsummary_text type |
| tests/unit/checks/test_pii.py | Added comprehensive tests for Unicode normalization and encoded PII detection |
| tests/unit/checks/test_keywords.py | Removed assertion onchecked_text field |
| tests/unit/test_safety_identifier.py | Removed trailing blank line |
| src/guardrails/utils/safety_identifier.py | Removed trailing blank line |
| pyproject.toml | Addedpresidio-anonymizer>=2.2.360 dependency |
| examples/basic/pii_mask_example.py | Added CVV, BIC_SWIFT entities and enabled encoded PII detection |
| docs/ref/checks/*.md | Updated documentation to removechecked_text field references where appropriate |
💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| len(candidate_lower)>=3 | ||
| andany(# Any 3-char chunk overlaps | ||
| candidate_lower[i :i+3]indetected_lower | ||
| foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency |
CopilotAINov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The overlap detection logic performs substring matching in a loop that could be expensive for large texts. For a candidate of length N, this generates N/2 substring checks. Consider using a more efficient algorithm like rolling hash or KMP for substring matching, or at least increase the step size from 2 to a larger value like 4 or 5 to reduce iterations while still catching meaningful overlaps.
| foriinrange(0,len(candidate_lower)-2,2)# Step by2 for efficiency | |
| foriinrange(0,len(candidate_lower)-2,4)# Step by4 for better efficiency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. We are only looking at detected entities for overlap (emails, ssn, location, etc...) so the strings will be relatively small and this is not an issue
| # Patterns for detecting encoded content | ||
| # Note: Hex must be checked BEFORE Base64 since hex strings can match Base64 pattern | ||
| _HEX_PATTERN=re.compile(r"\b[0-9a-fA-F]{24,}\b")# Reduced from 32 to 24 (12 bytes min) |
CopilotAINov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The hex pattern with a minimum of 24 characters will match many non-PII hex strings (UUIDs without dashes are 32 chars, but legitimate hex data like color codes or short hashes could be 24+ chars). This could lead to false positives where legitimate hex data is incorrectly flagged as encoded PII. Consider adding context-aware matching (e.g., preceded by 'hex:', 'encoded:', etc.) or increasing the minimum length to reduce false positives.
| _HEX_PATTERN=re.compile(r"\b[0-9a-fA-F]{24,}\b")# Reduced from 32 to 24 (12 bytes min) | |
| _HEX_PATTERN=re.compile( | |
| r"(?:\b(?:hex:|encoded:|0x)[ ]*)[0-9a-fA-F]{24,}\b" | |
| )# Only match hex strings with context prefix to reduce false positives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. This feature is opt-in for heightened security so higher false positives is acceptable. It's purpose is to detect intentional obfuscation, so checking context wouldn't work.
| # URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net) | ||
| # This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly | ||
| url_decoded=urllib.parse.unquote(decoded_text) | ||
| # If URL decoding changed the text, track encoded spans for masking | ||
| ifurl_decoded!=decoded_text: | ||
| # Find URL-encoded spans in the ORIGINAL text for masking purposes | ||
| formatchin_URL_ENCODED_PATTERN.finditer(text): | ||
| ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans): | ||
| continue | ||
| decoded_chunk=urllib.parse.unquote(match.group()) | ||
| ifdecoded_chunk!=match.group(): | ||
| candidates.append( | ||
| EncodedCandidate( | ||
| encoded_text=match.group(), | ||
| decoded_text=decoded_chunk, | ||
| encoding_type="url", | ||
| start=match.start(), | ||
| end=match.end(), | ||
| ) | ||
| ) | ||
| decoded_text=url_decoded |
CopilotAINov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The function callsurllib.parse.unquote on the entiredecoded_text after Base64/hex replacements have been made. This means if decoded_text contains a legitimate%XX sequence that was part of a Base64 decode (not URL encoding), it will be incorrectly decoded again. For example, if a Base64 string decodes to 'test%20value', the %20 will be converted to a space, which may not be the intended behavior. Consider tracking which parts are URL-encoded separately before applying unquote.
| # URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net) | |
| # This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly | |
| url_decoded=urllib.parse.unquote(decoded_text) | |
| # If URL decoding changed the text, track encoded spans for masking | |
| ifurl_decoded!=decoded_text: | |
| # Find URL-encoded spans in the ORIGINAL text for masking purposes | |
| formatchin_URL_ENCODED_PATTERN.finditer(text): | |
| ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans): | |
| continue | |
| decoded_chunk=urllib.parse.unquote(match.group()) | |
| ifdecoded_chunk!=match.group(): | |
| candidates.append( | |
| EncodedCandidate( | |
| encoded_text=match.group(), | |
| decoded_text=decoded_chunk, | |
| encoding_type="url", | |
| start=match.start(), | |
| end=match.end(), | |
| ) | |
| ) | |
| decoded_text=url_decoded | |
| # URL decode only actual URL-encoded spans (handles partial encodings like %6a%61%6e%65%40securemail.net) | |
| # This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly | |
| url_candidates= [] | |
| # Find URL-encoded spans in the ORIGINAL text for masking purposes | |
| formatchin_URL_ENCODED_PATTERN.finditer(text): | |
| ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans): | |
| continue | |
| decoded_chunk=urllib.parse.unquote(match.group()) | |
| ifdecoded_chunk!=match.group(): | |
| url_candidates.append( | |
| EncodedCandidate( | |
| encoded_text=match.group(), | |
| decoded_text=decoded_chunk, | |
| encoding_type="url", | |
| start=match.start(), | |
| end=match.end(), | |
| ) | |
| ) | |
| # Replace URL-encoded spans in decoded_text (from last to first to avoid offset issues) | |
| url_candidates.sort(key=lambdac:c.start,reverse=True) | |
| forcandidateinurl_candidates: | |
| decoded_text=decoded_text[:candidate.start]+candidate.decoded_text+decoded_text[candidate.end :] | |
| candidates.append(candidate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. This is acceptable
| iffound_entities: | ||
| # Mask the encoded version in original text | ||
| entity_marker=f"<{next(iter(found_entities))}_ENCODED>" |
CopilotAINov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
When multiple entity types overlap in the same encoded chunk, only the first entity type fromfound_entities is used for the marker. This could be misleading if an encoded string contains both an email and a phone number - only one will be indicated in the mask. Consider using all found entity types (e.g.,<EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation.
| entity_marker=f"<{next(iter(found_entities))}_ENCODED>" | |
| entity_marker=f"<{'|'.join(sorted(found_entities))}_ENCODED>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. The entire encoded span is still being masked before being sent to the LLM. All of the detected entities are still being listed in the response object if a developer wants to see them.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.
💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.
💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.
💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.
| has_overlap= ( | ||
| candidate_lowerindetected_lower# Candidate is substring of detection | ||
| ordetected_lowerincandidate_lower# Detection is substring of candidate | ||
| or ( | ||
| len(candidate_lower)>=3 | ||
| andany(# Any 3-char chunk overlaps | ||
| candidate_lower[i :i+3]indetected_lower | ||
| foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency | ||
| ) | ||
| ) | ||
| ) |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
[nitpick] The overlap detection logic in_mask_encoded_pii uses a nested loop with substring checks that could be O(n*m) in the worst case. For each candidate, it checks against all analyzer results, and within that, performs chunked substring matching. Consider using more efficient string matching algorithms (e.g., KMP or Boyer-Moore) for the chunked overlap detection, or pre-computing a set of 3-character chunks from detections for O(1) lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. Already addressed this comment. Checking small entities so this won't be an issue
| iffound_entities: | ||
| # Mask the encoded version in original text | ||
| entity_marker=f"<{next(iter(found_entities))}_ENCODED>" |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
When multiple entity types are found in a single encoded candidate (e.g., an encoded string containing both EMAIL_ADDRESS and PHONE_NUMBER), only the first entity type from the set is used for the marker. This could lead to misleading masking labels. Consider either using all entity types in the marker (e.g.,<EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation that only one entity type is shown when multiple are detected.
| entity_marker=f"<{next(iter(found_entities))}_ENCODED>" | |
| entity_types_str="|".join(sorted(found_entities)) | |
| entity_marker=f"<{entity_types_str}_ENCODED>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. Already addressed this comment
| ifdetected_valueincandidate.decoded_text: | ||
| # Mask the encoded version | ||
| entity_marker=f"<{entity_type}_ENCODED>" | ||
| masked=masked[:candidate.start]+entity_marker+masked[candidate.end :] | ||
| break |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The conditionif detected_value in candidate.decoded_text: may fail to find the correct candidate when multiple candidates exist or when the detected value spans multiple candidates. This could result in encoded PII not being masked. Consider checking overlap ranges more precisely using start/end positions or ensuring that the candidate list is properly sorted and all overlaps are checked.
| ifdetected_valueincandidate.decoded_text: | |
| # Mask the encoded version | |
| entity_marker=f"<{entity_type}_ENCODED>" | |
| masked=masked[:candidate.start]+entity_marker+masked[candidate.end :] | |
| break | |
| # Check if detection overlaps candidate's decoded text range | |
| candidate_decoded_start=getattr(candidate,"decoded_start",None) | |
| candidate_decoded_end=getattr(candidate,"decoded_end",None) | |
| ifcandidate_decoded_startisnotNoneandcandidate_decoded_endisnotNone: | |
| if ( | |
| result.start<candidate_decoded_end | |
| andresult.end>candidate_decoded_start | |
| ): | |
| # Mask the encoded version | |
| entity_marker=f"<{entity_type}_ENCODED>" | |
| masked=masked[:candidate.start]+entity_marker+masked[candidate.end :] | |
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. This is only with structured content which is checking each part individually. For unstructured content we have more sophisticated overlap checking
| # CVV/CVC recognizer (3-4 digits, often near credit card context) | ||
| cvv_pattern=Pattern( | ||
| name="cvv_pattern", | ||
| regex=r"\b(?:cvv|cvc|security\s*code|card\s*code)[:\s=]*(\d{3,4})\b", |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
[nitpick] The CVV pattern requires a label prefix ('cvv', 'cvc', 'security code', 'card code') before the digits. This means standalone 3-4 digit numbers won't be detected as CVV codes, which could miss obfuscated CVV values. While this reduces false positives, it creates a security gap. Consider documenting this limitation or adding a configurable option for stricter CVV detection that catches standalone 3-4 digit sequences in financial contexts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. I am okay with this simple approach and we don't want to over flag 3 digit numbers
| OpenAIResponseType=Union[Completion,ChatCompletion,ChatCompletionChunk,Response]# noqa: UP007 | ||
| # Text content types recognized in message content parts | ||
| _TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text"} |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The removal of 'summary_text' from_TEXT_CONTENT_TYPES means that content parts with type 'summary_text' will no longer have PII masking applied or text extracted. If 'summary_text' is still a valid content type in the system, this change could lead to PII leaks. Ensure this removal is intentional and that 'summary_text' is truly deprecated, or document why it should not be processed.
| _TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text"} | |
| _TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text","summary_text"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI.summary_text is not a valid input
tests/unit/test_base_client.py Outdated
| # Should detect and mask the obfuscated email | ||
| assert"<EMAIL_ADDRESS>"inmodified[0]["content"][0]["text"]# noqa: S101 | ||
| assert"@"notinmodified[0]["content"][0]["text"]or"@"notinmodified[0]["content"][0]["text"]# noqa: S101 |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The assertionassert '@' not in modified[0]['content'][0]['text'] or '@' not in modified[0]['content'][0]['text'] will pass if EITHER character is absent, which means the test passes even if the fullwidth '@' (@) is still present. This should useand instead ofor to ensure both the ASCII and fullwidth '@' characters are properly masked.
| assert"@"notinmodified[0]["content"][0]["text"]or"@"notinmodified[0]["content"][0]["text"]# noqa: S101 | |
| assert"@"notinmodified[0]["content"][0]["text"]and"@"notinmodified[0]["content"][0]["text"]# noqa: S101 |
| try: | ||
| decoded_bytes=base64.b64decode(text,validate=True) | ||
| # Security: Fail closed - reject content > 10KB to prevent memory DoS and PII bypass | ||
| iflen(decoded_bytes)>10_000: | ||
| msg= ( | ||
| f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB." | ||
| ) | ||
| raiseValueError(msg) |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The 10KB size limit for decoded content is enforced by raising a ValueError, but this exception is not caught in the calling code (_build_decoded_text), which could cause the entire PII check to fail rather than gracefully skipping oversized encoded content. Consider catching this ValueError and logging a warning instead, or document that requests with large encoded payloads will be rejected entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. We want to fail if there is a massive encoded file. Checking for encoded content is opt-in so users can not default to not checking encoded content if they wish
src/guardrails/_base_client.py Outdated
| decoded_text=normalized | ||
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This assignment to 'decoded_text' is unnecessary as it isredefined before this value is used.
| decoded_text = normalized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.
| # Zero-width and invisible Unicode characters that can be used to bypass detection | ||
| _ZERO_WIDTH_CHARS=re.compile( | ||
| r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner, BOM |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The comment mentions 'BOM' but the actual character \ufeff is the Zero Width No-Break Space, not the Byte Order Mark. While \ufeff can function as a BOM when at the beginning of a file, in the middle of text it's a zero-width no-break space. The comment should clarify this to avoid confusion: 'Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space'.
| r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner, BOM | |
| r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner,zero-width no-break space (\ufeff;BOM only at start of file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
[nit]
| msg= ( | ||
| f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB." | ||
| ) | ||
| raiseValueError(msg) |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The error message leaks the size of decoded content, which could help attackers fingerprint the system or understand size limits for crafted payloads. Consider using a generic error message without the specific size:'Base64 decoded content exceeds maximum allowed size of 10KB.'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That is okay. Returning to dev not user
| msg=f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB." | ||
| raiseValueError(msg) |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Similar to the Base64 case, the error message leaks the size of decoded content. Use a generic error message:'Hex decoded content exceeds maximum allowed size of 10KB.'
| msg=f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB." | |
| raiseValueError(msg) | |
| raiseValueError("Hex decoded content exceeds maximum allowed size of 10KB.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That is okay. Returning to dev not user
| len(candidate_lower)>=3 | ||
| andany(# Any 3-char chunk overlaps | ||
| candidate_lower[i :i+3]indetected_lower | ||
| foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The 3-character chunk overlap check at line 707-710 iterates with step=2 for efficiency, but this could miss valid overlaps. For example, ifcandidate_lower='abcdef' anddetected_lower contains 'bcd', the check at i=0 ('abc') and i=2 ('cde') would both miss it. Consider using step=1 or documenting why step=2 is acceptable for this security-critical matching.
| foriinrange(0,len(candidate_lower)-2,2)# Step by2 for efficiency | |
| foriinrange(0,len(candidate_lower)-2)# Step by1 to avoid missing overlaps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
WAI. This is fine. This is the third check and stepping by 2 will only miss PII that is exactly 3 characters long and isn't caught by all of the other checks. Very rare and unrealistic case
| asyncdeftest_pii_detects_base64_encoded_email()->None: | ||
| """Base64-encoded email should be detected when flag is enabled.""" | ||
| config=PIIConfig(entities=[PIIEntity.EMAIL_ADDRESS],block=False,detect_encoded_pii=True) | ||
| # am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com |
CopilotAINov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
[nitpick] The comment states the Base64 string decodes to 'john@example.com', but the actual decoded value is 'john@example.com' which is correct. However, for consistency with other test comments that specify the exact encoding (e.g., line 383 shows the URL-encoded version), consider verifying this comment is accurate by decoding the Base64 string.
| # am9obkBleGFtcGxlLmNvbQ==isbase64 for john@example.com | |
| # am9obkBleGFtcGxlLmNvbQ==(base64 for'john@example.com') |
gabor-openai left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM
bf0cb52 intomainUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Using Presidio's built in AnonymizerEngine to handle the PII masking
checked_textfield from checks that don't use themMade PII detection more robust against obfuscation attempts
Base64,URL, andHexencoded content for entity detectiondetect_encoded_piithat can be set via the config to turn this onAdded custom identifiers to Presidio
Updated docs and added more tests