Movatterモバイル変換

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR refactors thechecked_text field handling across guardrails and significantly enhances the PII detection guardrail with advanced security features.

Key changes:

Removeschecked_text from guardrails that don't modify text (moderation, keywords, URLs, secret keys, etc.)
Keepschecked_text only in PII guardrail where text is actively masked
Adds advanced PII detection: Unicode normalization, encoded PII detection (Base64/URL/hex), custom CVV/BIC_SWIFT recognizers
Refactors_apply_preflight_modifications to use Presidio's anonymizer engine directly for better performance and correctness

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
src/guardrails/checks/text/pii.py	Major enhancement with Unicode normalization, encoded PII detection, custom recognizers (CVV, BIC_SWIFT), and Presidio anonymizer integration
src/guardrails/_base_client.py	Refactored PII masking to use guardrail's`checked_text` directly and Presidio for structured content; removed`summary_text` type
src/guardrails/checks/text/moderation.py	Removed`checked_text` field (no text modification)
src/guardrails/checks/text/keywords.py	Removed`checked_text` field (no text modification)
src/guardrails/checks/text/urls.py	Removed`checked_text` field (no text modification)
src/guardrails/checks/text/secret_keys.py	Removed`checked_text` field (no text modification)
src/guardrails/checks/text/prompt_injection_detection.py	Removed`checked_text` field (no text modification)
src/guardrails/checks/text/hallucination_detection.py	Removed`checked_text` parameter from error result creation
src/guardrails/checks/text/llm_base.py	Removed`checked_text` parameter from`create_error_result` function
tests/unit/test_base_client.py	Updated tests to reflect new PII guardrail result structure and removed`summary_text` type
tests/unit/checks/test_pii.py	Added comprehensive tests for Unicode normalization and encoded PII detection
tests/unit/checks/test_keywords.py	Removed assertion on`checked_text` field
tests/unit/test_safety_identifier.py	Removed trailing blank line
src/guardrails/utils/safety_identifier.py	Removed trailing blank line
pyproject.toml	Added`presidio-anonymizer>=2.2.360` dependency
examples/basic/pii_mask_example.py	Added CVV, BIC_SWIFT entities and enabled encoded PII detection
docs/ref/checks/*.md	Updated documentation to remove`checked_text` field references where appropriate

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

src/guardrails/_base_client.py OutdatedShow resolvedHide resolved

		len(candidate_lower)>=3
		andany(# Any 3-char chunk overlaps
		candidate_lower[i :i+3]indetected_lower
		foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency

Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The overlap detection logic performs substring matching in a loop that could be expensive for large texts. For a candidate of length N, this generates N/2 substring checks. Consider using a more efficient algorithm like rolling hash or KMP for substring matching, or at least increase the step size from 2 to a larger value like 4 or 5 to reduce iterations while still catching meaningful overlaps.

Suggested change

	foriinrange(0,len(candidate_lower)-2,2)# Step by2 for efficiency
	foriinrange(0,len(candidate_lower)-2,4)# Step by4 for better efficiency

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. We are only looking at detected entities for overlap (emails, ssn, location, etc...) so the strings will be relatively small and this is not an issue


		# Patterns for detecting encoded content
		# Note: Hex must be checked BEFORE Base64 since hex strings can match Base64 pattern
		_HEX_PATTERN=re.compile(r"\b[0-9a-fA-F]{24,}\b")# Reduced from 32 to 24 (12 bytes min)

Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The hex pattern with a minimum of 24 characters will match many non-PII hex strings (UUIDs without dashes are 32 chars, but legitimate hex data like color codes or short hashes could be 24+ chars). This could lead to false positives where legitimate hex data is incorrectly flagged as encoded PII. Consider adding context-aware matching (e.g., preceded by 'hex:', 'encoded:', etc.) or increasing the minimum length to reduce false positives.

Suggested change

	_HEX_PATTERN=re.compile(r"\b[0-9a-fA-F]{24,}\b")# Reduced from 32 to 24 (12 bytes min)
	_HEX_PATTERN=re.compile(
	r"(?:\b(?:hex:\|encoded:\|0x)[ ]*)[0-9a-fA-F]{24,}\b"
	)# Only match hex strings with context prefix to reduce false positives

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This feature is opt-in for heightened security so higher false positives is acceptable. It's purpose is to detect intentional obfuscation, so checking context wouldn't work.

Comment on lines +537 to +559

		# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
		# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
		url_decoded=urllib.parse.unquote(decoded_text)

		# If URL decoding changed the text, track encoded spans for masking
		ifurl_decoded!=decoded_text:
		# Find URL-encoded spans in the ORIGINAL text for masking purposes
		formatchin_URL_ENCODED_PATTERN.finditer(text):
		ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans):
		continue

		decoded_chunk=urllib.parse.unquote(match.group())
		ifdecoded_chunk!=match.group():
		candidates.append(
		EncodedCandidate(
		encoded_text=match.group(),
		decoded_text=decoded_chunk,
		encoding_type="url",
		start=match.start(),
		end=match.end(),
		)
		)
		decoded_text=url_decoded

Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The function callsurllib.parse.unquote on the entiredecoded_text after Base64/hex replacements have been made. This means if decoded_text contains a legitimate%XX sequence that was part of a Base64 decode (not URL encoding), it will be incorrectly decoded again. For example, if a Base64 string decodes to 'test%20value', the %20 will be converted to a space, which may not be the intended behavior. Consider tracking which parts are URL-encoded separately before applying unquote.

Suggested change

	# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
	# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
	url_decoded=urllib.parse.unquote(decoded_text)

	# If URL decoding changed the text, track encoded spans for masking
	ifurl_decoded!=decoded_text:
	# Find URL-encoded spans in the ORIGINAL text for masking purposes
	formatchin_URL_ENCODED_PATTERN.finditer(text):
	ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans):
	continue

	decoded_chunk=urllib.parse.unquote(match.group())
	ifdecoded_chunk!=match.group():
	candidates.append(
	EncodedCandidate(
	encoded_text=match.group(),
	decoded_text=decoded_chunk,
	encoding_type="url",
	start=match.start(),
	end=match.end(),
	)
	)
	decoded_text=url_decoded
	# URL decode only actual URL-encoded spans (handles partial encodings like %6a%61%6e%65%40securemail.net)
	# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
	url_candidates= []
	# Find URL-encoded spans in the ORIGINAL text for masking purposes
	formatchin_URL_ENCODED_PATTERN.finditer(text):
	ifany(start<=match.start()<endorstart<match.end()<=endforstart,endinused_spans):
	continue

	decoded_chunk=urllib.parse.unquote(match.group())
	ifdecoded_chunk!=match.group():
	url_candidates.append(
	EncodedCandidate(
	encoded_text=match.group(),
	decoded_text=decoded_chunk,
	encoding_type="url",
	start=match.start(),
	end=match.end(),
	)
	)
	# Replace URL-encoded spans in decoded_text (from last to first to avoid offset issues)
	url_candidates.sort(key=lambdac:c.start,reverse=True)
	forcandidateinurl_candidates:
	decoded_text=decoded_text[:candidate.start]+candidate.decoded_text+decoded_text[candidate.end :]
	candidates.append(candidate)

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This is acceptable


		iffound_entities:
		# Mask the encoded version in original text
		entity_marker=f"<{next(iter(found_entities))}_ENCODED>"

Copy link

CopilotAINov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

When multiple entity types overlap in the same encoded chunk, only the first entity type fromfound_entities is used for the marker. This could be misleading if an encoded string contains both an email and a phone number - only one will be indicated in the mask. Consider using all found entity types (e.g.,<EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation.

Suggested change

	entity_marker=f"<{next(iter(found_entities))}_ENCODED>"
	entity_marker=f"<{'\|'.join(sorted(found_entities))}_ENCODED>"

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. The entire encoded span is still being masked before being sent to the LLM. All of the detected entities are still being listed in the response object if a developer wants to see them.

src/guardrails/checks/text/pii.pyShow resolvedHide resolved

Reject large encoded content as DOS

40aa80f

steven10a requested a review fromCopilot

November 5, 2025 15:38

CopilotAI reviewed

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

src/guardrails/checks/text/pii.py OutdatedShow resolvedHide resolved

src/guardrails/checks/text/pii.pyShow resolvedHide resolved

Handle hex errors

62ed5e0

steven10a requested a review fromCopilot

November 5, 2025 15:49

CopilotAI reviewed

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

steven10a requested a review fromCopilot

November 5, 2025 16:13

CopilotAI reviewed

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

steven10a requested a review fromCopilot

November 5, 2025 21:20

CopilotAI reviewed

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

src/guardrails/_base_client.py OutdatedShow resolvedHide resolved

fix structured output masking path

a901d25

steven10a requested a review fromCopilot

November 5, 2025 21:48

CopilotAI reviewed

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.

Comment on lines +702 to +712

		has_overlap= (
		candidate_lowerindetected_lower# Candidate is substring of detection
		ordetected_lowerincandidate_lower# Detection is substring of candidate
		or (
		len(candidate_lower)>=3
		andany(# Any 3-char chunk overlaps
		candidate_lower[i :i+3]indetected_lower
		foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency
		)
		)
		)

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nitpick] The overlap detection logic in_mask_encoded_pii uses a nested loop with substring checks that could be O(n*m) in the worst case. For each candidate, it checks against all analyzer results, and within that, performs chunked substring matching. Consider using more efficient string matching algorithms (e.g., KMP or Boyer-Moore) for the chunked overlap detection, or pre-computing a set of 3-character chunks from detections for O(1) lookup.

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. Already addressed this comment. Checking small entities so this won't be an issue

src/guardrails/_base_client.py


		iffound_entities:
		# Mask the encoded version in original text
		entity_marker=f"<{next(iter(found_entities))}_ENCODED>"

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

When multiple entity types are found in a single encoded candidate (e.g., an encoded string containing both EMAIL_ADDRESS and PHONE_NUMBER), only the first entity type from the set is used for the marker. This could lead to misleading masking labels. Consider either using all entity types in the marker (e.g.,<EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation that only one entity type is shown when multiple are detected.

Suggested change

	entity_marker=f"<{next(iter(found_entities))}_ENCODED>"
	entity_types_str="\|".join(sorted(found_entities))
	entity_marker=f"<{entity_types_str}_ENCODED>"

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. Already addressed this comment

Comment on lines +322 to +326

		ifdetected_valueincandidate.decoded_text:
		# Mask the encoded version
		entity_marker=f"<{entity_type}_ENCODED>"
		masked=masked[:candidate.start]+entity_marker+masked[candidate.end :]
		break

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The conditionif detected_value in candidate.decoded_text: may fail to find the correct candidate when multiple candidates exist or when the detected value spans multiple candidates. This could result in encoded PII not being masked. Consider checking overlap ranges more precisely using start/end positions or ensuring that the candidate list is properly sorted and all overlaps are checked.

Suggested change

	ifdetected_valueincandidate.decoded_text:
	# Mask the encoded version
	entity_marker=f"<{entity_type}_ENCODED>"
	masked=masked[:candidate.start]+entity_marker+masked[candidate.end :]
	break
	# Check if detection overlaps candidate's decoded text range
	candidate_decoded_start=getattr(candidate,"decoded_start",None)
	candidate_decoded_end=getattr(candidate,"decoded_end",None)
	ifcandidate_decoded_startisnotNoneandcandidate_decoded_endisnotNone:
	if (
	result.start<candidate_decoded_end
	andresult.end>candidate_decoded_start
	):
	# Mask the encoded version
	entity_marker=f"<{entity_type}_ENCODED>"
	masked=masked[:candidate.start]+entity_marker+masked[candidate.end :]
	break

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This is only with structured content which is checking each part individually. For unstructured content we have more sophisticated overlap checking

src/guardrails/_base_client.py

		# CVV/CVC recognizer (3-4 digits, often near credit card context)
		cvv_pattern=Pattern(
		name="cvv_pattern",
		regex=r"\b(?:cvv\|cvc\|security\scode\|card\scode)[:\s=]*(\d{3,4})\b",

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nitpick] The CVV pattern requires a label prefix ('cvv', 'cvc', 'security code', 'card code') before the digits. This means standalone 3-4 digit numbers won't be detected as CVV codes, which could miss obfuscated CVV values. While this reduces false positives, it creates a security gap. Consider documenting this limitation or adding a configurable option for stricter CVV detection that catches standalone 3-4 digit sequences in financial contexts.

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. I am okay with this simple approach and we don't want to over flag 3 digit numbers

		OpenAIResponseType=Union[Completion,ChatCompletion,ChatCompletionChunk,Response]# noqa: UP007

		# Text content types recognized in message content parts
		_TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text"}

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The removal of 'summary_text' from_TEXT_CONTENT_TYPES means that content parts with type 'summary_text' will no longer have PII masking applied or text extracted. If 'summary_text' is still a valid content type in the system, this change could lead to PII leaks. Ensure this removal is intentional and that 'summary_text' is truly deprecated, or document why it should not be processed.

Suggested change

	_TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text"}
	_TEXT_CONTENT_TYPES:Final[set[str]]= {"text","input_text","output_text","summary_text"}

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI.summary_text is not a valid input

tests/unit/test_base_client.py Outdated


		# Should detect and mask the obfuscated email
		assert"<EMAIL_ADDRESS>"inmodified[0]["content"][0]["text"]# noqa: S101
		assert"@"notinmodified[0]["content"][0]["text"]or"＠"notinmodified[0]["content"][0]["text"]# noqa: S101

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The assertionassert '@' not in modified[0]['content'][0]['text'] or '＠' not in modified[0]['content'][0]['text'] will pass if EITHER character is absent, which means the test passes even if the fullwidth '@' (＠) is still present. This should useand instead ofor to ensure both the ASCII and fullwidth '@' characters are properly masked.

Suggested change

	assert"@"notinmodified[0]["content"][0]["text"]or"＠"notinmodified[0]["content"][0]["text"]# noqa: S101
	assert"@"notinmodified[0]["content"][0]["text"]and"＠"notinmodified[0]["content"][0]["text"]# noqa: S101

Copilot uses AI. Check for mistakes.

Comment on lines +459 to +466

		try:
		decoded_bytes=base64.b64decode(text,validate=True)
		# Security: Fail closed - reject content > 10KB to prevent memory DoS and PII bypass
		iflen(decoded_bytes)>10_000:
		msg= (
		f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
		)
		raiseValueError(msg)

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The 10KB size limit for decoded content is enforced by raising a ValueError, but this exception is not caught in the calling code (_build_decoded_text), which could cause the entire PII check to fail rather than gracefully skipping oversized encoded content. Consider catching this ValueError and logging a warning instead, or document that requests with large encoded payloads will be rejected entirely.

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. We want to fail if there is a massive encoded file. Checking for encoded content is opt-in so users can not default to not checking encoded content if they wish

src/guardrails/_base_client.py Outdated

Comment on lines 290 to 291

		decoded_text=normalized

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This assignment to 'decoded_text' is unnecessary as it isredefined before this value is used.

Suggested change

decoded_text = normalized

Copilot uses AI. Check for mistakes.

address copilot nits

7be1b04

steven10a requested review fromCopilot andgabor-openai

November 5, 2025 22:53

CopilotAI reviewed

Copy link

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

💡Add Copilot custom instructions for smarter, more guided reviews.Learn how to get started.


		# Zero-width and invisible Unicode characters that can be used to bypass detection
		_ZERO_WIDTH_CHARS=re.compile(
		r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner, BOM

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The comment mentions 'BOM' but the actual character \ufeff is the Zero Width No-Break Space, not the Byte Order Mark. While \ufeff can function as a BOM when at the beginning of a file, in the middle of text it's a zero-width no-break space. The comment should clarify this to avoid confusion: 'Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space'.

Suggested change

	r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner, BOM
	r"[\u200b\u200c\u200d\u2060\ufeff]"# Zero-width space, ZWNJ, ZWJ, word joiner,zero-width no-break space (\ufeff;BOM only at start of file)

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nit]

Comment on lines +463 to +466

		msg= (
		f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
		)
		raiseValueError(msg)

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The error message leaks the size of decoded content, which could help attackers fingerprint the system or understand size limits for crafted payloads. Consider using a generic error message without the specific size:'Base64 decoded content exceeds maximum allowed size of 10KB.'

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That is okay. Returning to dev not user

Comment on lines +496 to +497

		msg=f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
		raiseValueError(msg)

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Similar to the Base64 case, the error message leaks the size of decoded content. Use a generic error message:'Hex decoded content exceeds maximum allowed size of 10KB.'

Suggested change

	msg=f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
	raiseValueError(msg)
	raiseValueError("Hex decoded content exceeds maximum allowed size of 10KB.")

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That is okay. Returning to dev not user

tests/unit/checks/test_pii.py

		len(candidate_lower)>=3
		andany(# Any 3-char chunk overlaps
		candidate_lower[i :i+3]indetected_lower
		foriinrange(0,len(candidate_lower)-2,2)# Step by 2 for efficiency

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The 3-character chunk overlap check at line 707-710 iterates with step=2 for efficiency, but this could miss valid overlaps. For example, ifcandidate_lower='abcdef' anddetected_lower contains 'bcd', the check at i=0 ('abc') and i=2 ('cde') would both miss it. Consider using step=1 or documenting why step=2 is acceptable for this security-critical matching.

Suggested change

	foriinrange(0,len(candidate_lower)-2,2)# Step by2 for efficiency
	foriinrange(0,len(candidate_lower)-2)# Step by1 to avoid missing overlaps

Copilot uses AI. Check for mistakes.

Copy link

CollaboratorAuthor

steven10aNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

WAI. This is fine. This is the third check and stepping by 2 will only miss PII that is exactly 3 characters long and isn't caught by all of the other checks. Very rare and unrealistic case

		asyncdeftest_pii_detects_base64_encoded_email()->None:
		"""Base64-encoded email should be detected when flag is enabled."""
		config=PIIConfig(entities=[PIIEntity.EMAIL_ADDRESS],block=False,detect_encoded_pii=True)
		# am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com

Copy link

CopilotAINov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

[nitpick] The comment states the Base64 string decodes to 'john@example.com', but the actual decoded value is 'john@example.com' which is correct. However, for consistency with other test comments that specify the exact encoding (e.g., line 383 shows the URL-encoded version), consider verifying this comment is accurate by decoding the Base64 string.

Suggested change

	# am9obkBleGFtcGxlLmNvbQ==isbase64 for john@example.com
	# am9obkBleGFtcGxlLmNvbQ==(base64 for'john@example.com')

Copilot uses AI. Check for mistakes.

gabor-openai approved these changes

Nov 6, 2025