Fix UnicodeDecodeError when reading packed-refs with non-UTF8 characters #2091
Fixes gitpython-developers#2064

The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or non-UTF8 systems). Previously, opening the file with encoding='UTF-8' would raise UnicodeDecodeError.

Changes:
- Add errors='surrogateescape' to the open() call in _iter_packed_refs()
- This allows reading files with arbitrary byte sequences while still treating valid UTF-8 as text
- Add test that verifies non-UTF8 packed-refs can be read successfully

The 'surrogateescape' error handler is the standard Python approach for handling potentially non-UTF8 data in filesystem operations, as it preserves the original bytes in a reversible way.
Pull request overview
This PR fixes a UnicodeDecodeError that occurred when GitPython attempted to read packed-refs files containing ref names encoded with non-UTF-8 character encodings (e.g., Latin-1 encoded tag names from older Git versions). The fix uses Python's surrogateescape error handler, which is the standard approach for handling filesystem operations with potentially mixed or unknown encodings.
Key changes:
- Adds errors='surrogateescape' parameter to file reading in _iter_packed_refs() method
- Adds comprehensive test that reproduces and verifies the fix for the Unicode decoding issue
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| git/refs/symbolic.py | Adds errors='surrogateescape' to the packed-refs file reader to handle non-UTF8 encoded ref names gracefully |
| test/test_refs.py | Adds test case that creates a packed-refs file with Latin-1 encoded ref name and verifies it can be read without errors |
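
The behaviour described in the table can be sketched outside GitPython. The snippet below is a minimal illustration, not the code from git/refs/symbolic.py or test/test_refs.py: the helper read_ref_names, the SHA, and the tag name are all hypothetical, and the parsing is deliberately simplified. It writes a packed-refs style file containing one Latin-1 encoded tag name, then reads it once with the default strict handler and once with errors='surrogateescape'.

```python
import os
import tempfile

# Hypothetical packed-refs content: the tag name contains byte 0xE9
# ("é" in Latin-1), which is not valid UTF-8 on its own.
PACKED_REFS = (
    b"# pack-refs with: peeled fully-peeled sorted\n"
    b"0123456789abcdef0123456789abcdef01234567 refs/tags/caf\xe9\n"
)


def read_ref_names(path, errors="strict"):
    """Return the ref names found in a packed-refs style file (simplified)."""
    names = []
    with open(path, "rt", encoding="UTF-8", errors=errors) as fp:
        for line in fp:
            line = line.strip()
            if not line or line.startswith(("#", "^")):
                continue  # skip the header, peeled entries, and blank lines
            _sha, name = line.split(" ", 1)
            names.append(name)
    return names


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packed-refs")
    with open(path, "wb") as fp:
        fp.write(PACKED_REFS)

    # Strict decoding fails on the 0xE9 byte.
    try:
        read_ref_names(path)
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # With surrogateescape the same file is read without an exception.
    tolerant_names = read_ref_names(path, errors="surrogateescape")
```

With the tolerant handler the name comes back as a str whose last character is a lone surrogate standing in for the undecodable byte, so callers that later print or encode the name need to use the same handler.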
Summary
Fixes #2064
The packed-refs file can contain ref names that are not valid UTF-8 (e.g., Latin-1 encoded tag names created by older Git versions or systems with different locale settings). Previously, GitPython would fail with UnicodeDecodeError when reading such files.
Reproduction
As described in #2064:
Before fix:
After fix: Successfully reads all 101 tags.
Changes
- Add errors='surrogateescape' to the open() call in _iter_packed_refs()
Technical Details
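The surrogateescape error handler is Python's standard approach for handling potentially non-UTF8 data in filesystem operations. It maps each byte that cannot be decoded to a lone surrogate code point (\uDC80-\uDCFF), which preserves the original bytes in a reversible way while still treating valid UTF-8 as text.
This is the same approach used by Python's os.fsdecode() and is recommended for filesystem operations where encoding may be unknown or mixed.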
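
The reversibility mentioned above can be checked at the byte level. This is a small sketch using only bytes.decode and str.encode; the ref name is a made-up example, not one taken from #2064.

```python
raw = b"refs/tags/caf\xe9"  # hypothetical Latin-1 encoded ref name

text = raw.decode("utf-8", errors="surrogateescape")
# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
escaped = text[-1]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, so the escaped string
# cannot be written out as plain UTF-8 by accident.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The trade-off is that a ref name read this way is only safe to pass back to the filesystem or to Git with the same error handler; a strict encode of the same string raises UnicodeEncodeError.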