Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork938
Don't choke on (legitimately) invalidly encoded Unicode paths#467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
@nvie Thanks for bringing this up ! I believe it's one of the worst decisions made in GitPython to attempt to decode everything using UTF-8 (or sometimes using whatever encoding Git hints us at). Especially for paths, doing this is plain wrong as the filesystem may use arbitrary encodings. This is the reason Rust usesOsStrings to represent these, as opposed toStrings which need to be valid UTF-8. That said, fixing the API in that regard might be a somewhat major undertaking certainly worth making. Without a major version upgrade, the proposed adjustment might be good enough. The question is, what you as an API user think about these options - I'd be happy to hear your thoughts. |
nvie commentedJun 14, 2016 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I don't think it's too bad in practice. In most parts where paths hit the UI, you'd want to use a However, sometimes there just is a need to use the raw bytes, especially when the result of command A (say, a diff), defines the input of command B (say, a blame). Only in these cases, you need to have the exact bytes. I would still show the I'll have a stab at adding those extra raw paths as properties. |
Previously, the following fields on Diff instances were assumed to bepassed in as unicode strings: - `a_path` - `b_path` - `rename_from` - `rename_to`However, since Git natively records paths as bytes, these maypotentially not have a valid unicode representation.This patch changes the Diff instance to instead take the followingequivalent fields that should be raw bytes instead: - `a_rawpath` - `b_rawpath` - `raw_rename_from` - `raw_rename_to`NOTE ON BACKWARD COMPATIBILITY:The original `a_path`, `b_path`, etc. fields are still available asproperties (rather than slots). These properties now dynamically decodethe raw bytes into a unicode string (performing the potentiallydestructive operation of replacing invalid unicode chars by "�"'s).This means that all code using Diffs should remain backward compatible.The only exception is when people would manually construct Diffinstances by calling the constructor directly, in which case they shouldnow pass in bytes rather than unicode strings.See also the discussion on#467
yarikoptic commentedSep 1, 2016 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I guess related:#505 -- I have tried to run tests using TMPDIR with unicode in it... actually may be it is only marginally related since paths in that case are unicode outside of the repository, not within (as it is discussed here), correct? |
Uh oh!
There was an error while loading.Please reload this page.
We've come across path names that contain bytes that are invalid in UTF-8 encoded strings, even though they're very rare. My assumption here is these commits have been created by an old (buggy?) version of Git, and now live in the tree objects with this data. Since we return only unicode strings for the
a_path
andb_path
properties, we're not able to decode this string and thus choke when asking for the diff.This PR fixes that by using "replace" semantics when decoding. This will effectively replace the illegal bytes
\200
(or\x80
) by \ufffd (= �).Follow-up discussion
However, this also means that if you would want to git-blame this file, there's no good way of referencing this path, since it's inherently a bytes path. Normally, when we pass unicode paths to git-blame via GitPython's blame API, the paths get converted to UTF-8 right before issuing the external command. But there's no way of getting the original bytes back after the "replace" operation happened.
Example:
b'illegal-\x80.txt'
(containing illegal byte\x80
)u'illegal-\ufffd.txt'
(= "illegal-�.txt")b'illegal-\xef\xbf\xbd.txt'
When we next pass
illegal-\xef\xbf\xbd.txt
to git-blame, it will not be able to find this path. Perhaps it would be a good idea to not only return the decoded path strings, but also provide access to the raw bytes found, i.e. by exposinga_rawpath
andb_rawpath
, which would always be bytes? That way, you could still have the friendly "unicode paths" for most use cases, but use bytes if you need to speak the language of Git more accurately.