Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Don't choke on (legitimately) invalidly encoded Unicode paths#467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
Byron merged 1 commit intomasterfromfix-dont-choke-on-invalid-unicode-paths
Jun 14, 2016

Conversation

nvie
Copy link
Contributor

@nvienvie commentedJun 7, 2016
edited
Loading

We've come across path names that contain bytes that are invalid in UTF-8 encoded strings, even though they're very rare. My assumption here is these commits have been created by an old (buggy?) version of Git, and now live in the tree objects with this data. Since we return only unicode strings for thea_path andb_path properties, we're not able to decode this string and thus choke when asking for the diff.

This PR fixes that by using "replace" semantics when decoding. This will effectively replace the illegal bytes\200 (or\x80) by \ufffd (= �).

Follow-up discussion

However, this also means that if you would want to git-blame this file, there's no good way of referencing this path, since it's inherently a bytes path. Normally, when we pass unicode paths to git-blame via GitPython's blame API, the paths get converted to UTF-8 right before issuing the external command. But there's no way of getting the original bytes back after the "replace" operation happened.

Example:

  • The input path isb'illegal-\x80.txt' (containing illegal byte\x80)
  • When decoded to UTF-8 with characters replaced, we get the unicode stringu'illegal-\ufffd.txt' (= "illegal-�.txt")
  • When encoding that in UTF-8, we findb'illegal-\xef\xbf\xbd.txt'

When we next passillegal-\xef\xbf\xbd.txt to git-blame, it will not be able to find this path. Perhaps it would be a good idea to not only return the decoded path strings, but also provide access to the raw bytes found, i.e. by exposinga_rawpath andb_rawpath, which would always be bytes? That way, you could still have the friendly "unicode paths" for most use cases, but use bytes if you need to speak the language of Git more accurately.

@Byron
Copy link
Member

@nvie Thanks for bringing this up ! I believe it's one of the worst decisions made in GitPython to attempt to decode everything using UTF-8 (or sometimes using whatever encoding Git hints us at). Especially for paths, doing this is plain wrong as the filesystem may use arbitrary encodings. This is the reason Rust usesOsStrings to represent these, as opposed toStrings which need to be valid UTF-8.
Unfortunately, back in the days Rust wasn't on my Radar, and also I didn't know what I know now.

That said, fixing the API in that regard might be a somewhat major undertaking certainly worth making. Without a major version upgrade, the proposed adjustment might be good enough.

The question is, what you as an API user think about these options - I'd be happy to hear your thoughts.

@ByronByron merged commitd22c40b intomasterJun 14, 2016
@ByronByron deleted the fix-dont-choke-on-invalid-unicode-paths branchJune 14, 2016 05:38
@nvie
Copy link
ContributorAuthor

nvie commentedJun 14, 2016
edited
Loading

I don't think it's too bad in practice. In most parts where paths hit the UI, you'd want to use adecode(errors='replace')'ed version of the raw bytes anyway. If there are invalidly encoded bytes, it's really just fine to see the � chars in there. So I don't think it's too bad that the result of a Diff returns unicode strings for the paths. In practice, they work in >99.9% of all cases.

However, sometimes there just is a need to use the raw bytes, especially when the result of command A (say, a diff), defines the input of command B (say, a blame). Only in these cases, you need to have the exact bytes. I would still show thedecode(errors='replace')'ed version of the path in the UI, but in order to ask Git for a blame on that file, I need access to the raw version.

I'll have a stab at adding those extra raw paths as properties.

nvie added a commit that referenced this pull requestJun 14, 2016
Previously, the following fields on Diff instances were assumed to bepassed in as unicode strings:  - `a_path`  - `b_path`  - `rename_from`  - `rename_to`However, since Git natively records paths as bytes, these maypotentially not have a valid unicode representation.This patch changes the Diff instance to instead take the followingequivalent fields that should be raw bytes instead:  - `a_rawpath`  - `b_rawpath`  - `raw_rename_from`  - `raw_rename_to`NOTE ON BACKWARD COMPATIBILITY:The original `a_path`, `b_path`, etc. fields are still available asproperties (rather than slots).  These properties now dynamically decodethe raw bytes into a unicode string (performing the potentiallydestructive operation of replacing invalid unicode chars by "�"'s).This means that all code using Diffs should remain backward compatible.The only exception is when people would manually construct Diffinstances by calling the constructor directly, in which case they shouldnow pass in bytes rather than unicode strings.See also the discussion on#467
@yarikoptic
Copy link
Contributor

yarikoptic commentedSep 1, 2016
edited
Loading

I guess related:#505 -- I have tried to run tests using TMPDIR with unicode in it... actually may be it is only marginally related since paths in that case are unicode outside of the repository, not within (as it is discussed here), correct?

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers
No reviews
Assignees

@ByronByron

Development

Successfully merging this pull request may close these issues.

3 participants
@nvie@Byron@yarikoptic

[8]ページ先頭

©2009-2025 Movatter.jp