Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-84013: normalize directory contents during import#113447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
ronaldoussoren wants to merge11 commits intopython:main
base:main
Choose a base branch
Loading
fromronaldoussoren:gh-84013

Conversation

ronaldoussoren
Copy link
Contributor

@ronaldoussorenronaldoussoren commentedDec 24, 2023
edited by bedevere-appbot
Loading

Identifiers in Python code are normalized to NFKC and there is no guarantee that names in the file system are normalized.

Teach the FileFinder class about normalization, but only do this when the to-be-imported name contains non-ASCII characters.

Identifiers in Python code are normalized to NFKC andthere is no guarantee that names in the file system arenormalized.Teach the FileFinder class about normalization, but onlydo this when the to-be-imported name contains non-ASCIIcharacters.
@ronaldoussoren
Copy link
ContributorAuthor

The PR is a draft primarily because the tests are incomplete:

  1. No tests using PYTHONCASEOK
  2. No new unit tests for the FileFinder class

@ronaldoussoren
Copy link
ContributorAuthor

ronaldoussoren commentedDec 27, 2023
edited
Loading

For some reason this PR doesn't work on any system but macOS, at least as far as the tests are concerned. I'll see if I can find the problem for Linux, I don't have a system where I can develop on Windows.

This has been resolved by keeping the unnormalised name around because this is needed to match filesystem behaviour. Worked on macOS because the kernel will normalise names there.

Copy link
Member

@ericsnowcurrentlyericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Comment on lines +1622 to +1623
if not tail_module.isascii() and not self._cache_is_normalized:
self._normalize_cache()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

For the sake of__import__() andimportlib.import_module() we will probably need to either normalizecache_module or preserve the original name in the cache. The former sounds expensive so the latter might be the way to go.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Regardless, we should probably have some test cases that use one of those functions rather than syntax.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I changed the code to keep the unnormalised name around, we need the unnormalised name to access the filesystem and need the normalised name to compare totail_module (which will generally be normalised by the compiler).

I agree that there should be more test cases, I started out with just the minimal test from the issue to see if I can get this to work without bootstrapping issues and get some feedback. I'll definitely add more tests when this approach is seen as acceptable by@brettcannon (as the importlib expert).

ericsnowcurrently reacted with thumbs up emoji
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If the cache always keeps the raw name then can we drop_cache_is_normalized?

@ericsnowcurrently
Copy link
Member

Also, I meant to say that I think you're on the right track.

ronaldoussoren reacted with thumbs up emoji

@ronaldoussoren
Copy link
ContributorAuthor

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Good question, whatever we choose should be documented.

The names passed to__import__ may not be normalised if called by a user, so either the implementation of that function needs to normalise its arguments, or finders would have to do this. From a usability perspective it's probably better to normalise names__import__ and document this in the importlib documentation for finders.

@ericsnowcurrently
Copy link
Member

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC@brettcannon@ncoghlan@warsaw@methane@vstinner

From a usability perspective it's probably better to normalise names__import__ and document this in the importlib documentation for finders.

Note that the bytecode for an import statement calls__import__() only if it is user-defined. Otherwise, when__import__() is the default, the bytecode callsPyImport_ImportModuleLevelObject() directly.1 This does present an opportunity to normalize in__import__() without double-normalizing for import statements. Either way, though, whatever we document should clearly indicate which builtin/stdlib functions normalize and which don't.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

Footnotes

  1. This is an optimization. IIRC, there was a period of time where we always just called__import__(). Also, for a little while the default implementation of__import__() wasimportlib.__import__(). Nowimportlib.__import__() is never involved (in CPython at least) unless called directly by a user.

@ericsnowcurrently
Copy link
Member

FWIW, one solution that crossed my mind is to modify the parser, since that is where the PEP 3131 normalization takes place, AKA the root of the problem. The parser could preserve the original unnormalized identifier and we'd use it where appropriate. For instance, the import statement bytecode could use the original name, while the normalized identifier would be used as the bound name. The discrepancy between the bound name and thesys.modules key (and module__name__) might be a problem though.

Relatedly, it may be worth adding an optional arg to__import__() for the unnormalized module name. If not provided, it would default to callingunicodedata.normalize('NFKC, name). If the parser preserved the original name then that could be passed as the extra arg. Hopefully we don't need to add a new arg toimport()` though.

@ronaldoussoren
Copy link
ContributorAuthor

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

The language reference already says all identifiers are normalised to NFKC, seehttps://docs.python.org/3/reference/lexical_analysis.html#identifiers. The PEP, and more importantly, the import system specification (PEP 451 and its predecessors) failed to fully think through the consequences of this when loading modules from the filesystem. That's not really anyones fault, Unicode and in particular Unicode normalisation are hard.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC@brettcannon@ncoghlan@warsaw@methane@vstinner

Yes please ;-). I intend to work on fixes the issues with this PR on Linux (and that will likely fix the issues on Windows as well), but beyond that I'd like some advise on the likelihood that the change will be accepted.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

I guess that not many people write modules with non-ascii names, and those that do quickly learn that this doesn't always work. The forks that want to use non-ascii names are also more likely to be less familiar with English.

P.S. I do have to warn that I'll likely be less responsive after new year. I had a lot time of from work in December, but expect that I'll be distracted by $WORK in January. I'll continue working on this, but slower.

ericsnowcurrently reacted with thumbs up emoji

The finder cache is now a mapping from (normalized) namesto actual filesystem names. With some luck this fixes testfailures on Linux and Windows.
@ronaldoussoren
Copy link
ContributorAuthor

ronaldoussoren commentedDec 29, 2023
edited
Loading

There are four ways that names get into import lib:

  1. import statements without overriding__import__: The parser will normalise identifiers, the finder will see a normalised name
  2. import statement with overriding__import__: same as 1 or 3 depending on whether or note__import__ changes the name to be imported
  3. Direct calling of__import__ orimport lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.
  4. Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

It's probably best to change the implementation of__import__ andimport_module to normalise the name and document that the name passed to finders must be normalised.

The impact of the former should be limited because as@ericsnowcurrently notes these functions aren't used in normal import code paths.

Requiring that names passed to finders must be normalized is technically a backward incompatible change, although in practice finders already mostly see normalised names. Normalising names in finders can have a performance impact due to unnecessary normalising of names, although it is unclear to me how significant that impact would be given that the number of non-ASCII module names is likely very low.

UPDATE: In the current version of the PR__import__ andimport_module normalise and the documentation for loaders was updated to mention that the names must be normalized by the caller.

ronaldoussorenand others added4 commitsDecember 29, 2023 12:57
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
1. Add a NEWS entry2. __import__ and find_module will now normalize their   arguments as needed3. Update the importlib documentation to note that the   name passed to finders must be normalized.
@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
is_namespace = _path_isdir(base_path)
# Check for a file w/ a proper suffix exists.
for suffix, loader_class in self._loaders:
# XXX: Why is ValueError caught here?
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This change worries me a little because I don't understand why the original code catchesValueError and then bails out.

Removing this like I did here does not result in test failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ifgit blame isn't helpful then it can probably be removed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

Comment on lines +1 to +3
It is now possible to import modules from the filesystem regardless of how
the name is normalized in the filesystem. This in particular affects
importing modules with a name that contains accented latin characters.
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This probably also needs a What's New entry, lets worry about that when the PR is closer to acceptance to avoid creating merge conflicts.

@ronaldoussoren
Copy link
ContributorAuthor

ronaldoussoren commentedDec 29, 2023
edited
Loading

Did I mention Unicode is hard?

In NFKC normalization there are some non-ASCII code points that normalise to code points that are ASCII, such as "ff" (the ligature for "ff"). This leads to some "interesting" behaviour with my PR.

Given a file named "fluffy.py" (with ligature) you currently get anImportError when trying to import "fluffy". With my patch the import will not raise anImportError when there has first been an import for some non-ASCII name that was presented to the finder for the directory containing "fluffy.py".

importrandomifrandom.choice([True,False]):try:importcaféexceptImportError:passimportfluffy

This will work 50% of the time and raiseImportError the other 50%. This will AFAIK only affect names in latin script, other languages won't normalise to ASCII strings from my limited understanding of normalization rules.

Another way this PR can lead to unexpected behaviour changes is to have two source files with different normalisations, one that happens to be NFKC normalized and one not (for example the "Mädchen.py" files I used intest_import.py). Currently the NFKC normalized file is picked up, with this PR the other might get picked up depending on the order of in whichlistdir iterates over the directory.

In practice I don't expect these to be a problem:

  • The easiest way to enter accented characters in NFKC normalized values (that islen("ü") == 1) at least on macOS, which means you'll have to go the extra mile to get a file name that isn't normalized NFKC on Linux
  • MacOS normalises names to NFD in the filesystem, and won't allow creating two files whose names only differs by their normalization (tested on an APFS filesystem, the older HFS+ should be the same).

What I don't know if there are realistic failure modes for the second scenario with non-latin scripts.

@ericsnowcurrently
Copy link
Member

There are four ways that names get into import lib:

Yeah, I think you've captured that well.

Ideally, the mechanism for auto-normalization would baked in to the core import machinery itself, such that we didn't have to modify specific user-facing functions or require any extra attention from authors/maintainers of import hooks. However, I'm not sure that is something we could do and without impacting the performance of import statements. (I could be wrong though.)

  1. import statements without overriding__import__: The parser will normalise identifiers, the finder will see a normalised name
  2. import statement with overriding__import__: same as 1 or 3 depending on whether or note__import__ changes the name to be imported

The function is still called with the parser-normalized name though, right? So in this case the non-default__import__() still wouldn't need to NKFC-normalize the name. However, itwould need to accommodate any adjustments we made if the original, unnormalized name were preserved by the parser and passed through. Then again, a non-default__import__() would still need to normalize the name if called directly (which is definitely messy).

  1. Direct calling of__import__ orimport lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.

FWIW,importlib.__import__() should probably be lumped in withimportlib.import_module(). Almost any change to one should match the other. The only caveat is thatimportlib.__import__() may be set asbuiltins.__import__(). It might even be the default one for some Python implementations. That was certainly the design goal (and was what we did in CPython for a release or two, IIRC).

importlib.util.find_spec() should be treated the same asimportlib.import_module() as well, but without the caveat.

  1. Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

This applies to user-defined finders, both metapath finders (sys.meta_path) and path finders (viasys.path_hooks &sys.path_importer_cache()).

This also brings up perhaps a fifth group: user-defined finders. Or is that what you meant with (4)?

It's probably best to change the implementation of__import__ andimport_module to normalise the name and document that the name passed to finders must be normalised.

That seems reasonable. I don't expect that they are used so widely or intensely that it would have a performance impact.

When modifying these functions, we do have to be careful about how/where we do it. We don't want to impactPyImport_ImportModuleLevelObject(), which is what the import statement uses directly by default.PyImport_ImportModuleLevelObject(), in turn, callsimportlib._bootstrap._find_and_load(), so we need to be careful with that.

For__import__() we'd want to make any changes strictly inbuiltin___import___impl() (Python/bltinmodule.c). Forimportlib.import_module() (and thusimportlib.util.find_spec() andmaybeimportlib.__import__()) we could modify each function directly (or perhaps inimportlib._bootstrap._gcd_import() but probably not). Again, we just have to remember to avoidimportlib._bootstrap._find_and_load(). (Forimportlib.util.find_spec() we should also avoid touchingimportlib._bootstrap._find_spec().)

@ericsnowcurrently
Copy link
Member

Let's step back for a moment to make sure we're thinking about the right things. IIUC, at the end of the day, the key objective here boils down toFileFinder correctly handling the following situations:

  • module name is non-ascii, NFKC-normal (via PEP 3131)
    • filesystem NFKC-normalizes filenames (i.e. the_path_cache situation you've addressed in this PR)
    • filesystem does not NFKC-normalize filenames and the corresponding filename isnot normalized
  • module name is non-ascii but not NFKC-normal (viaimportlib.import_module(), etc.)
    • filesystem NFKC-normalizes filenames
    • filesystem does not NFKC-normalize filenames but the corresponding filenameis normalized

Is that right?

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled byFileFinder.

(It's also notable that we accommodate the matter of case-sensitivity inFileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely onFileFinder.)

""" Normalize 'name' in NKFC form """
global _unicodedata_normalize
if _unicodedata_normalize is None:
from unicodedata import _normalize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
fromunicodedataimport_normalize
fromunicodedataimport_normalizeas_unicodedata_normalize

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I get an import error for_normalize but notnormalize in the REPL. Is that expected?

Comment on lines +1622 to +1623
if not tail_module.isascii() and not self._cache_is_normalized:
self._normalize_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If the cache always keeps the raw name then can we drop_cache_is_normalized?

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
is_namespace = _path_isdir(base_path)
# Check for a file w/ a proper suffix exists.
for suffix, loader_class in self._loaders:
# XXX: Why is ValueError caught here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ifgit blame isn't helpful then it can probably be removed.

@brettcannon
Copy link
Member

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled byFileFinder.

(It's also notable that we accommodate the matter of case-sensitivity inFileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely onFileFinder.)

I think I agree w/ Eric that this feels like a problem similar to case-sensitivity. As long as the comparison needs are just at the file system boundary and not e.g.sys.modules then I think we have prior art w/ case-sensitivity.

@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
is_namespace = _path_isdir(base_path)
# Check for a file w/ a proper suffix exists.
for suffix, loader_class in self._loaders:
# XXX: Why is ValueError caught here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@brettcannonbrettcannonbrettcannon left review comments

@ericsnowcurrentlyericsnowcurrentlyericsnowcurrently left review comments

@serhiy-storchakaserhiy-storchakaserhiy-storchaka left review comments

@ncoghlanncoghlanAwaiting requested review from ncoghlanncoghlan is a code owner

@warsawwarsawAwaiting requested review from warsawwarsaw is a code owner

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

4 participants
@ronaldoussoren@ericsnowcurrently@brettcannon@serhiy-storchaka

[8]ページ先頭

©2009-2025 Movatter.jp