NotificationsYou must be signed in to change notification settings
Fork32.3k
Star67.8k

gh-84013: normalize directory contents during import#113447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Open

ronaldoussoren wants to merge11 commits intopython:main

base:main

Choose a base branch

fromronaldoussoren:gh-84013

Open

gh-84013: normalize directory contents during import#113447

ronaldoussoren wants to merge11 commits intopython:mainfromronaldoussoren:gh-84013

Conversation

Copy link

Contributor

ronaldoussoren commentedDec 24, 2023•
edited by bedevere-appbot
Loading

Identifiers in Python code are normalized to NFKC and there is no guarantee that names in the file system are normalized.

Teach the FileFinder class about normalization, but only do this when the to-be-imported name contains non-ASCII characters.

Issue:Modules with decomposable characters in module name not found on macOS #84013

pythongh-84013: normalize directory contents during import

b7ba389

Identifiers in Python code are normalized to NFKC andthere is no guarantee that names in the file system arenormalized.Teach the FileFinder class about normalization, but onlydo this when the to-be-imported name contains non-ASCIIcharacters.

ronaldoussoren added DO-NOT-MERGE topic-importlib labels

Dec 24, 2023

bedevere-appbot mentioned this pull request

Dec 24, 2023

Modules with decomposable characters in module name not found on macOS#84013

Open

Copy link

ContributorAuthor

ronaldoussoren commentedDec 24, 2023

The PR is a draft primarily because the tests are incomplete:

No tests using PYTHONCASEOK
No new unit tests for the FileFinder class

Slightly cleaner code

7f4b01b

ronaldoussoren marked this pull request as ready for review

December 24, 2023 09:43

ronaldoussoren requested review frombrettcannon,ericsnowcurrently,ncoghlan andwarsaw ascode owners

December 24, 2023 09:43

bedevere-appbot added the awaiting core review label

Dec 24, 2023

ronaldoussoren added2 commits

December 25, 2023 10:57

Argh... Previous cleanup was a bit too quick...

6ea9497

Actually fix the "cleaned up" version

c216e08

Copy link

ContributorAuthor

ronaldoussoren commentedDec 27, 2023•
edited
Loading

For some reason this PR doesn't work on any system but macOS, at least as far as the tests are concerned. I'll see if I can find the problem for Linux, I don't have a system where I can develop on Windows.

This has been resolved by keeping the unnormalised name around because this is needed to match filesystem behaviour. Worked on macOS because the kernel will normalise names there.

ericsnowcurrently reviewed

Dec 28, 2023

View reviewed changes

Copy link

Member

ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I have some concerns.

Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Lib/importlib/_bootstrap_external.py OutdatedShow resolvedHide resolved

Lib/test/test_import/__init__.py OutdatedShow resolvedHide resolved

Lib/importlib/_bootstrap_external.py OutdatedShow resolvedHide resolved

Lib/importlib/_bootstrap_external.py

Comment on lines +1622 to +1623

		if not tail_module.isascii() and not self._cache_is_normalized:
		self._normalize_cache()

Copy link

Member

ericsnowcurrentlyDec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

For the sake of__import__() andimportlib.import_module() we will probably need to either normalizecache_module or preserve the original name in the cache. The former sounds expensive so the latter might be the way to go.

Copy link

Member

ericsnowcurrentlyDec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Regardless, we should probably have some test cases that use one of those functions rather than syntax.

Copy link

ContributorAuthor

ronaldoussorenDec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I changed the code to keep the unnormalised name around, we need the unnormalised name to access the filesystem and need the normalised name to compare totail_module (which will generally be normalised by the compiler).

I agree that there should be more test cases, I started out with just the minimal test from the issue to see if I can get this to work without bootstrapping issues and get some feedback. I'll definitely add more tests when this approach is seen as acceptable by@brettcannon (as the importlib expert).

Copy link

Member

brettcannonFeb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If the cache always keeps the raw name then can we drop_cache_is_normalized?

Lib/importlib/_bootstrap_external.py OutdatedShow resolvedHide resolved

Copy link

Member

ericsnowcurrently commentedDec 28, 2023

Also, I meant to say that I think you're on the right track.

Copy link

ContributorAuthor

ronaldoussoren commentedDec 28, 2023

I have some concerns.
Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?

Good question, whatever we choose should be documented.

The names passed to__import__ may not be normalised if called by a user, so either the implementation of that function needs to normalise its arguments, or finders would have to do this. From a usability perspective it's probably better to normalise names__import__ and document this in the importlib documentation for finders.

Copy link

Member

ericsnowcurrently commentedDec 28, 2023

whatever we choose should be documented.

FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC@brettcannon @ncoghlan @warsaw @methane @vstinner

From a usability perspective it's probably better to normalise names__import__ and document this in the importlib documentation for finders.

Note that the bytecode for an import statement calls__import__() only if it is user-defined. Otherwise, when__import__() is the default, the bytecode callsPyImport_ImportModuleLevelObject() directly.¹ This does present an opportunity to normalize in__import__() without double-normalizing for import statements. Either way, though, whatever we document should clearly indicate which builtin/stdlib functions normalize and which don't.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

Footnotes

This is an optimization. IIRC, there was a period of time where we always just called__import__(). Also, for a little while the default implementation of__import__() wasimportlib.__import__(). Nowimportlib.__import__() is never involved (in CPython at least) unless called directly by a user.↩

Copy link

Member

ericsnowcurrently commentedDec 28, 2023

FWIW, one solution that crossed my mind is to modify the parser, since that is where the PEP 3131 normalization takes place, AKA the root of the problem. The parser could preserve the original unnormalized identifier and we'd use it where appropriate. For instance, the import statement bytecode could use the original name, while the normalized identifier would be used as the bound name. The discrepancy between the bound name and thesys.modules key (and module__name__) might be a problem though.

Relatedly, it may be worth adding an optional arg to__import__() for the unnormalized module name. If not provided, it would default to callingunicodedata.normalize('NFKC, name). If the parser preserved the original name then that could be passed as the extra arg. Hopefully we don't need to add a new arg toimport()` though.

Copy link

ContributorAuthor

ronaldoussoren commentedDec 28, 2023

whatever we choose should be documented.
FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted.

The language reference already says all identifiers are normalised to NFKC, seehttps://docs.python.org/3/reference/lexical_analysis.html#identifiers. The PEP, and more importantly, the import system specification (PEP 451 and its predecessors) failed to fully think through the consequences of this when loading modules from the filesystem. That's not really anyones fault, Unicode and in particular Unicode normalisation are hard.

Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC@brettcannon @ncoghlan @warsaw @methane @vstinner

Yes please ;-). I intend to work on fixes the issues with this PR on Linux (and that will likely fix the issues on Windows as well), but beyond that I'd like some advise on the likelihood that the change will be accepted.

Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common.

I guess that not many people write modules with non-ascii names, and those that do quickly learn that this doesn't always work. The forks that want to use non-ascii names are also more likely to be less familiar with English.

P.S. I do have to warn that I'll likely be less responsive after new year. I had a lot time of from work in December, but expect that I'll be distracted by $WORK in January. I'll continue working on this, but slower.

Change of implementation

3f16ae1

The finder cache is now a mapping from (normalized) namesto actual filesystem names. With some luck this fixes testfailures on Linux and Windows.

Copy link

ContributorAuthor

ronaldoussoren commentedDec 29, 2023•
edited
Loading

There are four ways that names get into import lib:

import statements without overriding__import__: The parser will normalise identifiers, the finder will see a normalised name
import statement with overriding__import__: same as 1 or 3 depending on whether or note__import__ changes the name to be imported
Direct calling of__import__ orimport lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.
Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

It's probably best to change the implementation of__import__ andimport_module to normalise the name and document that the name passed to finders must be normalised.

The impact of the former should be limited because as@ericsnowcurrently notes these functions aren't used in normal import code paths.

Requiring that names passed to finders must be normalized is technically a backward incompatible change, although in practice finders already mostly see normalised names. Normalising names in finders can have a performance impact due to unnecessary normalising of names, although it is unclear to me how significant that impact would be given that the number of non-ASCII module names is likely very low.

UPDATE: In the current version of the PR__import__ andimport_module normalise and the documentation for loaders was updated to mention that the names must be normalized by the caller.

ronaldoussorenand others added4 commits

December 29, 2023 12:57

Revert accidental reordering

80b6eec

Update Lib/test/test_import/__init__.py

b9e8927

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>

Address review comment

7d2d2bf

Some changes

18d99ae

1. Add a NEWS entry2. __import__ and find_module will now normalize their   arguments as needed3. Update the importlib documentation to note that the   name passed to finders must be normalized.

ronaldoussoren commented

Dec 29, 2023

View reviewed changes

Lib/importlib/_bootstrap_external.py

		@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
		is_namespace = _path_isdir(base_path)
		# Check for a file w/ a proper suffix exists.
		for suffix, loader_class in self._loaders:
		# XXX: Why is ValueError caught here?

Copy link

ContributorAuthor

ronaldoussorenDec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This change worries me a little because I don't understand why the original code catchesValueError and then bails out.

Removing this like I did here does not result in test failures.

Copy link

Member

brettcannonFeb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ifgit blame isn't helpful then it can probably be removed.

Copy link

Member

serhiy-storchakaMay 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

ronaldoussoren commented

Dec 29, 2023

View reviewed changes

Misc/NEWS.d/next/Library/2023-12-29-13-09-44.gh-issue-84013.rUaxkr.rst

Comment on lines +1 to +3

		It is now possible to import modules from the filesystem regardless of how
		the name is normalized in the filesystem. This in particular affects
		importing modules with a name that contains accented latin characters.

Copy link

ContributorAuthor

ronaldoussorenDec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This probably also needs a What's New entry, lets worry about that when the PR is closer to acceptance to avoid creating merge conflicts.

Copy link

ContributorAuthor

ronaldoussoren commentedDec 29, 2023•
edited
Loading

Did I mention Unicode is hard?

In NFKC normalization there are some non-ASCII code points that normalise to code points that are ASCII, such as "ﬀ" (the ligature for "ff"). This leads to some "interesting" behaviour with my PR.

Given a file named "fluﬀy.py" (with ligature) you currently get anImportError when trying to import "fluffy". With my patch the import will not raise anImportError when there has first been an import for some non-ASCII name that was presented to the finder for the directory containing "fluﬀy.py".

importrandomifrandom.choice([True,False]):try:importcaféexceptImportError:passimportfluffy

This will work 50% of the time and raiseImportError the other 50%. This will AFAIK only affect names in latin script, other languages won't normalise to ASCII strings from my limited understanding of normalization rules.

Another way this PR can lead to unexpected behaviour changes is to have two source files with different normalisations, one that happens to be NFKC normalized and one not (for example the "Mädchen.py" files I used intest_import.py). Currently the NFKC normalized file is picked up, with this PR the other might get picked up depending on the order of in whichlistdir iterates over the directory.

In practice I don't expect these to be a problem:

The easiest way to enter accented characters in NFKC normalized values (that islen("ü") == 1) at least on macOS, which means you'll have to go the extra mile to get a file name that isn't normalized NFKC on Linux
MacOS normalises names to NFD in the filesystem, and won't allow creating two files whose names only differs by their normalization (tested on an APFS filesystem, the older HFS+ should be the same).

What I don't know if there are realistic failure modes for the second scenario with non-latin scripts.

Copy link

Member

ericsnowcurrently commentedDec 29, 2023

There are four ways that names get into import lib:

Yeah, I think you've captured that well.

Ideally, the mechanism for auto-normalization would baked in to the core import machinery itself, such that we didn't have to modify specific user-facing functions or require any extra attention from authors/maintainers of import hooks. However, I'm not sure that is something we could do and without impacting the performance of import statements. (I could be wrong though.)

import statements without overriding__import__: The parser will normalise identifiers, the finder will see a normalised name
import statement with overriding__import__: same as 1 or 3 depending on whether or note__import__ changes the name to be imported

The function is still called with the parser-normalized name though, right? So in this case the non-default__import__() still wouldn't need to NKFC-normalize the name. However, itwould need to accommodate any adjustments we made if the original, unnormalized name were preserved by the parser and passed through. Then again, a non-default__import__() would still need to normalize the name if called directly (which is definitely messy).

Direct calling of__import__ orimport lib.import_module: All bets are off, but name likely is not normalised when using Latin script with accents because the default way to enter these on common platform is not NFKD normalised. The two functions mentioned can normalise the name with little to no impact on user code.

FWIW,importlib.__import__() should probably be lumped in withimportlib.import_module(). Almost any change to one should match the other. The only caveat is thatimportlib.__import__() may be set asbuiltins.__import__(). It might even be the default one for some Python implementations. That was certainly the design goal (and was what we did in CPython for a release or two, IIRC).

importlib.util.find_spec() should be treated the same asimportlib.import_module() as well, but without the caveat.

Direct invocation of import lib finders: This either requires documenting that the name must be normalised, or changes to all finders (including those outside of the stdlib)

This applies to user-defined finders, both metapath finders (sys.meta_path) and path finders (viasys.path_hooks &sys.path_importer_cache()).

This also brings up perhaps a fifth group: user-defined finders. Or is that what you meant with (4)?

It's probably best to change the implementation of__import__ andimport_module to normalise the name and document that the name passed to finders must be normalised.

That seems reasonable. I don't expect that they are used so widely or intensely that it would have a performance impact.

When modifying these functions, we do have to be careful about how/where we do it. We don't want to impactPyImport_ImportModuleLevelObject(), which is what the import statement uses directly by default.PyImport_ImportModuleLevelObject(), in turn, callsimportlib._bootstrap._find_and_load(), so we need to be careful with that.

For__import__() we'd want to make any changes strictly inbuiltin___import___impl() (Python/bltinmodule.c). Forimportlib.import_module() (and thusimportlib.util.find_spec() andmaybeimportlib.__import__()) we could modify each function directly (or perhaps inimportlib._bootstrap._gcd_import() but probably not). Again, we just have to remember to avoidimportlib._bootstrap._find_and_load(). (Forimportlib.util.find_spec() we should also avoid touchingimportlib._bootstrap._find_spec().)

Copy link

Member

ericsnowcurrently commentedDec 29, 2023

Let's step back for a moment to make sure we're thinking about the right things. IIUC, at the end of the day, the key objective here boils down toFileFinder correctly handling the following situations:

module name is non-ascii, NFKC-normal (via PEP 3131)
- filesystem NFKC-normalizes filenames (i.e. the_path_cache situation you've addressed in this PR)
- filesystem does not NFKC-normalize filenames and the corresponding filename isnot normalized
module name is non-ascii but not NFKC-normal (viaimportlib.import_module(), etc.)
- filesystem NFKC-normalizes filenames
- filesystem does not NFKC-normalize filenames but the corresponding filenameis normalized

Is that right?

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled byFileFinder.

(It's also notable that we accommodate the matter of case-sensitivity inFileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely onFileFinder.)

brettcannon reviewed

Feb 23, 2024

View reviewed changes

Lib/importlib/_bootstrap.py OutdatedShow resolvedHide resolved

Lib/importlib/_bootstrap.py Outdated

		""" Normalize 'name' in NKFC form """
		global _unicodedata_normalize
		if _unicodedata_normalize is None:
		from unicodedata import _normalize

Copy link

Member

brettcannonFeb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	fromunicodedataimport_normalize
	fromunicodedataimport_normalizeas_unicodedata_normalize

Copy link

Member

brettcannonFeb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I get an import error for_normalize but notnormalize in the REPL. Is that expected?

Lib/importlib/_bootstrap.py OutdatedShow resolvedHide resolved

Lib/importlib/_bootstrap_external.py

Comment on lines +1622 to +1623

		if not tail_module.isascii() and not self._cache_is_normalized:
		self._normalize_cache()

Copy link

Member

brettcannonFeb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If the cache always keeps the raw name then can we drop_cache_is_normalized?

Lib/importlib/_bootstrap_external.py

		@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
		is_namespace = _path_isdir(base_path)
		# Check for a file w/ a proper suffix exists.
		for suffix, loader_class in self._loaders:
		# XXX: Why is ValueError caught here?

Copy link

Member

brettcannonFeb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ifgit blame isn't helpful then it can probably be removed.

Lib/importlib/_bootstrap_external.py OutdatedShow resolvedHide resolved

Copy link

Member

brettcannon commentedFeb 23, 2024

This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled byFileFinder.
(It's also notable that we accommodate the matter of case-sensitivity inFileFinder rather than solving it in the parser/compiler or in the broader import machinery. Thus there's precedent for handling this sort of problem generally, i.e. user-defined finders each have to make the same accommodations or rely onFileFinder.)

I think I agree w/ Eric that this feels like a problem similar to case-sensitivity. As long as the comparison needs are just at the file system boundary and not e.g.sys.modules then I think we have prior art w/ case-sensitivity.

serhiy-storchaka reviewed

May 21, 2025

View reviewed changes

Lib/importlib/_bootstrap_external.py

		@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None):
		is_namespace = _path_isdir(base_path)
		# Check for a file w/ a proper suffix exists.
		for suffix, loader_class in self._loaders:
		# XXX: Why is ValueError caught here?

Copy link

Member

serhiy-storchakaMay 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?

Lib/importlib/_bootstrap.py OutdatedShow resolvedHide resolved

serhiy-storchakaand others added2 commits

May 21, 2025 14:32

Apply suggestions from code review

985e50a

Co-authored-by: Brett Cannon <brett@python.org>

Merge branch 'main' intopythongh-84013

c13edc8

Labels

awaiting core review DO-NOT-MERGE topic-importlib

4 participants

Movatterモバイル変換

Uh oh!

gh-84013: normalize directory contents during import#113447

Are you sure you want to change the base?

gh-84013: normalize directory contents during import#113447

Uh oh!

Conversation

ronaldoussoren commentedDec 24, 2023• edited by bedevere-appbotLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ronaldoussoren commentedDec 24, 2023

Uh oh!

ronaldoussoren commentedDec 27, 2023• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ericsnowcurrently left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ericsnowcurrently commentedDec 28, 2023

Uh oh!

ronaldoussoren commentedDec 28, 2023

Uh oh!

ericsnowcurrently commentedDec 28, 2023

Footnotes

Uh oh!

ericsnowcurrently commentedDec 28, 2023

Uh oh!

ronaldoussoren commentedDec 28, 2023

Uh oh!

ronaldoussoren commentedDec 29, 2023• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ronaldoussoren commentedDec 29, 2023• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ericsnowcurrently commentedDec 29, 2023

Uh oh!

ericsnowcurrently commentedDec 29, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brettcannon commentedFeb 23, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ronaldoussoren commentedDec 24, 2023•
edited by bedevere-appbot
Loading

ronaldoussoren commentedDec 27, 2023•
edited
Loading

ronaldoussoren commentedDec 29, 2023•
edited
Loading

ronaldoussoren commentedDec 29, 2023•
edited
Loading