Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32k
gh-84013: normalize directory contents during import#113447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
Identifiers in Python code are normalized to NFKC andthere is no guarantee that names in the file system arenormalized.Teach the FileFinder class about normalization, but onlydo this when the to-be-imported name contains non-ASCIIcharacters.
The PR is a draft primarily because the tests are incomplete:
|
ronaldoussoren commentedDec 27, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
This has been resolved by keeping the unnormalised name around because this is needed to match filesystem behaviour. Worked on macOS because the kernel will normalise names there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I have some concerns.
Also, should we document that finders (both metapath-based and path-based) must deal with un-normalized names, or can we do that for all finders internally? Instead, should we state in the language spec that all module names passing through the import system are guaranteed to be NFKC-normalized?
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
if not tail_module.isascii() and not self._cache_is_normalized: | ||
self._normalize_cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
For the sake of__import__()
andimportlib.import_module()
we will probably need to either normalizecache_module
or preserve the original name in the cache. The former sounds expensive so the latter might be the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Regardless, we should probably have some test cases that use one of those functions rather than syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I changed the code to keep the unnormalised name around, we need the unnormalised name to access the filesystem and need the normalised name to compare totail_module
(which will generally be normalised by the compiler).
I agree that there should be more test cases, I started out with just the minimal test from the issue to see if I can get this to work without bootstrapping issues and get some feedback. I'll definitely add more tests when this approach is seen as acceptable by@brettcannon (as the importlib expert).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If the cache always keeps the raw name then can we drop_cache_is_normalized
?
Uh oh!
There was an error while loading.Please reload this page.
Also, I meant to say that I think you're on the right track. |
Good question, whatever we choose should be documented. The names passed to |
FWIW, the eventual outcome is a direct consequence of (unintentional?) side effect of PEP 3131. I would not expect such a change to the language reference, if that's what's needed, to require a PEP. However, discussion on discuss.python.org may be warranted. Regardless, it would be good to have feedback from other import and encoding experts before we go much farther down any particular path, even if the actual patch is relatively small. CC@brettcannon@ncoghlan@warsaw@methane@vstinner
Note that the bytecode for an import statement calls Ideally the import machinery would take care of it in every case and no one would have to think about it, but I'm not sure how feasible that would be. For all its other benefits, the effect of PEP 3131 really is a pain here, isn't it? 😝 Then again, I suppose that the import machinery still not having been fixed after 16 years implies that the problematic corner cases aren't all that common. Footnotes
|
FWIW, one solution that crossed my mind is to modify the parser, since that is where the PEP 3131 normalization takes place, AKA the root of the problem. The parser could preserve the original unnormalized identifier and we'd use it where appropriate. For instance, the import statement bytecode could use the original name, while the normalized identifier would be used as the bound name. The discrepancy between the bound name and the Relatedly, it may be worth adding an optional arg to |
The language reference already says all identifiers are normalised to NFKC, seehttps://docs.python.org/3/reference/lexical_analysis.html#identifiers. The PEP, and more importantly, the import system specification (PEP 451 and its predecessors) failed to fully think through the consequences of this when loading modules from the filesystem. That's not really anyones fault, Unicode and in particular Unicode normalisation are hard.
Yes please ;-). I intend to work on fixes the issues with this PR on Linux (and that will likely fix the issues on Windows as well), but beyond that I'd like some advise on the likelihood that the change will be accepted.
I guess that not many people write modules with non-ascii names, and those that do quickly learn that this doesn't always work. The forks that want to use non-ascii names are also more likely to be less familiar with English. P.S. I do have to warn that I'll likely be less responsive after new year. I had a lot time of from work in December, but expect that I'll be distracted by $WORK in January. I'll continue working on this, but slower. |
The finder cache is now a mapping from (normalized) namesto actual filesystem names. With some luck this fixes testfailures on Linux and Windows.
ronaldoussoren commentedDec 29, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There are four ways that names get into import lib:
It's probably best to change the implementation of The impact of the former should be limited because as@ericsnowcurrently notes these functions aren't used in normal import code paths. Requiring that names passed to finders must be normalized is technically a backward incompatible change, although in practice finders already mostly see normalised names. Normalising names in finders can have a performance impact due to unnecessary normalising of names, although it is unclear to me how significant that impact would be given that the number of non-ASCII module names is likely very low. UPDATE: In the current version of the PR |
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
1. Add a NEWS entry2. __import__ and find_module will now normalize their arguments as needed3. Update the importlib documentation to note that the name passed to finders must be normalized.
@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None): | |||
is_namespace = _path_isdir(base_path) | |||
# Check for a file w/ a proper suffix exists. | |||
for suffix, loader_class in self._loaders: | |||
# XXX: Why is ValueError caught here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This change worries me a little because I don't understand why the original code catchesValueError
and then bails out.
Removing this like I did here does not result in test failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ifgit blame
isn't helpful then it can probably be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?
It is now possible to import modules from the filesystem regardless of how | ||
the name is normalized in the filesystem. This in particular affects | ||
importing modules with a name that contains accented latin characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This probably also needs a What's New entry, lets worry about that when the PR is closer to acceptance to avoid creating merge conflicts.
ronaldoussoren commentedDec 29, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Did I mention Unicode is hard? In NFKC normalization there are some non-ASCII code points that normalise to code points that are ASCII, such as "ff" (the ligature for "ff"). This leads to some "interesting" behaviour with my PR. Given a file named "fluffy.py" (with ligature) you currently get an importrandomifrandom.choice([True,False]):try:importcaféexceptImportError:passimportfluffy This will work 50% of the time and raise Another way this PR can lead to unexpected behaviour changes is to have two source files with different normalisations, one that happens to be NFKC normalized and one not (for example the "Mädchen.py" files I used in In practice I don't expect these to be a problem:
What I don't know if there are realistic failure modes for the second scenario with non-latin scripts. |
Yeah, I think you've captured that well. Ideally, the mechanism for auto-normalization would baked in to the core import machinery itself, such that we didn't have to modify specific user-facing functions or require any extra attention from authors/maintainers of import hooks. However, I'm not sure that is something we could do and without impacting the performance of import statements. (I could be wrong though.)
The function is still called with the parser-normalized name though, right? So in this case the non-default
FWIW,
This applies to user-defined finders, both metapath finders ( This also brings up perhaps a fifth group: user-defined finders. Or is that what you meant with (4)?
That seems reasonable. I don't expect that they are used so widely or intensely that it would have a performance impact. When modifying these functions, we do have to be careful about how/where we do it. We don't want to impact For |
Let's step back for a moment to make sure we're thinking about the right things. IIUC, at the end of the day, the key objective here boils down to
Is that right? This looks, at least partially, like the way we have to deal with case-insensitive filesystems. We should make sure we take some cues from how that is handled by (It's also notable that we accommodate the matter of case-sensitivity in |
Uh oh!
There was an error while loading.Please reload this page.
Lib/importlib/_bootstrap.py Outdated
""" Normalize 'name' in NKFC form """ | ||
global _unicodedata_normalize | ||
if _unicodedata_normalize is None: | ||
from unicodedata import _normalize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
fromunicodedataimport_normalize | |
fromunicodedataimport_normalizeas_unicodedata_normalize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I get an import error for_normalize
but notnormalize
in the REPL. Is that expected?
Uh oh!
There was an error while loading.Please reload this page.
if not tail_module.isascii() and not self._cache_is_normalized: | ||
self._normalize_cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If the cache always keeps the raw name then can we drop_cache_is_normalized
?
@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None): | |||
is_namespace = _path_isdir(base_path) | |||
# Check for a file w/ a proper suffix exists. | |||
for suffix, loader_class in self._loaders: | |||
# XXX: Why is ValueError caught here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ifgit blame
isn't helpful then it can probably be removed.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
I think I agree w/ Eric that this feels like a problem similar to case-sensitivity. As long as the comparison needs are just at the file system boundary and not e.g. |
@@ -1638,15 +1645,21 @@ def find_spec(self, fullname, target=None): | |||
is_namespace = _path_isdir(base_path) | |||
# Check for a file w/ a proper suffix exists. | |||
for suffix, loader_class in self._loaders: | |||
# XXX: Why is ValueError caught here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
UnicodeEncodeError if the name is not compatible with the filesystem encoding, ValueError for embedded null characters?
Uh oh!
There was an error while loading.Please reload this page.
Co-authored-by: Brett Cannon <brett@python.org>
Uh oh!
There was an error while loading.Please reload this page.
Identifiers in Python code are normalized to NFKC and there is no guarantee that names in the file system are normalized.
Teach the FileFinder class about normalization, but only do this when the to-be-imported name contains non-ASCII characters.