[Python-Dev] requirements for moving __import__ over to importlib?

PJ Eby pje at telecommunity.com
Thu Feb 9 23:00:04 CET 2012


On Thu, Feb 9, 2012 at 2:53 PM, Mike Meyer <mwm at mired.org> wrote:

> For those of you not watching -ideas, or ignoring the "Python TIOBE
> -3%" discussion, this would seem to be relevant to any discussion of
> reworking the import mechanism:
>
> http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html

Interesting. This gives me an idea for a way to cut stat calls per sys.path entry per import by roughly 4x, at the cost of a one-time directory read per sys.path entry.

That is, an importer created for a particular directory could, upon first use, cache a frozenset(listdir()), and the stat().st_mtime of the directory. All the filename checks could then be performed against the frozenset, and the st_mtime of the directory only checked once per import, to verify whether the frozenset() needed refreshing.

Since a failed module lookup takes at least 5 stat checks (pyc, pyo, py, directory, and compiled extension (pyd/so)), this cuts it down to only 1, at the price of a listdir(). The big question is how long does a listdir() take, compared to a stat() or failed open()? That would tell us whether the tradeoff is worth making.

I did some crude timeit tests on frozenset(listdir()) and trapping failed stat calls. It looks like, for a Windows directory the size of the 2.7 stdlib, you need about four *failed* import attempts to overcome the initial caching cost, or about 8 successful bytecode imports. (For Linux, you might need to double these numbers; my tests showed a different ratio there, perhaps due to the Linux stdlib I tested having nearly twice as many directory entries as the directory I tested on Windows!)

However, the numbers are much better for application directories than for the stdlib, since they are located earlier on sys.path.
Every successful stdlib import in an application is equal to one failed import attempt for every preceding directory on sys.path, so as long as the average directory on sys.path isn't vastly larger than the stdlib, and the average application imports at least four modules from the stdlib (on Windows, or 8 on Linux), there would be a net performance gain for the application as a whole. (That is, there'd be an improved per-sys.path entry import time for stdlib modules, even if not for any application modules.)

For smaller directories, the tradeoff actually gets better. A directory one seventh the size of the 2.7 Windows stdlib has a listdir() that's proportionately faster, but failed stats() in that directory are *not* proportionately faster; they're only somewhat faster. This means that it takes fewer failed module lookups to make caching a win - about 2 in this case, vs. 4 for the stdlib.

Now, these numbers are with actual disk or network access abstracted away, because the data's in the operating system cache when I run the tests. It's possible that this strategy could backfire if you used, say, an NFS directory with ten thousand files in it as your first sys.path entry. Without knowing the timings for listdir/stat/failed stat in that setup, it's hard to say how many stdlib imports you need before you come out ahead. When I tried a directory about 7 times larger than the stdlib, creating the frozenset took 10 times as long, but the cost of a failed stat didn't go up by very much.

This suggests that there's probably an optimal directory size cutoff for this trick; if only there were some way to check the size of a directory without reading it, we could turn off the caching for oversize directories, and get a major speed boost for everything else. On most platforms, the stat().st_size of the directory itself will give you some idea, but on Windows that's always zero.
On Windows, we could work around that by using a lower-level API than listdir() and simply stop reading the directory if we hit the maximum number of entries we're willing to build a cache for, and then call it off.

(Another possibility would be to explicitly enable caching by putting a flag file in the directory, or perhaps by putting a special prefix on the sys.path entry, setting the cutoff in an environment variable, etc.)

In any case, this seems really worth a closer look: in non-pathological cases, it could make directory-based importing as fast as zip imports are. I'd be especially interested in knowing how the listdir/stat/failed stat ratios work on NFS - ISTM that they might be even *more* conducive to this approach, if setup latency dominates the cost of individual system calls.

If this works out, it'd be a good example of why importlib is a good idea; i.e., allowing us to play with ideas like this. Brett, wouldn't you love to be able to say importlib is *faster* than the old C-based importing? ;-)
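
For readers who want to see the shape of the caching idea described in the message, here is a minimal sketch. It is not code from the post or from importlib: the class name CachedDirectory, the method candidates(), and the SUFFIXES tuple are all hypothetical names chosen for illustration. It assumes the directory's st_mtime changes whenever an entry is added or removed, which may not hold on filesystems with coarse timestamp resolution.

```python
import os


class CachedDirectory:
    """Hypothetical helper: a directory listing cached against the directory's mtime."""

    # Suffixes named in the message; a real importer would ask the interpreter.
    SUFFIXES = (".py", ".pyc", ".pyo", ".pyd", ".so")

    def __init__(self, path):
        self.path = path
        self._mtime = None          # mtime seen when the listing was taken
        self._entries = frozenset()

    def _refresh_if_stale(self):
        # The single stat() per lookup. It is taken *before* listdir() so that
        # a change racing with the read leaves an older mtime in the cache and
        # is picked up on the next lookup.
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            self._entries = frozenset(os.listdir(self.path))
            self._mtime = mtime

    def candidates(self, modname):
        """Return the sorted entries that could satisfy an import of `modname`."""
        self._refresh_if_stale()
        names = [modname] + [modname + suffix for suffix in self.SUFFIXES]
        # Pure set membership: no per-filename stat() calls.
        return sorted(name for name in names if name in self._entries)
```

A failed lookup here costs one stat() of the directory plus a handful of set membership tests. A bare `modname` hit only says that an entry with that name exists; confirming that it is a package directory would still take one more stat(). The size cutoff discussed above is not modelled in this sketch.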


