Python Enhancement Proposals

PEP 471 – os.scandir() function – a better and faster directory iterator

Author:
Ben Hoyt <benhoyt at gmail.com>
BDFL-Delegate:
Victor Stinner <vstinner at python.org>
Status:
Final
Type:
Standards Track
Created:
30-May-2014
Python-Version:
3.5
Post-History:
27-Jun-2014, 08-Jul-2014, 14-Jul-2014

Abstract

This PEP proposes including a new directory iteration function, os.scandir(), in the standard library. This new function adds useful functionality and increases the speed of os.walk() by 2-20 times (depending on the platform and file system) by avoiding calls to os.stat() in most cases.

Rationale

Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling os.listdir() on each directory – it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls – FindFirstFile / FindNextFile on Windows and readdir on POSIX systems – already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it’s often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we’re not talking about micro-optimizations. See more benchmarks here.

Somewhat relatedly, many people (see Python Issue 11406) are also keen on a version of os.listdir() that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.
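To make the memory point concrete, here is a minimal sketch of stopping early on a lazy iterator. The helper name first_names is hypothetical, and the `with` form relies on the context-manager support that os.scandir() gained in Python 3.6, which is not part of this PEP.

```python
import itertools
import os


def first_names(path, limit):
    """Return at most `limit` entry names from `path` (hypothetical helper)."""
    # The context manager closes the directory handle even though the
    # iterator is not exhausted (Python 3.6+; not part of this PEP).
    with os.scandir(path) as entries:
        # islice() pulls only `limit` entries; the rest are never read.
        return [entry.name for entry in itertools.islice(entries, limit)]
```

With os.listdir() the whole directory would be read into a list before the first name could be inspected.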

So, as well as providing a scandir() iterator function for calling directly, Python’s existing os.walk() function can be sped up a huge amount.

Implementation

The implementation of this proposal was written by Ben Hoyt (initial version) and Tim Golden (who helped a lot with the C extension module). It lives on GitHub at benhoyt/scandir. (The implementation may lag behind the updates to this PEP a little.)

Note that this module has been used and tested (see “Use in the wild” section in this PEP), so it’s more than a proof-of-concept. However, it is marked as beta software and is not extensively battle-tested. It will need some cleanup and more thorough testing before going into the standard library, as well as integration into posixmodule.c.

Specifics of proposal

os.scandir()

Specifically, this PEP proposes adding a single function to the os module in the standard library, scandir, that takes a single, optional string as its argument:

    scandir(path='.') -> generator of DirEntry objects

Like listdir, scandir calls the operating system’s directory iteration system calls to get the names of the files in the given path, but it’s different from listdir in two ways:

  • Instead of returning bare filename strings, it returns lightweight DirEntry objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned.
  • It returns a generator instead of a list, so that scandir acts as a true iterator instead of returning the full list immediately.

scandir() yields a DirEntry object for each file and sub-directory in path. Just like listdir, the '.' and '..' pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each DirEntry object has the following attributes and methods:

  • name: the entry’s filename, relative to the scandir path argument (corresponds to the return values of os.listdir)
  • path: the entry’s full path name (not necessarily an absolute path) – the equivalent of os.path.join(scandir_path, entry.name)
  • inode(): return the inode number of the entry. The result is cached on the DirEntry object; use os.stat(entry.path, follow_symlinks=False).st_ino to fetch up-to-date information. On Unix, no system call is required.
  • is_dir(*, follow_symlinks=True): similar to pathlib.Path.is_dir(), but the return value is cached on the DirEntry object; doesn’t require a system call in most cases; does not follow symbolic links if follow_symlinks is False
  • is_file(*, follow_symlinks=True): similar to pathlib.Path.is_file(), but the return value is cached on the DirEntry object; doesn’t require a system call in most cases; does not follow symbolic links if follow_symlinks is False
  • is_symlink(): similar to pathlib.Path.is_symlink(), but the return value is cached on the DirEntry object; doesn’t require a system call in most cases
  • stat(*, follow_symlinks=True): like os.stat(), but the return value is cached on the DirEntry object; does not require a system call on Windows (except for symlinks); does not follow symbolic links (like os.lstat()) if follow_symlinks is False

All methods may perform system calls in some cases and therefore possibly raise OSError – see the “Notes on exception handling” section for more details.
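As an illustration of the attributes and methods listed above, the following sketch collects them into one dictionary per entry. The function name describe and the dictionary keys are hypothetical and chosen only for this example.

```python
import os


def describe(path='.'):
    """Return one summary dict per entry in `path` (illustrative helper)."""
    rows = []
    for entry in os.scandir(path):
        rows.append({
            'name': entry.name,          # same value os.listdir() would give
            'path': entry.path,          # os.path.join(path, entry.name)
            'inode': entry.inode(),
            'is_dir': entry.is_dir(),    # follows symlinks by default
            'is_file': entry.is_file(follow_symlinks=False),
            'is_symlink': entry.is_symlink(),
            'size': entry.stat(follow_symlinks=False).st_size,
        })
    return rows
```

Any of the method calls may raise OSError, so production code would wrap them as described in “Notes on exception handling”.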

The DirEntry attribute and method names were chosen to be the same as those in the new pathlib module where possible, for consistency. The only difference in functionality is that the DirEntry methods cache their values on the entry object after the first call.

Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3).
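A small sketch of the type rule described above: the type of DirEntry.name follows the type of the path argument. The helper entry_name_types is a hypothetical name used only here.

```python
import os


def entry_name_types(path):
    """Return the set of types seen in DirEntry.name for `path`."""
    # A str path yields str names; a bytes path yields bytes names.
    return {type(entry.name) for entry in os.scandir(path)}
```

For a non-empty directory, passing a str path should give {str}, and passing the same path through os.fsencode() should give {bytes}.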

os.walk()

As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir(). This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).
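The following is a simplified sketch of how a top-down walk can be built on scandir(). It is not the standard library implementation: the name simple_walk is hypothetical, and the topdown, onerror and followlinks parameters of os.walk() are left out.

```python
import os


def simple_walk(top):
    """Yield (dirpath, dirnames, filenames) tuples, top-down.

    Simplified sketch: no topdown, onerror or followlinks parameters, and
    pruning dirnames in place has no effect on the traversal.
    """
    try:
        entries = list(os.scandir(top))
    except OSError:
        return  # unreadable directory: skip it silently

    dirnames, filenames, to_visit = [], [], []
    for entry in entries:
        try:
            is_dir = entry.is_dir()  # usually answered without a stat() call
        except OSError:
            is_dir = False  # treat entries we cannot inspect as files
        if is_dir:
            dirnames.append(entry.name)
            # List symlinked directories but do not descend into them.
            if not entry.is_symlink():
                to_visit.append(entry.path)
        else:
            filenames.append(entry.name)

    yield top, dirnames, filenames
    for path in to_visit:
        yield from simple_walk(path)
```

The saving comes from the is_dir() call, which replaces the os.path.isdir() call that a listdir()-based walk makes for every name.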

Examples

First, a very simple example of scandir() showing use of the DirEntry.name attribute and the DirEntry.is_dir() method:

    import os

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This subdirs() function will be significantly faster with scandir than os.listdir() and os.path.isdir() on both Windows and POSIX systems, especially on medium-sized or large directories.
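For comparison, here is a sketch of the same function written with os.listdir() and os.path.isdir(). The name subdirs_listdir is hypothetical; the point is only to show where the extra system call comes from.

```python
import os


def subdirs_listdir(path):
    """Same result as subdirs(), using listdir() plus isdir()."""
    for name in os.listdir(path):
        # os.path.isdir() issues a stat() call for each name that reaches it,
        # which scandir() can usually avoid.
        if not name.startswith('.') and os.path.isdir(os.path.join(path, name)):
            yield name
```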

Or, for getting the total size of files in a directory tree, showing use of the DirEntry.stat() method and DirEntry.path attribute:

    import os

    def get_tree_size(path):
        """Return total size of files in given path and subdirs."""
        total = 0
        for entry in os.scandir(path):
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
        return total

This also shows the use of the follow_symlinks parameter to is_dir() – in a recursive function like this, we probably don’t want to follow links. (To properly follow links in a recursive function like this we’d want special handling for the case where following a symlink leads to a recursive loop.)
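One possible form of that special handling is sketched below: remember each directory already visited and skip it on a second encounter. This is an assumption about how one might do it, not something the PEP specifies; the function name and the use of (st_dev, st_ino) pairs as directory identities are choices made for this example.

```python
import os


def get_tree_size_following_links(path, _seen=None):
    """Total file size under `path`, following symlinks, visiting each
    directory at most once (hypothetical helper)."""
    if _seen is None:
        _seen = set()
    st = os.stat(path)                 # follows symlinks to the real directory
    key = (st.st_dev, st.st_ino)       # assumed to identify a directory
    if key in _seen:
        return 0                       # already counted: a loop or a repeat
    _seen.add(key)

    total = 0
    for entry in os.scandir(path):
        if entry.is_dir():             # follow_symlinks=True by default
            total += get_tree_size_following_links(entry.path, _seen)
        elif entry.is_file():          # broken symlinks are neither, so skipped
            total += entry.stat().st_size
    return total
```

Files reachable through more than one link are still counted once per path in this sketch; only directories are de-duplicated.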

Note that get_tree_size() will get a huge speed boost on Windows, because no extra stat calls are needed, but on POSIX systems the size information is not returned by the directory iteration functions, so this function won’t gain anything there.

Notes on caching

The DirEntry objects are relatively dumb – the name and path attributes are obviously always cached, and the is_X and stat methods cache their values (immediately on Windows via FindNextFile, and on first use on POSIX systems via a stat system call) and never refetch from the system.

For this reason, DirEntry objects are intended to be used and thrown away after iteration, not stored in long-lived data structures and the methods called again and again.

If developers want “refresh” behaviour (for example, for watching a file’s size change), they can simply use pathlib.Path objects, or call the regular os.stat() or os.path.getsize() functions which get fresh data from the operating system every call.
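The caching behaviour can be observed with a short experiment, sketched here under the assumption of a writable temporary directory. The variable names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'data.bin')
    with open(target, 'wb') as f:
        f.write(b'x' * 10)

    entry = list(os.scandir(tmp))[0]
    cached_size = entry.stat().st_size         # result is cached on the entry

    with open(target, 'ab') as f:
        f.write(b'y' * 5)                      # the file grows after the scan

    stale_size = entry.stat().st_size          # served from the DirEntry cache
    fresh_size = os.stat(entry.path).st_size   # asks the operating system again
```

Here stale_size still reports the size seen at the first stat() call, while fresh_size reflects the appended bytes.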

Notes on exception handling

DirEntry.is_X() and DirEntry.stat() are explicitly methods rather than attributes or properties, to make it clear that they may not be cheap operations (although they often are), and they may do a system call. As a result, these methods may raise OSError.

For example, DirEntry.stat() will always make a system call on POSIX-based systems, and the DirEntry.is_X() methods will make a stat() system call on such systems if readdir() does not support d_type or returns a d_type with a value of DT_UNKNOWN, which can occur under certain conditions or on certain file systems.

Often this does not matter – for example, os.walk() as defined in the standard library only catches errors around the listdir() calls.

Also, because the exception-raising behaviour of the DirEntry.is_X methods matches that of pathlib – which only raises OSError in the case of permissions or other fatal errors, but returns False if the path doesn’t exist or is a broken symlink – it’s often not necessary to catch errors around the is_X() calls.
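The broken-symlink case can be checked with a short sketch. It assumes a platform where the current user may create symlinks; the file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a symlink whose target does not exist (this may need extra
    # privileges on Windows).
    os.symlink(os.path.join(tmp, 'missing'), os.path.join(tmp, 'dangling'))
    entry = list(os.scandir(tmp))[0]

    is_link = entry.is_symlink()   # the entry itself is a symlink
    is_dir = entry.is_dir()        # target is missing: False, no exception
    is_file = entry.is_file()      # target is missing: False, no exception
```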

However, when a user requires fine-grained error handling, it may be desirable to catch OSError around all method calls and handle as appropriate.

For example, below is a version of the get_tree_size() example shown above, but with fine-grained error handling added:

    import os
    import sys

    def get_tree_size(path):
        """Return total size of files in path and subdirs. If
        is_dir() or stat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(path):
            try:
                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.path)
            else:
                try:
                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
                    print('Error calling stat():', error, file=sys.stderr)
        return total

Support

The scandir module on GitHub has been forked and used quite a bit (see “Use in the wild” in this PEP), but there’s also been a fair bit of direct support for a scandir-like function from core developers and others on the python-dev and python-ideas mailing lists. A sampling:

  • python-dev: a good number of +1’s and very few negatives for scandir and PEP 471 on this June 2014 python-dev thread
  • Alyssa Coghlan, a core Python developer: “I’ve had the local Red Hat release engineering team express their displeasure at having to stat every file in a network mounted directory tree for info that is present in the dirent structure, so a definite +1 to os.scandir from me, so long as it makes that info available.” [source1]
  • Tim Golden, a core Python developer, supports scandir enough to have spent time refactoring and significantly improving scandir’s C extension module. [source2]
  • Christian Heimes, a core Python developer: “+1 for something like yielddir()” [source3] and “Indeed! I’d like to see the feature in 3.4 so I can remove my own hack from our code base.” [source4]
  • Gregory P. Smith, a core Python developer: “As 3.4beta1 happens tonight, this isn’t going to make 3.4 so i’m bumping this to 3.5. I really like the proposed design outlined above.” [source5]
  • Guido van Rossum on the possibility of adding scandir to Python 3.5 (as it was too late for 3.4): “The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don’t want to add it to 3.4.” [source6]

Support for this PEP itself (meta-support?) was given by Alyssa (Nick) Coghlan on python-dev: “A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing.” [source7]

Use in the wild

To date, the scandir implementation is definitely useful, but has been clearly marked “beta”, so it’s uncertain how much use of it there is in the wild. Ben Hoyt has had several reports from people using it. For example:

  • Chris F: “I am processing some pretty large directories and was half expecting to have to modify getdents. So thanks for saving me the effort.” [via personal email]
  • bschollnick: “I wanted to let you know about this, since I am using Scandir as a building block for this code. Here’s a good example of scandir making a radical performance improvement over os.listdir.” [source8]
  • Avram L: “I’m testing our scandir for a project I’m working on. Seems pretty solid, so first thing, just want to say nice work!” [via personal email]
  • Matt Z: “I used scandir to dump the contents of a network dir in under 15 seconds. 13 root dirs, 60,000 files in the structure. This will replace some old VBA code embedded in a spreadsheet that was taking 15-20 minutes to do the exact same thing.” [via personal email]

Others have requested a PyPI package for it, which has been created. See PyPI package.

GitHub stats don’t mean too much, but scandir does have several watchers, issues, forks, etc. Here’s the run-down of the stats as of July 7, 2014:

  • Watchers: 17
  • Stars: 57
  • Forks: 20
  • Issues: 4 open, 26 closed

Also, because this PEP will increase the speed of os.walk() significantly, there are thousands of developers and scripts, and a lot of production code, that would benefit from it. For example, on GitHub, there are almost as many uses of os.walk (194,000) as there are of os.mkdir (230,000).

Rejected ideas

Naming

The only other real contender for this function’s name was iterdir(). However, iterX() functions in Python (mostly found in Python 2) tend to be simple iterator equivalents of their non-iterator counterparts. For example, dict.iterkeys() is just an iterator version of dict.keys(), but the objects returned are identical. In scandir()’s case, however, the return values are quite different objects (DirEntry objects vs filename strings), so this should probably be reflected by a difference in name – hence scandir().

See some relevant discussion on python-dev.

Wildcard support

FindFirstFile / FindNextFile on Windows support passing a “wildcard” like *.jpg, so at first folks (this PEP’s author included) felt it would be a good idea to include a windows_wildcard keyword argument to the scandir function so users could pass this in.

However, on further thought and discussion it was decided that this would be a bad idea, unless it could be made cross-platform (a pattern keyword argument or similar). This seems easy enough at first – just use the OS wildcard support on Windows, and something like fnmatch or re afterwards on POSIX-based systems.

Unfortunately the exact Windows wildcard matching rules aren’t really documented anywhere by Microsoft, and they’re quite quirky (see this blog post), meaning it’s very problematic to emulate using fnmatch or regexes.

So the consensus was that Windows wildcard support was a bad idea. It would be possible to add at a later date if there’s a cross-platform way to achieve it, but not for the initial version.
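Callers who want pattern matching can still layer it on top of scandir() themselves. The sketch below uses the standard fnmatch module, with the caveat already noted that its rules do not reproduce Windows wildcard semantics. The name scan_matching is hypothetical.

```python
import fnmatch
import os


def scan_matching(path, pattern):
    """Yield DirEntry objects whose name matches an fnmatch-style pattern."""
    for entry in os.scandir(path):
        # fnmatchcase() compares case-sensitively on every platform.
        if fnmatch.fnmatchcase(entry.name, pattern):
            yield entry
```

Because the filtering happens in Python after the directory has been read, this gives the same results on every platform, at the cost of not using the operating system’s own wildcard support.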

Read more on this Nov 2012 python-ideas thread and this June 2014 python-dev thread on PEP 471.

Methods not following symlinks by default

There was much debate on python-dev (see messages in this thread) over whether the DirEntry methods should follow symbolic links or not (when the is_X() methods had no follow_symlinks parameter).

Initially they did not (see previous versions of this PEP and the scandir.py module), but Victor Stinner made a pretty compelling case on python-dev that following symlinks by default is a better idea, because:

  • following links is usually what you want (in 92% of cases in the standard library, functions using os.listdir() and os.path.isdir() do follow symlinks)
  • that’s the precedent set by the similar functions os.path.isdir() and pathlib.Path.is_dir(), so to do otherwise would be confusing
  • with the non-link-following approach, if you wanted to follow links you’d have to say something like if (entry.is_symlink() and os.path.isdir(entry.path)) or entry.is_dir(), which is clumsy

As a case in point that shows the non-symlink-following version is error prone, this PEP’s author had a bug caused by getting this exact test wrong in his initial implementation of scandir.walk() in scandir.py (see Issue #4 here).

In the end there was not total agreement that the methods should follow symlinks, but there was basic consensus among the most involved participants, and this PEP’s author believes that the above case is strong enough to warrant following symlinks by default.

In addition, it’s straightforward to call the relevant methods with follow_symlinks=False if the other behaviour is desired.

DirEntry attributes being properties

In some ways it would be nicer for the DirEntry is_X() and stat() to be properties instead of methods, to indicate they’re very cheap or free. However, this isn’t quite the case, as stat() will require an OS call on POSIX-based systems but not on Windows. Even is_dir() and friends may perform an OS call on POSIX-based systems if the dirent.d_type value is DT_UNKNOWN (on certain file systems).

Also, people would expect the attribute access entry.is_dir to only ever raise AttributeError, not OSError in the case it makes a system call under the covers. Calling code would have to have a try/except around what looks like a simple attribute access, and so it’s much better to make them methods.

See this May 2013 python-dev thread where this PEP’s author makes this case and there’s agreement from core developers.

DirEntry fields being “static” attribute-only objects

In this July 2014 python-dev message, Paul Moore suggested a solution that was a “thin wrapper round the OS feature”, where the DirEntry object had only static attributes: name, path, and is_X, with the st_X attributes only present on Windows. The idea was to use this simpler, lower-level function as a building block for higher-level functions.

At first there was general agreement that simplifying in this way was a good thing. However, there were two problems with this approach. First, the assumption is that the is_dir and similar attributes are always present on POSIX, which isn’t the case (if d_type is not present or is DT_UNKNOWN). Second, it’s a much harder-to-use API in practice, as even the is_dir attributes aren’t always present on POSIX, and would need to be tested with hasattr() and then os.stat() called if they weren’t present.

See this July 2014 python-dev response from this PEP’s author detailing why this option is a non-ideal solution, and the subsequent reply from Paul Moore voicing agreement.

DirEntry fields being static with an ensure_lstat option

Another seemingly simpler and attractive option was suggested by Alyssa Coghlan in this June 2014 python-dev message: make DirEntry.is_X and DirEntry.lstat_result properties, and populate DirEntry.lstat_result at iteration time, but only if the new argument ensure_lstat=True was specified on the scandir() call.

This does have the advantage over the above in that you can easily get the stat result from scandir() if you need it. However, it has the serious disadvantage that fine-grained error handling is messy, because stat() will be called (and hence potentially raise OSError) during iteration, leading to a rather ugly, hand-made iteration loop:

    it = os.scandir(path)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(path, error)
        except StopIteration:
            break

Or it means that scandir() would have to accept an onerror argument – a function to call when stat() errors occur during iteration. This seems to this PEP’s author neither as direct nor as Pythonic as try/except around a DirEntry.stat() call.

Another drawback is that os.scandir() is written to make code faster. Always calling os.lstat() on POSIX would not bring any speedup. In most cases, you don’t need the full stat_result object – the is_X() methods are enough and this information is already known.

See Ben Hoyt’s July 2014 reply to the discussion summarizing this and detailing why he thinks the original PEP 471 proposal is “the right one” after all.

Return values being (name, stat_result) two-tuples

Initially this PEP’s author proposed this concept as a function called iterdir_stat() which yielded two-tuples of (name, stat_result). This does have the advantage that there are no new types introduced. However, the stat_result is only partially filled on POSIX-based systems (most fields set to None and other quirks), so they’re not really stat_result objects at all, and this would have to be thoroughly documented as different from os.stat().

Also, Python has good support for proper objects with attributes and methods, which makes for a saner and simpler API than two-tuples. It also makes the DirEntry objects more extensible and future-proof as operating systems add functionality and we want to include this in DirEntry.

See also some previous discussion:

Return values being overloaded stat_result objects

Another alternative discussed was making the return values overloaded stat_result objects with name and path attributes. However, apart from this being a strange (and strained!) kind of overloading, this has the same problems mentioned above – most of the stat_result information is not fetched by readdir() on POSIX systems, only (part of) the st_mode value.

Return values being pathlib.Path objects

With Antoine Pitrou’s new standard library pathlib module, it at first seems like a great idea for scandir() to return instances of pathlib.Path. However, pathlib.Path’s is_X() and stat() functions are explicitly not cached, whereas scandir has to cache them by design, because it’s (often) returning values from the original directory iteration system call.

And if the pathlib.Path instances returned by scandir cached stat values, but the ordinary pathlib.Path objects explicitly don’t, that would be more than a little confusing.

Guido van Rossum explicitly rejected pathlib.Path caching stat in the context of scandir here, making pathlib.Path objects a bad choice for scandir return values.

Possible improvements

There are many possible improvements one could make to scandir, but here is a short list of some this PEP’s author has in mind:

  • scandir could potentially be further sped up by calling readdir / FindNextFile say 50 times per Py_BEGIN_ALLOW_THREADS block so that it stays in the C extension module for longer, and may be somewhat faster as a result. This approach hasn’t been tested, but was suggested on Issue 11406 by Antoine Pitrou. [source9]
  • scandir could use a free list to avoid the cost of memory allocation for each iteration – a short free list of 10 or maybe even 1 may help. Suggested by Victor Stinner on a python-dev thread on June 27.

Previous discussion

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0471.rst

Last modified: 2025-02-01 08:59:27 GMT

