Add os.path.splitroot()


The longstanding os.path.splitdrive() function splits a path into a (drive, tail) pair. But in some cases more detail is wanted, specifically a (drive, root, tail) triad.

  1. The drive part has the same meaning as in splitdrive().
  2. The root part is one of: the empty string, a forward slash, a backward slash (Windows only), or two forward slashes (POSIX only).
  3. The tail part is everything following the root.

Similarly to splitdrive(), a splitroot() function would ensure that drive + root + tail is the same as the input path.

The extra level of detail reflects an extra step in the Windows ‘current path’ hierarchy – Windows has both a ‘current drive’, and a ‘current directory’ for one or more drives, which results in several kinds of non-absolute paths, e.g. ‘foo/bar’, ‘/foo/bar’, ‘X:foo/bar’.
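To make the contract concrete, here is a minimal sketch of the triad for Windows-style paths. It is not the proposed implementation: splitroot_sketch is a hypothetical name, and it simply peels at most one separator off the tail returned by ntpath.splitdrive().

```python
import ntpath


def splitroot_sketch(path):
    """Hypothetical helper: split a Windows-style path into (drive, root, tail)."""
    drive, rest = ntpath.splitdrive(path)
    # The root is at most one separator directly after the drive.
    if rest[:1] in ("\\", "/"):
        return drive, rest[:1], rest[1:]
    return drive, "", rest


for p in ["foo/bar", "/foo/bar", "X:foo/bar", "X:/foo/bar"]:
    drive, root, tail = splitroot_sketch(p)
    # The split is conservative: nothing is added, dropped or normalized.
    assert drive + root + tail == p
    print(repr(p), "->", (drive, root, tail))
```

Only the last of these four paths has both a drive and a root, which is why it is the only one that does not depend on a current drive or current directory.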

This three-part model is used successfully by pathlib, which exposes root as an attribute, and combines drive + root as an attribute called anchor. The anchor has useful properties, e.g. comparing two paths’ anchors can tell us whether a relative_to() operation is possible.
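As a quick illustration of those attributes, the sketch below uses PureWindowsPath so it behaves the same on any platform. The helpers show_parts and anchors_match are hypothetical names for this example only; matching anchors are a precondition for relative_to(), not a guarantee that it succeeds.

```python
from pathlib import PureWindowsPath


def show_parts(path_str):
    """Return the (drive, root, anchor) attributes pathlib exposes."""
    p = PureWindowsPath(path_str)
    return p.drive, p.root, p.anchor


def anchors_match(path_str, base_str):
    """Plain string comparison of anchors; note this is case-sensitive."""
    return PureWindowsPath(path_str).anchor == PureWindowsPath(base_str).anchor


print(show_parts("C:/Users/spam"))  # ('C:', '\\', 'C:\\')
print(show_parts("C:spam"))         # ('C:', '', 'C:')
print(show_parts("/spam"))          # ('', '\\', '\\')

# Different anchors: relative_to() cannot work and raises ValueError.
try:
    PureWindowsPath("D:/spam").relative_to("C:/")
except ValueError as exc:
    print("not relative:", exc)
```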

Pathlib has its own implementation of splitroot(), but its performance is hamstrung by its need for OS-agnosticism. By moving the implementation into ntpath and posixpath we can take advantage of OS-specific rules to improve performance.

Some previous discussion in a review thread: gh-68320, gh-88302 - Allow for `pathlib.Path` subclassing by barneygale · Pull Request #31691 · python/cpython · GitHub


@eryksun would you like to expand or correct the pitch?

Barney Gale:

Similarly to splitdrive(), a splitroot() function would ensure that drive + root + rel is the same as the input path.

The _split_root() method in pathlib removes repeated slashes at the root – such as r"C:\\spam" → r"C:\spam". It also splits an implicit root in a UNC path – e.g. r"\\server\share" → (r"\\server\share", "\\", "")[1]. Given the above quote, I assume that these non-conserving steps are intended to remain in _split_root(). Given the root is already split out, these steps won’t require calling normpath() and splitdrive(), so there will be just the single splitroot() call. Is that right?


WinAPI PathCchSkipRoot() is the basis for the builtin function nt._path_splitroot(). Offhand I recall that it has one restriction that’s over the top. It limits the use of the extended device prefix (i.e. "\\\\?\\") to just drive-letter names, volume GUID names, and the “UNC” device (e.g. r"\\?\C:\spam", r"\\?\Volume{12345678-1234-1234-1234-123456781234}\spam", and r"\\?\UNC\localhost\C$\spam"). It thus rejects a path such as r"\\?\BootPartition\spam" as an invalid parameter, even though it’s a valid path in the file API. This error case can be handled by substituting the normal device prefix (i.e. "\\\\.\\") in place of the extended prefix and retrying the PathCchSkipRoot() call.


  1. _split_root() also mistakenly does this for device paths such as r"\\.\C:". This changes the meaning of the path from being a volume device to being the root directory of the filesystem that mounts the volume.

Eryk Sun:

The _split_root() method in pathlib removes repeated slashes at the root – such as r"C:\\spam" → r"C:\spam". It also splits an implicit root in a UNC path – e.g. r"\\server\share" → (r"\\server\share", "\\", "")[1]. Given the above quote, I assume that these non-conserving steps are intended to remain in _split_root(). Given the root is already split out, these steps won’t require calling normpath() and splitdrive(), so there will be just the single splitroot() call. Is that right?

That’s correct. I think the pathlib change is pretty simple:

diff --git a/Lib/pathlib.py b/Lib/pathlib.py
index b959e85d18..003d980d7a 100644
--- a/Lib/pathlib.py
+++ b/Lib/pathlib.py
@@ -293,7 +293,10 @@ def _parse_parts(cls, parts):
         path = cls._flavour.join(*parts)
         if altsep:
             path = path.replace(altsep, sep)
-        drv, root, rel = cls._split_root(path)
+        drv, root, rel = cls._flavour.splitroot(path)
+        if drv.startswith(sep):
+            # UNC paths always have a root.
+            root = sep
         unfiltered_parsed = [drv + root] + rel.split(sep)
         parsed = [sys.intern(x) for x in unfiltered_parsed if x and x != '.']
         return drv, root, parsed

Extraneous slashes in the root are placed at the beginning of the rel part and are stripped out by the parsed = ... line.
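Here is a small sketch of why that filtering step is enough. The function filter_parts is a hypothetical stand-in for the last two lines of _parse_parts(), and the input triple is what I would assume a conservative splitroot() hands back for r"C:\\spam\.\eggs".

```python
import sys


def filter_parts(drv, root, rel, sep="\\"):
    """Mimic the tail end of _parse_parts(): split rel and drop '' and '.'."""
    unfiltered = [drv + root] + rel.split(sep)
    return [sys.intern(x) for x in unfiltered if x and x != "."]


# Assumed split of r"C:\\spam\.\eggs": the extra slash stays in front of rel.
drv, root, rel = "C:", "\\", "\\spam\\.\\eggs"
print(filter_parts(drv, root, rel))  # ['C:\\', 'spam', 'eggs']
```

The leading separator in rel produces an empty string when split, and the `if x` test throws it away along with any '.' components.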

Barney Gale:
+        if drv.startswith(sep):
+            # UNC paths always have a root.
+            root = sep

This part should be something like the following:

        if not root and drv.startswith(sep) and (
                not drv.startswith(device_prefixes) or
                drv.startswith(unc_device_prefix)):
            # UNC file shares always have a root.
            root = sep

where device_prefixes is ("\\\\.\\", "\\\\?\\") and unc_device_prefix is "\\\\?\\UNC\\".

An implicit root is split for file shares such as r"\\server\share" and r"\\?\UNC\server\share", while base device paths such as r"\\.\C:" and r"\\.\NUL" have no root. The latter are still absolute, however. The is_absolute() method should return true for all UNC paths, since they’re never relative to a working directory.
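The rule can be exercised in isolation with a standalone sketch. The function name effective_root and the upper-case constant names are hypothetical; the condition is the one suggested above. One assumption to flag: the prefix test here is case-sensitive, so a lower-case "unc" device name would not be recognized.

```python
SEP = "\\"
DEVICE_PREFIXES = ("\\\\.\\", "\\\\?\\")   # \\.\ and \\?\
UNC_DEVICE_PREFIX = "\\\\?\\UNC\\"         # \\?\UNC\


def effective_root(drv, root):
    """Return the root pathlib would use, adding an implicit one for shares."""
    if not root and drv.startswith(SEP) and (
            not drv.startswith(DEVICE_PREFIXES) or
            drv.startswith(UNC_DEVICE_PREFIX)):
        # UNC file shares always have a root.
        return SEP
    return root


for drv in (r"\\server\share", r"\\?\UNC\server\share", r"\\.\C:", r"\\.\NUL", "C:"):
    print(repr(drv), "->", repr(effective_root(drv, "")))
```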


Is there somewhere that explains the FULL path syntax on Windows? All I remember is C: and maybe \blah but I cannot follow the discussion here…

Be forewarned: this page uses some terms (like “absolute path”, “relative path”) a little differently than you might expect!

learn.microsoft.com

Naming Files, Paths, and Namespaces - Win32 apps


The docs for pathlib.PurePath.drive, root and anchor may also be useful:

Python documentation

pathlib — Object-oriented filesystem paths


Barney Gale:

Be forewarned: this page uses some terms (like “absolute path”, “relative path”) a little differently than you might expect!

That’s one of its mistakes. For example, it says that “\directory” is an absolute path that doesn’t depend on the current directory. No, it does depend on the current directory, and it is not an absolute path. When opened, “\directory” is relative to the drive or UNC share of the current directory. As a symlink target, “\directory” is relative to the drive of the opened path of the symlink.

Guido van Rossum:

Is there somewhere that explains the FULL path syntax on Windows?

Here are the supported MS-DOS path types that date back to the 1980s:

  • relative: “spam\eggs”
  • relative rooted (no drive): “\spam\eggs”
  • relative drive (no root): “Z:spam\eggs”
  • absolute drive: “Z:\spam\eggs”
  • absolute UNC: “\\server\share\spam\eggs”

The current working directory can be either an absolute drive path or an absolute UNC path. If the current directory is a UNC path, the share is handled as the current drive, such as for resolving a relative rooted path.

There’s also an optional working directory on each drive-letter drive. It defaults to the root directory on the drive. It gets used to resolve relative drive paths such as “Z:spam\eggs”. The API doesn’t force this feature on applications, but Python’s os.chdir() and C _wchdir() both opt into it.
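For anyone who prefers code to prose, the five MS-DOS path types can be told apart from the drive and root alone. This is only a lexical sketch: classify is a hypothetical helper built on ntpath.splitdrive(), and it makes no attempt to treat device paths separately from file shares.

```python
import ntpath


def classify(path):
    """Name the MS-DOS path type from the presence of a drive and a root."""
    drive, rest = ntpath.splitdrive(path)
    rooted = rest[:1] in ("\\", "/")
    if drive.startswith(("\\\\", "//")):
        return "absolute UNC"
    if drive:
        return "absolute drive" if rooted else "relative drive"
    return "relative rooted" if rooted else "relative"


for p in (r"spam\eggs", r"\spam\eggs", r"Z:spam\eggs",
          r"Z:\spam\eggs", r"\\server\share\spam\eggs"):
    print(f"{p!r:32} {classify(p)}")
```

The two "relative" cases with only a drive or only a root are the ones resolved against the per-drive working directory and the current drive, respectively.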

Windows supports an additional path type that wasn’t present in MS-DOS: device paths for canonical device names and mapped drives[1]. These come in two flavors: normalized and extended (literal). The prefix for a normalized device path is “\\.\” (e.g. “\\.\C:”), and the prefix for an extended device path is “\\?\” (e.g. “\\?\UNC\server\share”). Opening a volume device requires the use of a device path. For example, “C:” gets resolved relative to the working directory on the drive, while “\\.\C:” is an absolute path for the volume.

Path normalization applies to all path types when opened, except for “\\?\” extended (literal) paths. Normalization replaces forward slashes with backslashes, removes repeated slashes, resolves “.” and “..” components, and removes trailing spaces and dots from the final component. Normalized paths may be limited to MAX_PATH (260) characters, or sometimes less. The native NT limit of about 32760 characters is possible if long normalized paths are enabled for both the system and the application. (Python 3.6+ enables long paths, but it still depends on the system setting.) Using an extended path allows reliable access to long paths up to about 32760 characters, but one has to be careful to first normalize the path via GetFullPathNameW() (i.e. os.path.abspath()).
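A rough sketch of the "normalize first, then add the extended prefix" step follows. The name to_extended is hypothetical, the input is assumed to already be an absolute drive or UNC path, and ntpath.normpath() is used only as a portable, purely lexical stand-in for GetFullPathNameW(): it handles slashes, "." and "..", but it does not strip trailing dots and spaces or consult the working directory.

```python
import ntpath

EXTENDED = "\\\\?\\"          # \\?\
EXTENDED_UNC = "\\\\?\\UNC\\"  # \\?\UNC\


def to_extended(abs_path):
    """Return an extended (literal) form of an absolute drive or UNC path."""
    if abs_path.startswith(EXTENDED):
        # Already literal: it must not be normalized again.
        return abs_path
    norm = ntpath.normpath(abs_path)
    if norm.startswith("\\\\"):
        # \\server\share\... becomes \\?\UNC\server\share\...
        return EXTENDED_UNC + norm[2:]
    return EXTENDED + norm


print(to_extended("C:/spam//eggs/../ham.txt"))
print(to_extended("//server/share/dir/./file"))
```

On a real Windows system, os.path.abspath() would be the better normalizer, as noted above.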


  1. The current directory is not documented to support device paths, even if they’re for a filesystem directory such as “\\?\C:\Windows”. It may seem to work, but the API isn’t tested to support it, and it has serious bugs that result in nonsense paths.


That’s an awesome summary, Eryk – I think I knew all of the MS-DOS flavors but the device paths are new to me (and what confused me in the discussion).

I guess there are also some additional wrinkles like case normalization, things like NUL (what’s the list of those?), and long vs. short (8+3 IIRC) paths. Also code pages, character sets, UTF-16.

Guido van Rossum:

things like NUL (what’s the list of those?)

https://github.com/python/cpython/blob/main/Lib/pathlib.py#L33-L39

Guido van Rossum:

I guess there are also some additional wrinkles like case normalization, things like NUL (what’s the list of those?), and long vs. short (8+3 IIRC) paths. Also code pages, character sets, UTF-16.

  • In the internal NT API, filenames are 16-bit Unicode strings. It’s not strictly UTF-16 because surrogate codes are not validated as surrogate pairs.
  • The API also does not normalize filenames to a particular Unicode normal form (e.g. “NFC” or “NFKC”).
  • If a filesystem directory is case insensitive, name comparisons first translate to upper case using a locale-invariant case table. One-to-many case conversions are not supported (e.g. “ß” maps to “ß”, not to “SS”).
  • Starting with Windows 10, NTFS supports case-sensitive directories.

For bytes paths, Python 3.6+ uses UTF-8 as the filesystem encoding. Bytes paths get decoded to wide-character strings before calling system functions. The error handler is “surrogatepass” due to the possibility of lone surrogate codes in filenames. This is sometimes called 8-bit Wobbly Transformation Format (WTF-8).
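The effect of the “surrogatepass” handler is easy to see with the codec alone, without touching the filesystem. In this sketch encode_name and decode_name are hypothetical helpers, and the filename containing a lone surrogate is made up for the example.

```python
def encode_name(name):
    """Encode a str filename the way described above: UTF-8 + surrogatepass."""
    return name.encode("utf-8", "surrogatepass")


def decode_name(raw):
    """Inverse of encode_name()."""
    return raw.decode("utf-8", "surrogatepass")


lone = "spam\ud800.txt"  # contains a lone (unpaired) surrogate code

try:
    lone.encode("utf-8")  # strict UTF-8 refuses lone surrogates
except UnicodeEncodeError as exc:
    print("strict encoding fails:", exc.reason)

raw = encode_name(lone)
print(raw)                       # b'spam\xed\xa0\x80.txt'
print(decode_name(raw) == lone)  # True
```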


Regarding short filenames[1], they’re a legacy feature for compatibility with ancient applications.

  • ReFS and exFAT filesystems do not support short filenames.
  • NTFS allows disabling the automatic creation of short filenames, either for individual filesystems or system-wide, and they can be stripped from existing files. This can improve performance since NTFS stores short filenames as separate, specially-flagged entries in a directory.
  • FAT32 generates short filenames that can include non-ASCII OEM characters[2], which violates the documented specification. It also uses a best-fit encoding that can be problematic. For example, given OEM is code page 850, “spĀm.txt” has the associated short name “SPAM.TXT”. In this case, most people will be surprised that opening or creating “spam.txt” actually opens or replaces “spĀm.txt”.

The list of reserved DOS device names includes “NUL”, “CON”, “CONIN$”, “CONOUT$”, “AUX”, “PRN”, “COM<1-9>” and “LPT<1-9>”. The names are case insensitive. These devices are virtually present in the unqualified current directory on all Windows versions, just like the drive-letter names “A:” through “Z:”. The device name can be followed by a colon and any number of dots and spaces. For example:

>>> stat.S_ISCHR(os.stat('CONIN$:. . . .').st_mode)
True
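The naming rule just described can be approximated lexically. The sketch below is an assumption-laden illustration rather than what Windows actually runs: looks_like_dos_device is a hypothetical helper that checks a single unqualified name against the list above, allowing an optional colon and trailing dots and spaces.

```python
import re

# Names from the list above, an optional colon, then any dots and spaces.
_DOS_DEVICE = re.compile(
    r"(?:NUL|CON|CONIN\$|CONOUT\$|AUX|PRN|COM[1-9]|LPT[1-9]):?[. ]*",
    re.IGNORECASE,
)


def looks_like_dos_device(name):
    """Return True if an unqualified name matches a reserved DOS device."""
    return _DOS_DEVICE.fullmatch(name) is not None


for name in ("CONIN$:. . . .", "nul", "COM1", "COM0", "spam.txt", "CON:/spam"):
    print(f"{name!r:18} {looks_like_dos_device(name)}")
```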

Unlike drive-letter names, the virtually present DOS device names cannot have a path, and the optional colon is not part of the real device name. For example:

>>> os.getcwd()
'C:\\Temp'
>>> nt._getfullpathname('CON:/spam')
'C:\\Temp\\CON:\\spam'
>>> nt._getfullpathname('CON:')
'\\\\.\\CON'

Prior to Windows 11, DOS device names are reserved in a wider range of cases than drive-letter names:

  • DOS device names can have an extension that gets ignored (e.g. “CON.txt”).
  • DOS device names are present in the explicitly referenced current directory (e.g. “.\CON”), as well as the parent directory of most opened paths (e.g. “C:\Temp\CON”), except never in UNC paths.

For some reason the latter behavior is still implemented for the “NUL” device on Windows 11. For example:

>>> nt._getfullpathname('./NUL')
'\\\\.\\NUL'
>>> nt._getfullpathname('Temp/NUL')
'\\\\.\\NUL'
>>> nt._getfullpathname('C:/Temp/NUL')
'\\\\.\\NUL'

DOS devices have never been virtually present in UNC share paths and device paths, in which case they’re just regular filenames, at least as far as the API is concerned. For example:

>>> nt._getfullpathname('//localhost/C$/Temp/NUL')
'\\\\localhost\\C$\\Temp\\NUL'
>>> nt._getfullpathname('//./C:/Temp/NUL')
'\\\\.\\C:\\Temp\\NUL'

A filesystem or filesystem redirector (e.g. SMB) may disallow creating DOS device names, even in cases that the API doesn’t reserve. For example:

>>> open('//localhost/C$/Temp/NUL', 'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
PermissionError: [Errno 13] Permission denied: '//localhost/C$/Temp/NUL'

  1. Short filenames are specified in [MS-FSCC] 2.1.5.2.1.

  2. Currently on Windows 11 there’s an OEM decoding bug in the system runtime library, at least on the systems I’ve checked. The function RtlOemStringToUnicodeString() mistakenly uses the ANSI codepage instead of the OEM codepage. Thus short names that contain non-ASCII OEM characters are returned as mojibake. This also affects GetShortPathNameW().


Just in case anyone reading is struggling to relate the discussion of reserved names, normalization, etc, back to the original proposal: they’re important parts of the wider picture of how Windows paths work, and they influence how pathlib normalizes paths, but it’s perhaps worth noting that reserved names, normalization, 8+3, etc, don’t have a direct bearing on the proposed os.path.splitroot() function, because it’s designed to be conservative: input_path = drive + root + tail.

Barney Gale:

Just in case anyone reading is struggling to relate the discussion of reserved names, normalization, etc, back to the original proposal

I know it’s an off-topic side discussion. I was trying to give Guido a summary response to his questions about Windows paths and figured I may as well answer on the public forum.

On the POSIX side of the splitroot() problem, the only gotcha I can think of is a path with two leading slashes, per the specification of “pathname” and pathname resolution:

  • Multiple successive <slash> characters are considered to be the same as one <slash>, except for the case of exactly two leading <slash> characters.
  • If a pathname begins with two successive <slash> characters, the first component following the leading <slash> characters may be interpreted in an implementation-defined manner, although more than two leading <slash> characters shall be treated as a single <slash> character.

I don’t think any of Python’s officially supported POSIX platforms has special handling for two leading slashes. Cygwin and MSYS2 reserve it for UNC paths.
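Following the two rules quoted above, a POSIX-flavoured split could look roughly like this. It is a sketch under the assumption that exactly two leading slashes form their own root and three or more collapse to a single-slash root with the rest left in the tail; posix_splitroot_sketch is a hypothetical name, and the drive is always empty.

```python
def posix_splitroot_sketch(path):
    """Split a POSIX-style str path into ('', root, tail) without normalizing."""
    if not path.startswith("/"):
        # Relative path: no root at all.
        return "", "", path
    if path.startswith("//") and not path.startswith("///"):
        # Exactly two leading slashes: implementation-defined, keep as root.
        return "", "//", path[2:]
    # One slash, or three or more (treated as a single slash).
    return "", "/", path[1:]


for p in ("foo/bar", "/foo/bar", "//foo/bar", "///foo/bar"):
    drive, root, tail = posix_splitroot_sketch(p)
    assert drive + root + tail == p  # conservative, like splitdrive()
    print(repr(p), "->", (drive, root, tail))
```

Leaving the extra slashes of "///foo/bar" in the tail keeps the split conservative; collapsing them is a separate normalization step.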

Eryk Sun:

I know it’s an off-topic side discussion. I was trying to give Guido a summary response to his questions about Windows paths and figured I may as well answer on the public forum.

Ack, I didn’t mean to come across as telling either you or Guido to stop talking about closely related topics. It’s fine by me and I seem to learn something new from every one of your posts! I was trying to put my proposal in context for anyone else who might be reading this thread. Sorry for not being clear.


I’ve logged a feature request and a PR. I found that several functions in ntpath and posixpath were already doing their own parsing of path roots that could be replaced by splitroot(); I think this helps demonstrate the usefulness of this function.
