

PEP 708 – Extending the Repository API to Mitigate Dependency Confusion Attacks

Author:
Donald Stufft <donald at stufft.io>
PEP-Delegate:
Paul Moore <p.f.moore at gmail.com>
Discussions-To:
Discourse thread
Status:
Provisional
Type:
Standards Track
Topic:
Packaging
Created:
20-Feb-2023
Post-History:
01-Feb-2023, 23-Feb-2023
Resolution:
Discourse message


Provisional Acceptance

This PEP has been provisionally accepted, with the following required conditions before the PEP is made Final:

  1. An implementation of the PEP in PyPI (Warehouse), including any necessary UI elements to allow project owners to set the tracking data.
  2. An implementation of the PEP in at least one repository other than PyPI, as you can’t really test merging indexes without at least two indexes.
  3. An implementation of the PEP in pip, which supports the intended semantics and can be used to demonstrate that the expected security benefits are achieved. This implementation will need to be “off by default” initially, which means that users will have to opt in to testing it. Ideally, we should collect explicit positive reports from users (both project owners and project users) who have successfully tried out the new feature, rather than just assuming that “no news is good news”.

Abstract

Dependency confusion attacks, in which a malicious package is installed instead of the one the user expected, are an increasingly common supply chain threat. Most such attacks against Python dependencies, including the recent PyTorch incident, occur with multiple package repositories, where a dependency expected to come from one repository (e.g. a custom index) is installed from another (e.g. PyPI).

To help address this problem, this PEP proposes extending the Simple Repository API to allow repository operators to indicate that a project found on their repository “tracks” a project on different repositories, and allows projects to extend their namespaces across multiple repositories.

These features will allow installers to determine when a project being made available from a particular mix of repositories is expected and should be allowed, and when it is not and should halt the install with an error to protect the user.

Motivation

There is a long-standing class of attacks called “dependency confusion” attacks, which roughly boil down to a user expecting to get package A, but instead getting B. In Python, this almost always happens due to the configuration of multiple repositories (possibly including the default of PyPI), where they expected package A to come from repository X, but someone is able to publish package B to repository Y under the same name.

Dependency confusion attacks have long been possible, but they’ve recently gained press with public examples of cases where these attacks were successfully executed.

A specific example of this is the recent case where the PyTorch project had an internal package named torchtriton which was only ever intended to be installed from their repositories located at https://download.pytorch.org/, but that repository was designed to be used in conjunction with PyPI, and the name of torchtriton was not claimed on PyPI, which allowed the attacker to use that name and publish a malicious version.

There are a number of ways to mitigate against these attacks today, but they all require that the end user go out of their way to protect themselves, rather than being protected by default. This means that the vast bulk of users are likely to remain vulnerable, even if they are ultimately aware of these types of attacks.

Ultimately the underlying cause of these attacks comes from the fact that there is no globally unique namespace that all Python package names come from. Instead, each repository is its own distinct namespace, and when given an “abstract” name such as spam to install, an installer has to implicitly turn that into a “concrete” name such as pypi.org:spam or example.com:spam. Currently the standard behavior in Python installation tools is to implicitly flatten these multiple namespaces into one that contains the files from all namespaces.

This assumption that collapsing the namespaces is what was expected means that when packages with the same name in different repositories are authored by different parties (such as in the torchtriton case), dependency confusion attacks become possible.

This is made particularly tricky in that there is no “right” answer; there are valid use cases both for wanting two repositories merged into one namespace and for wanting two repositories to be treated as distinct namespaces. This means that an installer needs some mechanism by which to determine when it should merge the namespaces of multiple repositories and when it should not, rather than a blanket always-merge or never-merge rule.

This functionality could be pushed directly to the end user, since ultimately the end user is the person whose expectations of what gets installed from what repository actually matter. However, by extending the repository specification to allow a repository to indicate when merging is safe, we can enable individual projects and repositories to “work by default”, even when their project naturally spans multiple distinct namespaces, while maintaining the ability for an installer to be secure by default.

On its own, this PEP does not solve dependency confusion attacks, but what it does do is provide enough information so that installers can prevent them without causing too much collateral damage to otherwise valid and safe use cases.

Rationale

There are two broad use cases for merging names across repositories that this PEP seeks to enable.

The first use case is when one repository is not defining its own names, but rather is extending names defined in other repositories. This commonly happens in cases where a project is being mirrored from one repository to another (see Bandersnatch) or when a repository is providing supplementary artifacts for a specific platform (see Piwheels).

In this case neither the repositories nor the projects that are being extended may have any knowledge that they are being extended or by whom, so this cannot rely on any information that isn’t present in the “extending” repository itself.

The second use case is when the project wants to publish to one “main” repository, but then have additional repositories that provide binaries for additional platforms, GPUs, CPUs, etc. Currently wheel tags are not sufficiently able to express these types of binary compatibility, so projects that wish to rely on them are forced to set up multiple repositories and have their users manually configure them to get the correct binaries for their platform, GPU, CPU, etc.

This use case is similar to the first, but the important difference that makes it a distinct use case on its own is who is providing the information and what their level of trust is.

When a user configures a specific repository (or relies on the default) there is no ambiguity as to which repository they mean. A repository is identified by a URL, and through the domain system, URLs are globally unique identifiers. This lack of ambiguity means that an installer can assume that the repository operator is trustworthy and can trust the metadata that they provide without needing to validate it.

On the flip side, when an installer finds a name in multiple repositories, it is ambiguous which of them the installer should trust. This ambiguity means that an installer cannot assume that the project owner on either repository is trustworthy and needs to validate that they are indeed the same project and that one isn’t a dependency confusion attack.

Without some way for the installer to validate the metadata between multiple repositories, projects would be forced into becoming repository operators to safely support this use case. That wouldn’t be a particularly wrong choice to make; however, there is a danger that if we don’t provide a way for repositories to let project owners express this relationship safely, they will be incentivized to let them use the repository operator’s metadata instead, which would reintroduce the original insecurity.

Specification

This specification defines the changes in version 1.2 of the simple repository API, adding two new metadata items: Repository “Tracks” and “Alternate Locations”.

Repository “Tracks” Metadata

To enable one repository to host a project that is intended to “extend” a project that is hosted at other repositories, this PEP allows the extending repository to declare that a particular project “tracks” a project at another repository or repositories by adding the URLs of the project and repositories that it is extending.

This is exposed in JSON as the key meta.tracks and in HTML as a meta element named pypi:tracks on the project-specific URLs ($root/$project/).

There are a few key properties that MUST be preserved when using this metadata:

  • It MUST be under the control of the repository operators themselves, not any individual publisher using that repository.
    • “Repository Operator” can also include anyone who manages the overall namespace for a particular repository, which may be the case in situations like hosted repository services where one entity operates the software but another owns/manages the entire namespace of that repository.
  • All URLs MUST represent the same “project” as the project in the extending repository.
    • This does not mean that they need to serve the same files. It is valid for them to include binaries built on different platforms, copies with local patches applied, etc. This is purposefully left vague as it’s ultimately up to the expectations that the users have of the repository and its operators what exactly constitutes the “same” project.
  • It MUST point to the repositories that “own” the namespaces, not another repository that is also tracking that namespace.
  • It MUST point to a project with the exact same name (after normalization).
  • It MUST point to the actual URLs for that project, not the base URL for the extended repositories.

It is NOT required that every name in a repository tracks the same repository, or that they all track a repository at all. Mixed-use repositories where some names track a repository and some names do not are explicitly allowed.

JSON

{"meta":{"api-version":"1.2","tracks":["https://pypi.org/simple/holygrail/","https://test.pypi.org/simple/holygrail/"]},"name":"holygrail","files":[{"filename":"holygrail-1.0.tar.gz","url":"https://example.com/files/holygrail-1.0.tar.gz","hashes":{"sha256":"...","blake2b":"..."},"requires-python":">=3.7","yanked":"Had a vulnerability"},{"filename":"holygrail-1.0-py3-none-any.whl","url":"https://example.com/files/holygrail-1.0-py3-none-any.whl","hashes":{"sha256":"...","blake2b":"..."},"requires-python":">=3.7","dist-info-metadata":true}]}

HTML

<!DOCTYPE html>
<html>
  <head>
    <meta name="pypi:repository-version" content="1.2">
    <meta name="pypi:tracks" content="https://pypi.org/simple/holygrail/">
    <meta name="pypi:tracks" content="https://test.pypi.org/simple/holygrail/">
  </head>
  <body>
    <a href="https://example.com/files/holygrail-1.0.tar.gz#sha256=...">
    <a href="https://example.com/files/holygrail-1.0-py3-none-any.whl#sha256=...">
  </body>
</html>
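To make the shape of this metadata concrete, the following is a minimal, non-normative Python sketch of how an installer might read the tracks list from a project’s JSON response and check the “exact same name (after normalization)” rule above. The helper names are illustrative only and are not part of this specification.

import re
from urllib.parse import urlsplit

def canonicalize_name(name):
    # PEP 503 name normalization.
    return re.sub(r"[-_.]+", "-", name).lower()

def tracked_project_urls(project_response):
    # project_response is the parsed JSON body of $root/$project/ (illustrative).
    name = canonicalize_name(project_response["name"])
    tracks = project_response.get("meta", {}).get("tracks", [])
    for url in tracks:
        # Each URL must be the project URL on the tracked repository
        # (e.g. ".../simple/holygrail/"), not the repository's base URL.
        last_segment = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
        if canonicalize_name(last_segment) != name:
            raise ValueError(f"{url} does not point at the project {name!r}")
    return tracks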

“Alternate Locations” Metadata

To enable a project to extend its namespace across multiple repositories, this PEP allows a project owner to declare a list of “alternate locations” for their project. This is exposed in JSON as the key alternate-locations and in HTML as a meta element named pypi:alternate-locations, which may be used multiple times.

There are a few key properties that MUST be observed when using this metadata:

  • In order for this metadata to be trusted, there MUST be agreement between all locations where that project is found as to what the alternate locations are.
  • When using alternate locations, clients MUST implicitly assume that the URL the response was fetched from was included in the list. This means that if you fetch from https://pypi.org/simple/foo/ and it has an alternate-locations metadata that has the value ["https://example.com/simple/foo/"], then you MUST treat it as if it had the value ["https://example.com/simple/foo/", "https://pypi.org/simple/foo/"].
  • Order of the elements within the array does not have any particular meaning.

When an installer encounters a project that is using the alternate locations metadata it SHOULD consider that all repositories named are extending the same namespace across multiple repositories.

Note

This alternate locations metadata is project-level metadata, not artifact-level metadata, which means it doesn’t get included as part of the core metadata spec, but rather it is something that each repository will have to provide a configuration option for (if they choose to support it).
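The following is a small, non-normative Python sketch of the two rules above: the fetched URL is implicitly part of the list, and the metadata is only trusted when every location agrees on the full set. The function names are illustrative, not part of the specification.

def effective_alternate_locations(fetched_url, declared):
    # The URL the response was fetched from is implicitly part of the list.
    return set(declared) | {fetched_url}

def alternate_locations_agree(responses):
    # responses maps each fetched project URL to the alternate-locations it
    # declared; the metadata is only trusted if every location yields the same set.
    sets = [
        effective_alternate_locations(url, declared)
        for url, declared in responses.items()
    ]
    if not sets:
        return True
    return all(s == sets[0] for s in sets)

# For example, fetching https://pypi.org/simple/foo/ that declares
# ["https://example.com/simple/foo/"] yields the merged set
# {"https://pypi.org/simple/foo/", "https://example.com/simple/foo/"}.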

JSON

{"meta":{"api-version":"1.2"},"name":"holygrail","alternate-locations":["https://pypi.org/simple/holygrail/","https://test.pypi.org/simple/holygrail/"],"files":[{"filename":"holygrail-1.0.tar.gz","url":"https://example.com/files/holygrail-1.0.tar.gz","hashes":{"sha256":"...","blake2b":"..."},"requires-python":">=3.7","yanked":"Had a vulnerability"},{"filename":"holygrail-1.0-py3-none-any.whl","url":"https://example.com/files/holygrail-1.0-py3-none-any.whl","hashes":{"sha256":"...","blake2b":"..."},"requires-python":">=3.7","dist-info-metadata":true}]}

HTML

<!DOCTYPE html>
<html>
  <head>
    <meta name="pypi:repository-version" content="1.2">
    <meta name="pypi:alternate-locations" content="https://pypi.org/simple/holygrail/">
    <meta name="pypi:alternate-locations" content="https://test.pypi.org/simple/holygrail/">
  </head>
  <body>
    <a href="https://example.com/files/holygrail-1.0.tar.gz#sha256=...">
    <a href="https://example.com/files/holygrail-1.0-py3-none-any.whl#sha256=...">
  </body>
</html>

Recommendations

This section is non-normative; it provides recommendations to installers on how to interpret this metadata that this PEP feels provide the best tradeoff between protecting users by default and minimizing breakages to existing workflows. These recommendations are not binding, and installers are free to ignore them, or apply them selectively as they make sense in their specific situations.

File Discovery Algorithm

Note

This algorithm is written based on how pip currently discovers files; other installers may adapt this based on their own discovery procedures.

Currently the “standard” file discovery algorithm looks something like this:

  1. Generate a list of all files across all configured repositories.
  2. Filter out any files that do not match known hashes from a lockfile or requirements file.
  3. Filter out any files that do not match the current platform, Python version, etc.
  4. Pass that list of files into the resolver where it will attempt to resolve the “best” match out of those files, irrespective of which repository it came from.

It is recommended that installers change their file discovery algorithm to take the new metadata into account, and instead do the following (a non-normative sketch follows the list):

  1. Generate a list of all files across all configured repositories.
  2. Filter out any files that do not match known hashes from a lockfile or requirements file.
  3. If the end user has explicitly told the installer to fetch the project from specific repositories, filter out all other repositories and skip to 5.
  4. Look to see if the discovered files span multiple repositories; if they do, then determine if either “Tracks” or “Alternate Locations” metadata allows safely merging ALL of the repositories where files were discovered together. If that metadata does NOT allow that, then generate an error, otherwise continue.
    • Note: This only applies to remote repositories; repositories that exist on the local filesystem SHOULD always be implicitly allowed to be merged with any remote repository.
  5. Filter out any files that do not match the current platform, Python version, etc.
  6. Pass that list of files into the resolver where it will attempt to resolve the “best” match out of those files, irrespective of which repository it came from.
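The following is a non-normative Python sketch of that recommended flow. The repository objects, their accessors, and the merge check are illustrative stand-ins for whatever an installer actually uses internally; they are not pip’s real API.

class DependencyConfusionError(Exception):
    """Files span remote repositories that no metadata allows merging."""

def discover_files(project, repositories, merge_allowed,
                   explicit_repos=None, known_hashes=None):
    # merge_allowed(project, repos) should answer whether "Tracks" or
    # "Alternate Locations" metadata permits merging all of the given repos.

    # 1. Gather candidate files from every configured repository.
    candidates = [f for repo in repositories for f in repo.files_for(project)]

    # 2. Drop files whose hashes are not in the lockfile/requirements file.
    if known_hashes is not None:
        candidates = [f for f in candidates if f.hash in known_hashes]

    if explicit_repos:
        # 3. Explicit per-project configuration wins; skip the merge check.
        candidates = [f for f in candidates if f.repository in explicit_repos]
    else:
        # 4. Refuse to merge multiple remote repositories unless the
        #    metadata says the merge is safe; local repositories are exempt.
        remote = {f.repository for f in candidates if not f.repository.is_local}
        if len(remote) > 1 and not merge_allowed(project, remote):
            raise DependencyConfusionError(project)

    # 5. Drop files that do not match the current platform, Python version, etc.
    candidates = [f for f in candidates if f.is_compatible()]

    # 6. Hand the remaining files to the resolver.
    return candidates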

This is somewhat subtle, but the key things in the recommendation are:

  • Users who are using lock files or requirements files that include specific hashes of artifacts that are “valid” are assumed to be protected by nature of those hashes, since the rest of these recommendations would apply during hash generation. Thus, we filter out unknown hashes up front.
  • If the user has explicitly told the installer that it wants to fetch a project from a certain set of repositories, then there is no reason to question that and we assume that they’ve made sure it is safe to merge those namespaces.
  • If the project in question only comes from a single repository, then there is no chance of dependency confusion, so there’s no reason to do anything but allow.
  • We check for the metadata in this PEP before filtering out based on platform, Python version, etc., because we don’t want errors that only show up on certain platforms, Python versions, etc.
  • If nothing tells us merging the namespaces is safe, we refuse to implicitly assume it is, and generate an error instead.
  • Otherwise we merge the namespaces, and continue on.

This algorithm ensures that an installer never assumes that two disparate namespaces can be flattened into one, which for all practical purposes eliminates the possibility of any kind of dependency confusion attack, while still giving power throughout the stack in a safe way to allow people to explicitly declare when those disparate namespaces are actually one logical namespace that can be safely merged.

The above algorithm is mostly a conceptual model. In reality the algorithm may end up being slightly different in order to be more privacy preserving and faster, or even just adapted to fit a specific installer better.

Explicit Configuration for End Users

This PEP avoids dictating or recommending a specific mechanism by which an installer allows an end user to configure exactly what repositories they want a specific package to be installed from. However, it does recommend that installers do provide some mechanism for end users to provide that configuration, as without it users can end up in a DoS situation in cases like torchtriton where they’re just completely broken unless they resolve the namespace collision externally (get the name taken down on one repository, stand up a personal repository that handles the merging, etc.).

This configuration also allows end users to pre-emptively secure themselves during what is likely to be a long transition until the default behavior is safe.

How to Communicate This

Note

This example is pip specific and assumes specifics about how pip will choose to implement this PEP; it’s included as an example of how we can communicate this change, and not intended to constrain pip or any other installer in how they implement this. This may ultimately be the actual basis for communication, and if so will need to be edited for accuracy and clarity.

This section should be read as if it were an entire “post” to communicate this change that could be used for a blog post, email, or discourse post.

There’s a long-standing class of attacks called “dependency confusion” attacks, which roughly boil down to a user expecting to get package A, but instead getting B. In Python, this almost always happens due to the end user having configured multiple repositories, where they expect package A to come from repository X, but someone is able to publish package B with the same name as package A in repository Y.

There are a number of ways to mitigate against these attacks today, but they all require that the end user explicitly go out of their way to protect themselves, rather than it being inherently safe.

In an effort to secure pip’s users and protect them from these types of attacks, we will be changing how pip discovers packages to install.

What is Changing?

When pip discovers that the same project is available from multiple remote repositories, by default it will generate an error and refuse to proceed rather than make a guess about which repository was the correct one to install from.

Projects that natively publish to multiple repositories will be given the ability to safely “link” their repositories together so that pip does not error when those repositories are used together.

End users of pip will be given the ability to explicitly define one or more repositories that are valid for a specific project, causing pip to only consider those repositories for that project, and avoiding generating an error altogether.

See TBD for more information.

Who is Affected?

Users who are installing from multiple remote (i.e. not present on the local filesystem) repositories may be affected by having pip error instead of successfully installing if:

  • They install a project where the same “name” is being served by multiple remote repositories.
  • The project name that is available from multiple remote repositories has not used one of the defined mechanisms to link those repositories together.
  • The user invoking pip has not used the defined mechanism to explicitly control what repositories are valid for a particular project.

Users who are not using multiple remote repositories will not be affected at all, which includes users who are only using a single remote repository, plus a local filesystem “wheel house”.

What do I need to do?

As a pip User?

If you’re using only a single remote repository you do not have to do anything.

If you’re using multiple remote repositories, you can opt into the new behavior by adding --use-feature=TBD to your pip invocation to see if any of your dependencies are being served from multiple remote repositories. If they are, you should audit them to determine why they are, and what the best remediation step will be for you.

Once this behavior becomes the default, you can opt out of it temporarily by adding --use-deprecated=TBD to your pip invocation.

If you’re using projects that are not hosted on a public repository, but you still have the public repository as a fallback, consider configuring pip with a repository file to be explicit about where that dependency is meant to come from, so that registration of that name in a public repository cannot cause pip to error for you.

As a Project Owner?

If you only publish your project to a single repository, then you do not have to do anything.

If you publish your project to multiple repositories that are intended to be used together at the same time, configure all repositories to serve the alternate repository metadata to prevent breakages for your end users.

If you publish your project to a single repository, but it is commonly used in conjunction with other repositories, consider preemptively registering your names with those repositories to prevent a third party from being able to cause your users’ pip install invocations to start failing. This may not be available if your project name is too generic or if the repositories have policies that prevent defensive name squatting.

As a Repository Operator?

You’ll need to decide how you intend for your repository to be used by your end users and how you want them to use it.

For private repositories that host private projects, it is recommended that you mirror the public projects that your users depend on into your own repository, taking care not to let a public project merge with a private project, and tell your users to use the --index-url option to use only your repository.

For public repositories that host public projects, you should implement the alternate repository mechanism and enable the owners of those projects to configure the list of repositories that their project is available from if they make it available from more than one repository.

For public repositories that “track” another repository, but provide supplemental artifacts such as wheels built for a specific platform, you should implement the “tracks” metadata for your repository. However, this information MUST NOT be settable by end users who are publishing projects to your repository. See TBD for more information.

Rejected Ideas

Note: Some of these are somewhat specific to pip, but any solution that doesn’t work for pip isn’t a particularly useful solution.

Implicitly allow mirrors when the list of files is the same

If every repository returns the exact same list of files, then it is safe to consider those repositories to be the same namespace and implicitly merge them. This would possibly mean that mirrors would be automatically allowed without any work on any user or repository operator’s part.

Unfortunately, this has two failings that make it undesirable:

  • It only solves the case of mirrors that are exact copies of each other, but not repositories that “track” another one, which ends up being a more generic solution.
  • Even in the case of exact mirrors, multiple repositories mirroring each other form a distributed system that will not always be fully consistent, effectively an eventually consistent system. This means that repositories that relied on this implicit heuristic to work would have sporadic failures due to drift between the source repository and the mirror repositories.

Provide a mechanism to order the repositories

Providing some mechanism to give the repositories an order, and then short-circuiting the discovery algorithm when it finds the first repository that provides files for that project, is another workable solution that is safe if the order is specified correctly.

However, this has been rejected for a number of reasons:

  • We’ve spent 15+ years educating users that the ordering of repositories being specified is not meaningful, and they effectively have an undefined order. It would be difficult to backpedal on that and start saying that now order matters.
  • Users can easily rearrange the order that they specify their repositories in within a single location, but when loading repositories from multiple locations (env var, conf file, requirements file, cli arguments) the order is hard coded into pip. While it would be a deterministic and documented order, there’s no reason to assume it’s the order that the user wants their repositories to be defined in, forcing them to contort how they configure pip so that the implicit ordering ends up being the correct one.
  • The above can be mitigated by providing a way to explicitly declare the order rather than by implicitly using the order they were defined in; however, that then means that the protections are not provided unless the user does some explicit configuration.
  • Ordering assumes that one repository is always preferred over another repository without any way to decide on a project-by-project basis.
  • Relying on ordering is subtle; if I look at an ordering of repositories, I have no way of knowing or ensuring in advance what names are going to come from what repositories. I can only know in that moment what names are provided by which repositories.
  • Relying on ordering is fragile. There’s no reason to assume that two disparate repositories are not going to have random naming collisions; what happens if I’m using a library from a lower priority repository and then a higher priority repository happens to start having a colliding name?
  • In cases where ordering does the wrong thing, it does so silently, with no feedback given to the user. This is by design because it doesn’t actually know what the wrong or right thing is, it’s just hoping that order will give the right thing, and if it does then users are protected without any breakage. However, when it does the wrong thing, users are left with a very confusing behavior coming from pip, where it’s just silently installing the wrong thing.

There is a variant of this idea which effectively says that it’s really just PyPI’s nature of open registration that causes the real problems, so if we treat all repositories but the “default” one as equal priority, and then treat the default one as a lower priority, then we’ll fix things.

That is true in that it does improve things, but it has many of the same problems as the general ordering idea (though not all of them).

It also assumes that PyPI, or whatever repository is configured as the “default”, is the only repository with open registration of names. However, projects like Piwheels exist which users are expected to use in addition to PyPI, and which also effectively have open registration of names, since they track whatever names are registered on PyPI.

Rely on repository proxies

One possible solution is, instead of having the installer solve this, to depend on repository proxies that can intelligently merge multiple repositories safely. This could provide a better experience for people with complex needs because they can have configuration and features that are dedicated to the problem space.

However, that has been rejected because:

  • It requires users to opt into using them, unless we also remove the facilities to have more than one repository in installers, to force users into using a repository proxy when they need multiple repositories.
    • Removing facilities to have more than one repository configured has been rejected because it would be too disruptive to end users.
  • A user may need different outcomes of merging multiple repositories in different contexts, or may need to merge different, mutually exclusive repositories. This means they’ll need to actually set up multiple repository proxies for each unique set of options.
  • It requires users to maintain infrastructure, or it requires adding features in installers to automatically spin up a repository for each invocation.
  • It doesn’t actually remove the requirement to have a solution to these problems, it just shifts the responsibility of implementation from installers to some repository proxy, but in either case we still need something that figures out how to merge these disparate namespaces.
  • Ultimately, most users do not want to have to stand up a repository proxy just to safely interact with multiple repositories.

Rely only on hash checking

Another possible solution is to rely on hash checking, since with hash checking enabled users cannot get an artifact that they didn’t expect; it doesn’t matter if the namespaces are incorrectly merged or not.

This is certainly a solution; unfortunately it also suffers from problems that make it unworkable:

  • It requires users to opt in to it, so users are still unprotected by default.
  • It requires users to do a bunch of labor to manage their hashes, which is something that most users are unlikely to be willing to do.
  • It is difficult and verbose to get the protection when users are not using a requirements.txt file as the source of their dependencies (this affects build time dependencies, and dependencies provided at the command line).
  • It only sort of solves the problem; in a way it just shifts the responsibility of the problem to be whatever system is generating the hashes that the installer would use. If that system isn’t a human manually validating hashes, which it’s unlikely it would be, then we’ve just shifted the question of how to merge these namespaces to whatever tool implements the maintenance of the hashes.

Require all projects to exist in the “default” repository

Another idea is that we can narrow the scope of --extra-index-url such that its only supported use is to refer to supplemental repositories to the default repository, effectively saying that the default repository defines the namespace, and every additional repository just extends it with extra packages.

The implementation of this would roughly be to require that the project MUST be registered with the default repository in order for any additional repositories to work.

This sort of works if you successfully narrow the scope in that way, but ultimately it has been rejected because:

  • Users are unlikely to understand or accept this reduced scope, and thus are likely to attempt to continue to use it in the now unsupported fashion.
    • This is complicated by the fact that with the scope now narrowed, users who have the excluded workflow no longer have any alternative besides setting up a repository proxy, which takes infrastructure and effort that they previously didn’t have to do.
  • It assumes that just because a name in an “extra” repository is the same as in the default repository, they are the same project. If we were starting from scratch in a brand new ecosystem then maybe we could make this assumption from the start and make it stick, but it’s going to be incredibly difficult to get the ecosystem to adjust to that change.
    • This is a fundamental issue with this approach; the underlying problem that drives dependency confusion is that we’re taking disparate namespaces and flattening them into one. This approach essentially just declares that to be OK, and attempts to mitigate it by requiring everyone to register their names.
  • Because of the above assumption, in cases where a name in an extra repository collides by accident with the default repository, it’s going to appear to work for those users, but they are going to be silently in a state of dependency confusion.
    • This is made worse by the fact that the person who owns the name that is allowing this to work is going to be completely unaware of the role that they’re playing for that user, and might possibly delete their project or hand it off to someone else, potentially allowing them to inadvertently allow a malicious user to take it over.
  • Users are likely to attempt to get back to a working state by registering their names in their default repository as a defensive name squat. Their ability to do this will depend on the specific policies of their default repository, whether someone already has that name, whether it’s too generic, etc. As a best case scenario it will cause needless placeholder projects that serve no purpose other than to secure some internal use of a name.

Move to Globally Unique Names

The main reason this problem exists is that we don’t have globally unique names, we have locally unique names that exist under multiple namespaces that we are attempting to merge into a single flat namespace. If we could instead come up with a way to have globally unique names, we could sidestep the entire issue.

This idea has been rejected because:

  • Generating globally unique but secure names that are also meaningful to humans is a nearly impossible feat without piggybacking off of some kind of centralized database. To my knowledge the only systems that have managed to do this end up piggybacking off of the domain system and refer to packages by URLs with domains etc.
  • Even if we come up with a mechanism to get globally unique names, our ability to retrofit that into our decades-old system is practically zero without burning it all to the ground and starting over. The best we could probably do is declare that all non globally unique names are implicitly names on the PyPI domain name, and force everyone with a non-PyPI package to rename their package.
  • This would upend so many core assumptions and fundamental parts of our current system it’s hard to even know where to start to list them.

Only recommend that installers offer explicit configuration

One idea that has come up is to essentially just implement the explicit configuration and not make any other changes to anything else. The specific proposal for a mapping policy is what actually inspired the explicit configuration option, and described a file that looked something like:

{"repositories":{"PyTorch":["https://download.pytorch.org/whl/nightly"],"PyPI":["https://pypi.org/simple"]},"mapping":[{"paths":["torch*"],"repositories":["PyTorch"],"terminating":true},{"paths":["*"],"repositories":["PyPI"]}]}

The recommendation to have explicit configuration pushes the decision on how to implement that onto each installer, allowing them to choose what works best for their users.

Ultimately only implementing some kind of explicit configuration was rejected because by its nature it’s opt in, so it doesn’t protect average users, who are least capable of solving the problem with the existing tools; by adding additional protections alongside the explicit configuration, we are able to protect all users by default.

Additionally, relying on only explicit configuration also means that every end user has to resolve the same problem over and over again, even in cases like mirrors of PyPI, Piwheels, PyTorch, etc. In each and every case they have to sit there and make decisions (or find some example to cargo cult) in order to be secure. Adding extra features into the mix allows us to centralize those protections where we can, while still giving advanced end users the ability to completely control their own destiny.

Scopes à la npm

There’s been some suggestion that scopes similar to how npm has implemented them may ultimately solve this. Ultimately scopes do not change anything about this problem. As far as I know scopes in npm are not globally unique, they’re tied to a specific registry just like unscoped names are. However what scopes do enable is an obvious mechanism for grouping related projects and the ability for a user or organization on npm.org to claim an entire scope, which makes explicit configuration significantly easier to handle because you can be assured that there’s a whole little slice of the namespace that wholly belongs to you, and you can easily write a rule that assigns an entire scope to a specific non-public registry.

Unfortunately, it basically ends up being an easier version of the idea to only use explicit configuration, which works OK in npm because it’s not particularly common for people to use their own registries, but in Python we encourage you to do just that.

Define and Standardize the “Explicit Configuration”

This PEP recommends that installers have a mechanism for explicit configuration of which repository a particular project comes from, but it does not define what that mechanism is. We purposefully leave that undefined, as it is closely tied to the UX of each individual installer and we want to allow each individual installer the ability to expose that configuration in whatever way they see fit for their particular use cases.

Further, when the idea of defining that mechanism came up, none of the other installers seemed particularly interested in having that mechanism defined for them, suggesting that they were happy to treat that as part of their UX.

Finally, that mechanism, if we did choose to define it, deserves its own PEP rather than being baked in as part of the changes to the repository API in this PEP, and it can be a future PEP if we ultimately decide we do want to go down the path of standardization for it.

Acknowledgements

Thanks to Trishank Kuppusamy for kick-starting the discussion that led to this PEP with his proposal.

Thanks to Paul Moore, Pradyun Gedam, Steve Dower, and Trishank Kuppusamy for providing early feedback and discussion on the ideas in this PEP.

Thanks to Jelle Zijlstra, C.A.M. Gerlach, Hugo van Kemenade, and Stefano Rivera for copy editing and improving the structure and quality of this PEP.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.


Source: https://github.com/python/peps/blob/main/peps/pep-0708.rst

Last modified: 2025-02-01 08:55:40 GMT

