coder/coder (Public)

RFC: Multi-tenancy #7638

ammario started this conversation in RFCs

ammario

May 22, 2023

· 3 comments· 7 replies

ammario
May 22, 2023
Maintainer

Coder deployment scale is frequently blocked by our lack of user management features. For example, it's impossible to have multiple groups of templates with distinct, isolated admins (#7633). It's also impossible to have two sets of users with no knowledge of each other and no ability to interact, a must-have for security and compliance.

A common workaround is standing up multiple Coder deployments. Not only does this bring more administrative burden to the operators, it makes it more difficult for us to license our software, threatening our revenue and thus our sustainability as a project.

This issue is meant to be a long-running conversation on multi-tenancy, a potential solution to our user management problems as well as the path towards delivering Coder as SaaS.

Challenges

This is an incomplete list of the major engineering work required to support multi-tenancy:

Our current code-first approach to deployment configuration is challenging to adapt to a multi-tenant environment. We have several control plane options, such as --disable-password-auth, which should be org-scoped.
OIDC providers are globally registered, whereas in a multi-tenant system, they would need to be registered per organization. We may still need a concept of global providers; for instance, in a SaaS setting, we might have global SSO providers for GitHub and Google.
We need to build out the UI/CLI/API that exposes Organizations to users.
A decision must be reached on whether user accounts can belong to multiple organizations (site-namespaced) or should be limited to one (org-namespaced).
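
As an illustration of the first two challenges, here is a minimal sketch of how an org-scoped setting could fall back to a deployment-wide default, and how global and per-organization OIDC providers could be merged. It is not Coder's implementation: the `Deployment` and `Organization` types, the `disable_password_auth` key, and the provider names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    name: str
    # Per-organization overrides; a missing key means "inherit the default".
    overrides: dict = field(default_factory=dict)
    # OIDC providers registered only for this organization (hypothetical).
    oidc_providers: list = field(default_factory=list)


@dataclass
class Deployment:
    # Deployment-wide defaults, i.e. what a global flag sets today.
    defaults: dict
    # Providers shared by every organization (the SaaS case in the post).
    global_oidc_providers: list


def effective_setting(deployment, org, key):
    """Return the org's override if present, else the deployment default."""
    if key in org.overrides:
        return org.overrides[key]
    return deployment.defaults[key]


def login_providers(deployment, org):
    """Global providers first, then the org's own, without duplicates."""
    merged = []
    for provider in deployment.global_oidc_providers + org.oidc_providers:
        if provider not in merged:
            merged.append(provider)
    return merged
```

With this shape, an organization that sets nothing behaves exactly like today's single-tenant deployment, which is one way to keep existing installations unchanged.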

Next Steps

Collect feedback on the demand for multi-tenancy / SaaS
Complete documenting the necessary engineering work

Replies: 3 comments 7 replies

spikecurtis
May 23, 2023
Collaborator

Provisioner isolation is also a challenge.

Our current (single-tenant) recommendation is that operators deploy a runtime environment for provisionerd (often combined with coderd in practice) that is credentialed to access the IaaS APIs used to create workspaces. While this could work for multi-tenancy, it means operators would deploy multiple provisionerd environments, one per organization. This replicates the problem with standing up multiple Coder deployments (albeit with fewer moving parts).

If we want to ease that burden, then Coder must build a "multi-tenant" provisionerd that provides at least:

Strong isolation between provisioner jobs from different organizations
Secure injection of IaaS credentials from organization- or template-scope to provisioner jobs

3 replies

ammario May 23, 2023
Maintainer Author

I see this as an advantage of the multi-tenant system, not a challenge. Multiple organizations guarantee provisioner isolation, and, within an org, we will have "provisioner groups" that provide isolation. In an enterprise or SaaS setting, two organizations, even with malicious intent for the other, can safely co-exist on the deployment.

Of course, two orgs that want to unify provisioning infrastructure can both deploy provisioners alongside each other.

spikecurtis May 24, 2023
Collaborator

> Multiple organizations guarantee provisioner isolation, and, within an org, we will have "provisioner groups" that provide isolation.

I don't fully understand this sentence. What are the technical means that provide the isolation?

Like, if the isolation mechanism is separate deployments, I don't get why we're saying separate Coder deployments is a significant administrative burden, but separate provisioner deployments is not.

ammario May 24, 2023
Maintainer Author

> I don't fully understand this sentence. What are the technical means that provide the isolation?

Templates in one organization cannot use the same provisioner as templates in another organization.
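
The rule stated above can be sketched as a job queue in which a provisioner only ever acquires jobs from its own organization. This is a simplified model under assumed names (`Job`, `Provisioner`, `acquire_job`), not provisionerd's actual acquisition logic, and it says nothing about how the processes themselves are isolated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    id: int
    org_id: str


@dataclass(frozen=True)
class Provisioner:
    name: str
    org_id: str


def acquire_job(provisioner, queue):
    """Pop and return the oldest queued job from the provisioner's own
    organization, or None if no such job is waiting."""
    for index, job in enumerate(queue):
        if job.org_id == provisioner.org_id:
            return queue.pop(index)
    return None
```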

> Like, if the isolation mechanism is separate deployments, I don't get why we're saying separate Coder deployments is a significant administrative burden, but separate provisioner deployments is not.

I see these pains as many of the same ones that would afflict us in SaaS. While there is tenant isolation, we would want to unify operations around the control plane, esp. geo-distribution, security, disaster recovery / backups, and deployability. It would be quite cumbersome to run different instances of the coder process and different databases for each organization.

There is also the licensing benefit. It's very difficult for us to enforce seat limits when an enterprise is using multiple deployments.

smolinari
May 23, 2023

I might be in left field with this, but I'd like to suggest that Coder just be a "dumb" server and allow clients to command it to do its work via API only. By that I mean, leave the burden and the responsibility of user management and access control, etc. all to the client. I don't believe you'll ever find a one size fits all formula to make everyone happy in terms of multi-tenancy without ending up having to build out a full blown platform yourselves.

Coder should just be a service inside the (client's) platform and offer an API to handle the things Coder can do best. The UI itself should just be for admins and not (really) for the devs. Making the platform for their devs should be their job. A great example of the same kind of mentality can be found with the suite of Argo tools. ArgoCD, Workflows, etc.

The "dev UI" experience Coder offers, at best, should only be a simple example of what is possible for the platform engineers to build themselves and also for demoing. There should be a clear "break" between admin and dev operations too i.e. between template management and using workspaces, which currently isn't the case AFAICT.

Do correct me if I am wrong.

My 2 cents.... 😁

Scott

0 replies

szab100
May 23, 2023

+1 for @smolinari's thoughts above; that is how we use Coder v2 OSS today: users are spread across multiple Coder deployments, user & token management is done entirely via the API, and groups, access control, and high-level features are provided by our services. I also agree that there is no "one size fits all" formula, especially when it comes to enterprises with often very complex internal user/team structures, policies, processes, and a variety of existing developer tools / ecosystems to integrate with seamlessly.

However, the proposed changes above are perfectly fine! Users who need such isolated Orgs capability should benefit from them.

One suggestion, though: we found --disable-password-auth to be useless, since it blocks user management via the API when enabled (while it should just disallow password-based auth). Our workaround is to create users with long, generated passwords that we never share with them. Instead of making it org-scoped, consider (also) making it user-scoped for more flexibility, e.g. allow creating users with a disabled password and/or allow disabling the password for existing users (through API / CLI / UI).

4 replies

Emyrk Jun 13, 2023
Collaborator

@szab100 If we were to add a LoginTypeNone on a user-by-user basis that prevents logging in via any means (OIDC, password, etc.), would that satisfy what you are looking for?

A follow-up question: once this user is given an API key/token (via some other means, like an admin making one for them), should that user then be able to generate more tokens/keys? This is not a hard problem to solve; I'm just curious about your take.

szab100 Jun 16, 2023

Hey @Emyrk, thanks for the reply. Yes, that would solve the problem (not a huge one, really, since passwords are never given to users).

The token generation is indeed an interesting question. If the token's scope is all and the user's Role has permissions on ResourceAPIKey, then probably yes, it should allow generating new tokens/keys. So this looks like an RBAC / AuthZ question rather than whether login is allowed or not.
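
To make that separation concrete, here is a minimal sketch in which "may this user log in?" and "may this token mint new tokens?" are independent checks. The `User` and `Token` types, the `("api_key", "create")` permission pair, and the narrower scope name used in the test are illustrative assumptions, not Coder's RBAC model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    login_type: str              # e.g. "password", "oidc", or "none"
    role_permissions: frozenset  # (resource, action) pairs granted by roles


@dataclass(frozen=True)
class Token:
    owner: User
    scope: str  # "all", or some narrower (hypothetical) scope name


def can_login_interactively(user):
    """A user with login type "none" cannot authenticate by any means."""
    return user.login_type != "none"


def can_create_token(token):
    """Token creation depends only on the token's scope and the owner's
    role permissions, never on the owner's login type."""
    if token.scope != "all":
        return False
    return ("api_key", "create") in token.owner.role_permissions
```

Under this model, a login-disabled service user that was handed an all-scoped token can still mint further tokens, unless its roles withhold the API key permission.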

This does not really affect us, since we run a reverse proxy in front of the Coder deployments, which selects & decrypts the user's session token (scope='all') from our app's combined Cookie value (a b64-encoded JSON map of the user's encrypted tokens for multiple Coder deployments). After decrypting the token (with a key unique to the Coder deployment), the rev-proxy sets the cleartext token as the Coder-Session-Token header, but only for a set of whitelisted "safe" request patterns needed for accessing workspaces, e.g. various GET Coder API endpoints, like user info, workspace agents & connections (/coordinate, /pty, etc.). The only POST we allow for user requests is /api/v2/authcheck 🙈 And we always issue user tokens through our API (which returns encrypted tokens) using an admin user token, so, similarly to user passwords, we never hand out cleartext Coder tokens to our users either.
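
The allowlisting step described above could look roughly like the sketch below. The path patterns are placeholders (the comment does not list the real set), and `decrypt` stands in for whatever per-deployment decryption the proxy performs; only the Coder-Session-Token header name comes from the comment itself.

```python
import re

# Hypothetical allowlist of (method, path pattern) pairs.
ALLOWED = [
    ("GET", re.compile(r"^/api/v2/users/me$")),
    ("GET", re.compile(r"^/api/v2/workspaceagents/[^/]+/(coordinate|pty)$")),
    ("POST", re.compile(r"^/api/v2/authcheck$")),
]


def is_allowed(method, path):
    """True only if the method and path match an allowlisted pattern."""
    return any(m == method and rx.match(path) for m, rx in ALLOWED)


def outgoing_headers(method, path, encrypted_token, decrypt):
    """Headers the proxy adds to the upstream request. The cleartext token
    is attached only to allowlisted requests; otherwise nothing is added."""
    if not is_allowed(method, path):
        return {}
    return {"Coder-Session-Token": decrypt(encrypted_token)}
```

Because the decision is made before decryption, a request outside the allowlist never causes the cleartext token to exist in the proxy's request path at all.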

Speaking of tokens, we are working on a feature called "workspace sharing" (multi-tenancy on workspace level 😄, primarily for pair-programming / troubleshooting) where users can share a single workspace with guests (connect using ssh + access workspace apps). We can do this on our end by handing out the workspace owner's regular (scope=full), encrypted session token with some extra fields in our Cookie's JSON to restrict guest access to a given workspace name / enforce shorter expiration or revocation (+ a signature to avoid user tampering). But in order to limit access to a single workspace, our rev-proxy needs to read into Coder's DB (with heavy caching, since workspace & agent IDs are immutable) using Coder's internal API (e.g. coderd's database.Store & workspaceapps.Provider) to look up workspace details (name / ID) from ws-agent or ws-apps types of requests. We have a working POC for this, but thought it would be both easier & more efficient to simply pass a workspace-scoped token through with the requests to Coder, so that Coder would limit access to the single workspace the token is related to. We actually thought of contributing this small feature. The problem is that even with such workspace-scoped tokens, we can't avoid looking up workspace details for incoming requests: when UserA (guest) wants to access UserB's shared workspace, we need to know which workspace the requests are related to, e.g. whether to set the guest token (to access UserB's shared workspace) OR UserA's normal access token (to access their own workspaces) on the outgoing Coder requests.

Maybe a better approach to support "workspace sharing" would be letting users explicitly grant temporary access to their workspaces to other Coder users, e.g. UserA authorizes UserB to access Workspace-X for 30 mins (or until revoked), so UserB is granted access to UserA's Workspace-X using UserB's own (full) session token. This is trickier than the workspace-scoped user tokens explained above, so it is probably only worth it if you want to build out a complete Workspace Sharing feature for Coder. But then guest workspace access can be handled properly, like displaying shared workspaces in the UI (similar to GSuite's "Shared with me" tab), proper Audit log entries, etc. Maybe in the case of SSH, even spawn guest shells with a differentiated guest_user (configured with the agent) in the Workspace, e.g. guest on Linux K8S Pods, so customers can set up workspace permissions accordingly, like denying 'sudo' / setting restricted / jailed shells and preventing guests from reading the owner's home dir (we use /workspace as the PVC mount for the work area, but /home can still contain temp credentials, etc.), while giving write access to the work area to both the owner & guest and running web-based workspace apps with dedicated users instead of the owner's.
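
A temporary grant of this kind can be modelled as a record with an expiry and a revocation flag. The sketch below is only an illustration of the idea; the `Grant` type and its field names are hypothetical, and the caller supplies the current time as a Unix timestamp.

```python
from dataclasses import dataclass


@dataclass
class Grant:
    workspace_id: str
    guest_id: str
    expires_at: float  # Unix timestamp after which the grant is void
    revoked: bool = False


def is_active(grant, now):
    """A grant is usable until it expires or the owner revokes it."""
    return not grant.revoked and now < grant.expires_at


def guest_can_access(grants, workspace_id, guest_id, now):
    """True if any active grant covers this guest and workspace."""
    return any(
        g.workspace_id == workspace_id
        and g.guest_id == guest_id
        and is_active(g, now)
        for g in grants
    )
```

Since the grant names the guest rather than carrying a copy of the owner's token, the guest keeps using their own session token and audit entries can attribute actions to the right user.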

Or a third (simplest) option would be to further improve "Application Scoped API Keys", so that those API keys can be requested to limit access to a single workspace's single application, and to have the /api/v2/applications/auth-redirect endpoint support the required additional query params (e.g. workspace_id and app_name), so that the redirect URI and the encrypted token it returns will only allow the guest user(s) to access that particular app / workspace. In this case, we could simply give out these redirect URLs to guest users, giving them access to VSCode only (which we can run with a restricted user instead of the main workspace user). This endpoint should also return an HTTP header containing the created app-scoped token's ID, which callers can store in their DB and use to delete the token when the user wants to "unshare" the workspace. That would kill guest access (except for guests who were already issued a "SignedToken", unless Coder keeps a list of invalidated JWTs!). This would be equivalent to Gitpod's link-based "live workspace sharing" feature, which fulfills our requirements.

Hmm, actually, I just realized that this 3rd option (only giving guests access to VSCode Web) is something we can already solve with the rev-proxy solution without any DB lookups, by adding a "guest_tokens" JSON map to our session Cookie, where the keys are the FQDN application subdomains of the shared workspaces' VSCode apps. When we process a request, if it's a subdomain request (host != Coder access URL), we simply need to check whether the user has a "guest token" for that subdomain (it's actually the owner's token, still encrypted, with the subdomain / expiration signed so users can't modify them). If so, we pass the "guest token" for those requests and use the user's regular token for all other (permitted) requests, so guests can still fully access their own workspaces. There is no problem with SignedTokens either, as we can maintain a global guest token revocation list: as soon as our rev-proxy finds out that the guest token was revoked, it can simply omit the app's JWT Cookie from downstream requests ==> the guest access is killed immediately.
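
The token selection described above reduces to a lookup keyed by request host. The following sketch assumes a `Session` object holding the user's own token plus a `guest_tokens` map, and a set of revoked guest tokens; signature checks, decryption, and cookie handling are left out, and all names apart from "guest_tokens" are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    user_token: str
    # App subdomain (FQDN) -> guest token for that shared workspace app.
    guest_tokens: dict = field(default_factory=dict)


def pick_token(session, host, access_host, revoked):
    """Choose which token the proxy forwards for a request to `host`."""
    if host == access_host:
        # Not a subdomain request: always the user's own token.
        return session.user_token
    guest = session.guest_tokens.get(host)
    if guest is not None and guest not in revoked:
        return guest
    # No usable guest token for this subdomain: fall back to the user's own.
    return session.user_token
```

No database lookup is needed because the subdomain already identifies the shared app, which is the point made in the comment.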

So sorry for the super lengthy comment (and thanks for being my rubber duck.. 🤣)! Hope at least some of this is useful for you to further improve Coder. We should actually be good without any changes, while being able to narrow APIKeys' scope down to either a single workspace and/or a single [workspace + app] would probably be beneficial to many users, but all up to you. ✌️

Emyrk Jun 22, 2023
Collaborator

> Yes, that would solve the problem

Not a huge problem, but it is now possible to set LoginTypeNone!
#8009

> The token generation is indeed an interesting question. If the token's scope is all and the user's Role has permissions on ResourceAPIKey, then probably yes, it should allow generating new tokens/keys. So this looks like an RBAC / AuthZ question rather than whether login is allowed or not.

Exactly what I was thinking. I mainly asked this to check if your situation required a new scope to be added, and I could get that in for you quickly. If this is not a blocker though, then we do not need to do this now.

> ... shared workspaces ...

This is a feature we have considered. I personally think it would be pretty neat, but coming up with the implementation is tricky because of all the things you have mentioned. I imagine we would need to pair it with something in the agent/Terraform to pass the logged-in "user" into the workspace with some controlled permission.

> while being able to narrow APIKeys' scope down to either a single workspace and/or a single [workspace + app] would probably be beneficial to many users, but all up to you.

This is interesting because we actually had this in the initial version of the RBAC.

#3426

The description there gives a more high-level reason why it was done, but there were some technical reasons as well. In short, we can always answer the question "List all workspaces I have access to, 50 at a time for pagination" quickly because of this simplification. Essentially, we take the rego policy (our authz language) and convert it to SQL, and then we can use LIMIT and OFFSET to paginate. It's clean and great.

If we add resource-specific permissions, we have the problem that the SQL query now grows without bounds. If you have 200 workspaces shared with you, the query has to include: AND id = ANY(1,2,3,4...,200). This is probably solvable, since our "without bounds" is still likely some reasonable number that Postgres can handle, but it is unfortunate that our query size for list calls could blow up.
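
The growth is easy to see by building both query shapes side by side. This sketch only constructs SQL text with `%s` placeholders; the table and column names are illustrative, and it does not reproduce how Coder actually converts rego to SQL.

```python
def role_based_query(user_id, limit, offset):
    """Access follows from ownership/roles: the query text has a fixed size."""
    sql = "SELECT id FROM workspaces WHERE owner_id = %s ORDER BY id LIMIT %s OFFSET %s"
    return sql, [user_id, limit, offset]


def per_workspace_query(user_id, shared_ids, limit, offset):
    """Access granted workspace by workspace: every shared ID is inlined,
    so the query text grows with the number of shares."""
    where = "owner_id = %s"
    params = [user_id]
    if shared_ids:
        where += " OR id IN (" + ", ".join(["%s"] * len(shared_ids)) + ")"
        params += list(shared_ids)
    sql = f"SELECT id FROM workspaces WHERE {where} ORDER BY id LIMIT %s OFFSET %s"
    return sql, params + [limit, offset]
```

With 200 shared workspaces, the second query carries 203 placeholders against 3 for the first, while both still paginate with the same LIMIT and OFFSET.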

The second technical reason that assigning perms to a single workspace is difficult is that it makes discovering who can access a workspace exceptionally hard. To answer the question "who can access workspace x", I now need to query all users and check their roles. Then, maybe in your case, also query all API keys, since the "user" might be able to, but they may have no API keys that can.

This lack of discoverability is bad from a UX POV, but also from a management POV. How do we ensure credentials are not hanging around and your workspace does not have some backdoor access you forgot to revoke?

Now, workspace sharing is still something we want, and we need a solution. The current idea, proposed by yours truly (me 😄), is to use ACLs on the workspace itself.

We actually do this for templates.

coder/coderd/database/dump.sql

Lines 566 to 567 in a353043

	user_acl jsonb DEFAULT '{}'::jsonb NOT NULL,
	group_acl jsonb DEFAULT '{}'::jsonb NOT NULL,

This solves the two technical problems above.

The SQL query is bounded because the query says AND user_id = ANY(template.user_acl) (you have to deal with jsonb, but that's the idea). The number of shared users does not affect the SQL query length. There is another benefit in how we convert rego -> SQL, but it's not worth getting into here.
Discoverability is easy, as it is tied to the workspace. If you can fetch the workspace, you have the ACL list of everyone besides the owner who has access.
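
Both properties can be illustrated with an in-memory model of a workspace that carries its own ACLs. The sketch assumes each ACL maps a user or group ID to a list of allowed actions, mirroring the two jsonb columns quoted above; the `Workspace` type and helper names are hypothetical, and a real implementation would do the filtering in SQL.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    id: str
    owner_id: str
    # Principal ID -> list of allowed actions (assumed shape).
    user_acl: dict = field(default_factory=dict)
    group_acl: dict = field(default_factory=dict)


def can_access(ws, user_id, user_groups, action="read"):
    """The check depends only on the caller and this one row."""
    if user_id == ws.owner_id:
        return True
    if action in ws.user_acl.get(user_id, []):
        return True
    return any(action in ws.group_acl.get(g, []) for g in user_groups)


def who_can_access(ws):
    """Everyone besides the owner, read straight off the workspace."""
    return {"users": sorted(ws.user_acl), "groups": sorted(ws.group_acl)}


def list_accessible(workspaces, user_id, user_groups, limit, offset):
    """Same filter for every caller, then LIMIT/OFFSET-style paging."""
    visible = [w.id for w in workspaces if can_access(w, user_id, user_groups)]
    return visible[offset:offset + limit]
```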

So I think workspace sharing would work as an ACL list on the workspace itself. For your application, that means you need both the workspace and the user to know whether access can be granted.

Emyrk Jun 22, 2023
Collaborator

> So sorry for the super lengthy comment

Love long comments. Seems you are doing some really cool stuff on top of Coder!
