Kubernetes (often abbreviated k8s) is an open-source system for automating the deployment, scaling, and management of applications running in containers. Kubernetes was selected in 2015 by the Cloud Services team as the replacement for Grid Engine in the Toolforge project.[1] Usage of k8s in Tools began in mid-2016,[2] and the current cluster design dates back to early 2020.[3]
This document describes the Kubernetes cluster used in Toolforge and its direct support services (e.g. etcd). It does not cover specifics about services running in the cluster (e.g. the Jobs service and build service), nor does it cover Toolforge services that are fully unrelated to the Kubernetes cluster (e.g. Redis).
The four main sections of this document correspond to the four categories of documentation in The Grand Unified Theory of Documentation system, in a structure inspired by how the Tor Project Admins do it.
kubectl

kubectl is the official Kubernetes command line interface tool. Assuming you are listed as a maintainer of the admin tool (or the toolsbeta equivalent), you will automatically have superuser credentials provisioned in your NFS home directory.
To use the CLI tool, log in to a bastion host on the project where the cluster you want to interact with is located. If you just want to experiment, use the toolsbeta cluster. Most read-only commands can be used out of the box, for example to list pods in the tool-fourohfour namespace used by the 404 handler:
$ kubectl get pod -n tool-fourohfour
NAME                          READY   STATUS    RESTARTS   AGE
fourohfour-7766466794-gtpgk   1/1     Running   0          7d20h
fourohfour-7766466794-qctt8   1/1     Running   0          6d18h
However, all write actions and some read-only actions (e.g. interacting with nodes or secrets) will give you a permission error:
$ kubectl delete pod -n tool-fourohfour fourohfour-7766466794-gtpgk
Error from server (Forbidden): pods "fourohfour-7766466794-gtpgk" is forbidden: User "taavi" cannot delete resource "pods" in API group "" in the namespace "tool-fourohfour"
If you're sure you want to continue, you need to use kubectl sudo:
$ kubectl sudo delete pod -n tool-fourohfour fourohfour-7766466794-gtpgk
pod "fourohfour-7766466794-gtpgk" deleted
kubectl sudo, as the name implies, really has full access to the entire cluster. You should only use it when you need to do something that your normal account does not have access to.

Pods are the basic unit of compute in Kubernetes. A pod consists of one or more OS-level containers that share a network namespace.
Pods can be listed with the kubectl get pod command. Log in to a toolsbeta bastion, become fourohfour and run:
$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
fourohfour-bd4ffc5ff-479sj   1/1     Running   0          43s
fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          35s
The -o (--output) flag can be used to customize the output. For example, -o wide will display more information:
$ kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP                NODE                              NOMINATED NODE   READINESS GATES
fourohfour-bd4ffc5ff-479sj   1/1     Running   0          91s   192.168.120.158   toolsbeta-test-k8s-worker-nfs-1   <none>           <none>
fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          83s   192.168.145.16    toolsbeta-test-k8s-worker-nfs-2   <none>           <none>
Or -o json will display the data in JSON:
$ kubectl get pods -o json | head -n 5
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "v1",
So far we have only been accessing data in the namespace we are in. To access data in any namespace, we need to switch back to our user account. Now we can use the -n (--namespace) flag to specify which namespace to access.
$ kubectl get pod -n tool-fourohfour
NAME                         READY   STATUS    RESTARTS   AGE
fourohfour-bd4ffc5ff-479sj   1/1     Running   0          4m27s
fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          4m19s
$ kubectl get pod -n tool-admin -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
admin-cb6d84bd8-pshh7   1/1     Running   0          5d21h   192.168.25.21   toolsbeta-test-k8s-worker-nfs-4   <none>           <none>
Or we can use -A (--all-namespaces) to list data in the entire cluster:
$ kubectl get pod -A | head -n 5
NAMESPACE          NAME                                 READY   STATUS    RESTARTS   AGE
api-gateway        api-gateway-nginx-6ddddd6f64-mbnlg   1/1     Running   0          12d
api-gateway        api-gateway-nginx-6ddddd6f64-tdl6c   1/1     Running   0          8d
builds-admission   builds-admission-7897cf7759-jtxb5    1/1     Running   0          28h
builds-admission   builds-admission-7897cf7759-nvmzt    1/1     Running   0          26h
To view the combined standard output and standard error for a pod, use kubectl logs:
$ kubectl get pod -n maintain-kubeusers
NAME                                  READY   STATUS    RESTARTS   AGE
maintain-kubeusers-55b649885c-px8c6   1/1     Running   0          87m
$ kubectl logs -n maintain-kubeusers maintain-kubeusers-55b649885c-px8c6 | wc -l
176
Some useful flags for this command are:
--tail NUMBER to only show the specified number of most recent lines
--follow to do, well, exactly what it says

The main Toolforge cluster consists of a bit over 50 "normal" NFS-enabled workers, and some special workers used for specific purposes. These workers can be added and removed using cookbooks. Both adding and removing a node is fairly straightforward. However, because replacing every node takes a long time, during most routine maintenance (e.g. Kubernetes upgrades or node reboots) we prefer to update existing nodes in place instead of replacing the entire cluster. It is however totally fine to replace nodes in Toolsbeta if you want to try the process.
These cookbooks can be run from the cloudcumin hosts (recommended) or from your laptop if you have them set up locally. Use of screen or tmux is recommended.
To create a normal worker_nfs in toolsbeta, use:
$ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name toolsbeta --role worker_nfs

Removing a worker is equally straightforward. To remove the oldest worker_nfs node in toolsbeta, use:
$ sudo cookbook wmcs.toolforge.remove_k8s_node --cluster-name toolsbeta --role worker_nfs

If you have a specific node that you want to remove, pass that as a parameter:
$ sudo cookbook wmcs.toolforge.remove_k8s_node --cluster-name toolsbeta --role worker_nfs --hostname-to-remove toolsbeta-test-k8s-worker-nfs-1

Sometimes a node is misbehaving or needs maintenance done on it, and needs to be drained of all workload. This is easiest done with the cookbook:
$ sudo cookbook wmcs.toolforge.k8s.worker.drain --cluster-name toolsbeta --hostname-to-drain toolsbeta-test-k8s-worker-nfs-1

To "uncordon" the node (allow new pods to be scheduled on it again), run the following on a bastion in the relevant project:
$ kubectl sudo uncordon toolsbeta-test-k8s-worker-nfs-1
node/toolsbeta-test-k8s-worker-nfs-1 uncordoned
You can also just "cordon" a node, which will prevent new workloads from being scheduled on it but won't drain existing ones:
$ kubectl sudo cordon toolsbeta-test-k8s-worker-nfs-1
node/toolsbeta-test-k8s-worker-nfs-1 cordoned
That is also reversed with the uncordon command.
We have not built a new cluster since the 2020 cluster redesign. The documentation written during the 2020 redesign is at Portal:Toolforge/Admin/Kubernetes/Deploying, although it is likely somewhat outdated.
Kubernetes upstream releases new versions about three times a year.[4] Kubernetes does not support skipping minor versions, so we must upgrade sequentially. This process is documented at Portal:Toolforge/Admin/Kubernetes/Upgrading Kubernetes.
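Before starting a sequential upgrade, it can help to confirm what the control plane currently runs and what kubeadm considers a valid next target. A minimal sketch (requires cluster access; run the second command on a control plane node):

```shell
# Show the client and server versions currently in use.
kubectl version

# Ask kubeadm which versions are available to upgrade to next;
# it only offers targets within the supported version skew.
sudo kubeadm upgrade plan
```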
The Calico upgrade procedure is roughly as follows:
Check the release-notes folder and review the changelog; repeat this step for any versions we are skipping past. Then copy the images to our registry:

$ sudo cookbook wmcs.toolforge.k8s.calico.copy_images_to_registry --task-id "$TASK_ID" --calico-version "$TARGET_VERSION"
Update the kubeVersion statements in toolforge-deploy.git.

The ingress-nginx upgrade procedure is roughly as follows:
$ sudo cookbook wmcs.toolforge.k8s.image.copy_to_registry --task-id "$TASK_ID" --origin-image "registry.k8s.io/ingress-nginx/controller:v$NEW_VERSION@sha256:$HASH" --dest-image-name "nginx-ingress-controller" --dest-image-version "v$NEW_VERSION"
Deploy and test the change on local and toolsbeta (example), then deploy to tools. (TODO: example)

We have upgraded the cluster OS once, from Buster to Bookworm, and at the same time changed the container runtime from Docker to containerd.[6] There is no set process or specific automation for this, but the approach taken last time was:
The wmcs.toolforge.k8s.reboot cookbook can be used to reboot the entire cluster, for example to apply kernel or container runtime updates, or in case the NFS server is having issues. Start by reading the --help output for the cookbook. For example, in the NFS issue case in toolsbeta, you could run:
$ sudo cookbook wmcs.toolforge.k8s.reboot --cluster-name toolsbeta --all-workers

All kubeadm-managed kubelet client certs are rotated when upgrading Kubernetes, but they can be manually rotated with kubeadm as well. It is possible to configure the kubelet to request upgraded certs on its own when they near expiration. So far, we have not set this flag in the config, expecting our upgrade cycle to be roughly six months.
To renew certificates, run the wmcs.toolforge.k8s.kubeadm_certs_renew cookbook. You would usually run the cookbook for all control nodes in a given cluster. The cookbook is idempotent and can safely be run at any time. For example, to renew the certs on all toolsbeta control nodes, you would run:
$ sudo cookbook wmcs.toolforge.k8s.kubeadm_certs_renew --project toolsbeta --control-hostname-list toolsbeta-test-k8s-control-4 toolsbeta-test-k8s-control-5 toolsbeta-test-k8s-control-6

See also the upstream docs: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal
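To see when the kubeadm-managed certificates expire, before or after running the cookbook, kubeadm can print an expiration report. A sketch, assuming it is run on a control plane node:

```shell
# List all kubeadm-managed certificates and their expiry dates.
sudo kubeadm certs check-expiration
```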
We run etcd from the Debian packages, so an etcd upgrade is automatically a Debian upgrade and vice versa.
We have not upgraded etcd since the 2020 cluster redesign. This section should be filled in when we do that for the first time.
In the Toolforge Kubernetes component workflow improvements enhancement proposal we introduced a standard "components" system for the various components that run in the Kubernetes cluster.
This is documented in the toolforge-deploy.git README file.
Tool quotas are managed by maintain-kubeusers and configured in the values file in toolforge-deploy.git.[7] To update quotas for a specific tool:
In case something goes wrong with the credentials for a certain tool user, you can delete the maintain-kubeusers configmap, which will cause maintain-kubeusers to re-generate the credentials for that user. On a bastion in the relevant project, run:
$ kubectl sudo delete cm -n tool-$TOOL maintain-kubeusers
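maintain-kubeusers should recreate the configmap on its next pass. A sketch for verifying that the credentials were re-generated (output omitted; requires cluster access):

```shell
# The configmap should reappear once maintain-kubeusers has
# re-generated the credentials for the tool.
kubectl sudo get cm -n "tool-$TOOL" maintain-kubeusers
```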
Please have a look at the logs for maintain-kubeusers and file a bug so the issue can be fixed.
Requests for observer access must be approved by the Toolforge admins in a Phabricator task. Once approved, they can be implemented on a control plane node with:
$ sudo -i wmcs-enable-cluster-monitor <tool-name>

The Kubernetes capacity alert runbook documents how to find where a sudden increase in workload has come from.
Because all tools running on a single worker share that worker's IP address, you occasionally need to figure out which tool on a given worker is misbehaving. That process is documented on Portal:Toolforge/Admin/Kubernetes/Pod tracing.
This has been moved to the Jobs Service documentation.
| Repository | Related service | Description |
|---|---|---|
| builds-admission | Build Service | Validate build service user-created pipelines |
| envvars-admission | Envvars Service | Inject configured envvars to pods |
| ingress-admission | Webservice | Validate created ingress objects use the domain allowed for that tool |
| registry-admission | Jobs Service | Validate new pods use images in the Toolforge docker registry or Harbor |
| volume-admission | Jobs Service | Inject NFS and configuration file mounts to pods that are configured to have them |
maintain-kubeusers is responsible for creating Kubernetes credentials and a namespace (tool-[tool name]) for each tool, and for removing access for disabled tools. It is also in charge of maintaining quotas and other resources for each tool (such as Kyverno security policies). In addition, it creates admin credentials for all maintainers of the admin tool.
The service is written as a long-running daemon, and it talks to LDAP directly for tool data. It exports Prometheus metrics, but those are not used for any alerts or dashboards at this moment.
Kubelet has two certs: a serving certificate and a client certificate.
At this time the serving certificate is a self-signed one managed by kubelet, which should not need manual rotation. Proper, CA-signed rotating certs are stabilizing as a feature set in Kubernetes 1.17, and we should probably switch to that for consistency and as a general improvement. The client cert of kubelet is signed by the cluster CA and expires in one year. It is automatically renewed when the cluster is upgraded, but can be renewed manually as well.
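To check when a given node's kubelet client cert expires, you can inspect the current certificate on the node itself. A sketch, assuming the default kubeadm certificate path under /var/lib/kubelet/pki/:

```shell
# Print the expiry date of the kubelet client certificate
# (path assumes a standard kubeadm-managed node).
sudo openssl x509 -noout -enddate \
    -in /var/lib/kubelet/pki/kubelet-client-current.pem
```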
Some tools (e.g. k8s-status) need more access to the Kubernetes API than the default credentials provide. For these tools, an "observer" role has been created that grants read-only access to non-sensitive data about the cluster and the workloads that run on it.[8] The role is deployed from a file in Puppet (although phab:T328539 proposes moving it to maintain-kubeusers), and role bindings are created manually using a script.
Using observer status in a job with serviceAccountName: ${tool}-obs is not supported by the Jobs service or webservice. The k8s-status tool uses a custom script for managing a web service with such access included.
Requests for such access should be approved by the Toolforge admins before access is granted.
The main thing worth backing up is the contents of the etcd cluster. It is not currently backed up.
The Toolforge bastion nodes have kubectl installed. As the bastion nodes have NFS mounts, and maintain-kubeusers provisions certificates to NFS, everything works out of the box.
The Kubernetes documentation is both more detailed and more up to date. Here is, however, a quick overview of the major Kubernetes components.
Kubernetes stores all state in etcd; all other components are stateless. The etcd cluster is only accessed directly by the API server and no other component. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is firewalled off to only be reachable by the rest of the control plane nodes as well as the etcd nodes, client certificate verification is used for authentication (Puppet is the CA), and secrets are encrypted at rest in our etcd setup.
We currently use a 3-node cluster, hosted on VMs separate from the main control plane. They are all smallish Debian Buster instances configured largely by the same etcd Puppet code we use in production. The main interesting thing about them is that they are local-disk instances, as etcd is rather sensitive to iowait.
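Cluster health can be checked with etcdctl from one of the etcd nodes. A sketch, assuming the v3 API; the endpoint and certificate paths below are illustrative assumptions, not the exact paths our Puppet code provisions:

```shell
# Query the health of an etcd endpoint over TLS
# (endpoint and cert paths are illustrative).
sudo ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/etcd/ssl/ca.pem \
    --cert=/etc/etcd/ssl/client.pem \
    --key=/etc/etcd/ssl/client-key.pem \
    endpoint health
```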
The API server is the heart of the Kubernetes control plane. All communication between all components, whether they are internal system components or external user components, must go through the API server. It is purely a data access layer, containing no logic related to any of the actual end-functionality Kubernetes offers. It offers the following functionality:
When you are interacting with the Kubernetes API, this is the server that is serving your requests.
The API server runs as a static pod on the control plane nodes. It listens on port 6443/tcp, and all access from outside the Kubernetes cluster should go via HAProxy. Requests are authenticated with either tokens (mostly for internal usage) or client certificates signed via the certificates API.
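A quick way to confirm that you can reach the API server through this path is to query its health endpoint via kubectl (requires cluster access):

```shell
# Ask the API server for its health status; a healthy server answers "ok".
kubectl get --raw='/healthz'
```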
The controller-manager and scheduler contain most of the actual logic. The scheduler is responsible for assigning pods to nodes, and the controller-manager handles most other actions, for example launching CronJobs at scheduled times or ensuring ReplicaSets have the correct number of Pods running. The general idea is one of a 'reconciliation loop': poll/watch the API server for desired state and current state, then perform actions to make them match.
The primary service running on each node is the Kubelet, which is an interface between the Kubernetes API and the container runtime (containerd in our case). Kubelet is responsible for ensuring that the pods running on the node match what the API server wants to run on that node, and it reports metrics back to the API. It also proxies log requests when necessary. Pod health checks are also done by the Kubelet.
In addition, there are two networking-related services running on each node:
A reference of various Kubernetes labels we use and their meanings is available on Portal:Toolforge/Admin/Kubernetes/Labels.
The Kubernetes cluster runs multiple pieces of software responsible for cluster monitoring:
These are all deployed via the wmcs-k8s-metrics component using the standard component deployment model.
The Toolforge Prometheus servers scrape cadvisor, kube-state-metrics, and the Prometheus exporter endpoints in the apps that have them. For this, the Prometheus server has an external API certificate provisioned via Puppet that needs to be renewed yearly. The scrape targets are defined in the profile::toolforge::prometheus Puppet module.
Alerts are managed via the Alerts GitLab repo and sent via the metricsinfra infrastructure.
We use Calico as the Container Network Interface (CNI) for our cluster. In practice, Calico is responsible for allocating a small private subnet in 192.168.0.0/16 to each node, and then routing those subnets to provide full connectivity across all nodes.
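If you need to see which block each node was allocated, Calico records its IPAM allocations as Kubernetes custom resources when using the Kubernetes datastore. A sketch (the resource name is an assumption about our Calico version; output omitted):

```shell
# List the per-node IP address blocks Calico has carved out of 192.168.0.0/16.
kubectl sudo get ipamblocks.crd.projectcalico.org
```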
We deploy Calico following their self-managed on-premises model. We do not use their operator deployment; instead, we take their manifest deployment and build a Helm chart from it. Instructions for upgrading Calico in this setup are in the upgrade Calico section.
In the cluster we use the default CoreDNS DNS plugin. It resolves cluster-internal names (e.g. services) internally and forwards the remaining queries to the main Cloud VPS recursor service. CoreDNS configuration is managed by kubeadm and generally works well enough, although we should consider increasing the number of replicas.
We use kubernetes/ingress-nginx to route HTTP requests to specific tools inside the Kubernetes cluster. Ingress objects are created by webservice (soon jobs-api), and the ingress admission controller restricts each tool to [toolname].toolforge.org.
HAProxy receives external requests, terminates TLS, and forwards the traffic to the Kubernetes ingress nodes, which route it to the ingress-nginx component pods.
We implement simple per-tool and per-source-IP rate limiting within HAProxy. There are a couple of panels in the main toolforge dashboard, and an infra-k8s-haproxy dashboard with additional data.
You can check the logs for tools hitting the rate limits. As of today, anything with ip-rate or tool-rate > 250 is being rate-limited:
root@tools-k8s-haproxy-8:~# tail -n 1000 /var/log/haproxy/haproxy.log | grep -E 'tool-rate=[2-9][0-9]{2,}'
...
2025-11-18T16:54:23.368375+00:00 tools-k8s-haproxy-8 haproxy[199871]: 2001:::::1234 [18/Nov/2025:16:54:23.367] k8s-ingress-https~ k8s-ingress-https/<NOSRV> 0/-1/-1/-1/0 503 3330 - - PR-- 2051/2027/0/0/0 0/0 "GET /stalktoy/2.98.170.90 HTTP/1.1" 0/0000000000000000/0/0/0 meta3.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"meta3.toolforge.org" ip-rate:1 tool-rate:1 tool-rate=251
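If you just want the numeric tool-rate values from matching lines, for sorting or quick inspection, a small sketch along the same lines (the sample log line here is made up for illustration):

```shell
# Extract the numeric tool-rate value from a HAProxy log line.
sample='... host:"meta3.toolforge.org" ip-rate:1 tool-rate=251'
printf '%s\n' "$sample" | grep -oE 'tool-rate=[0-9]+' | cut -d= -f2
# → 251
```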
The worker nodes are Puppetized, which means they have the standard Cloud VPS SSSD setup for using LDAP data for user accounts.
In addition, most (as of February 2024) worker nodes have the shared storage NFS volumes mounted, and these nodes carry the kubernetes.wmcloud.org/nfs-mounted=true and toolforge.org/nfs-mounted=true labels so that tools can run NFS-requiring workloads on them. The volume-admission-controller admission controller mounts all volumes to pods with the toolforge: tool label.
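The labels mentioned above can be used directly as node selectors. For example, to list only the workers that have NFS mounted (output omitted; requires cluster access):

```shell
# List the nodes labeled as having NFS mounted.
kubectl sudo get nodes -l toolforge.org/nfs-mounted=true
```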
There are plans to introduce non-NFS workers to the pool once the Bookworm OS upgrades have finished. These would be used by tools with build service images, build service builds, and infrastructure components with no need for NFS connectivity. Given the reliability issues with NFS, new features should be designed in a way that at least does not make it harder to move away from NFS.
We use Kyverno to enforce a set of security and isolation constraints on all tool account Pod workloads running in the cluster.
Examples of things we ensure:
With Kyverno, we not only validate that pods are correct, but we also mutate (modify) them to inject particular values, such as the uid/gid of each tool account.
Each tool account has a Kyverno policy resource created by maintain-kubeusers.
Privileged workloads, like custom components we deploy, or internal kube-system components, are not subject to any Kyverno policy enforcement as of this writing.
We have a testing deployment in the toolsbeta Cloud VPS project. It is almost identical to the tools cluster, except it is much smaller.
The lima-kilo project can be used to run parts of a Toolforge Kubernetes cluster on a local machine.
The recommended way for someone to run a workload on Toolforge is to use the Jobs service (admin docs). The service creates Deployment, CronJob and other objects in tool namespaces.
Before the Jobs framework was introduced, many users used the Kubernetes API directly to run their tools. This is now deprecated, but some tools still use it because it works.
The Build service (admin docs) runs builds in the image-build namespace. All of this is managed via the build service API; users do not have direct access to that namespace. These builds run without NFS access.
There are a few different types of workers.
| Cookbook name | Name prefix | Description |
|---|---|---|
| worker | worker | Normal workers. As of February 2024, these do not have NFS access. |
| worker_nfs | worker-nfs | Normal workers with NFS. |
| control | control | Special purpose nodes for the Kubernetes control plane. |
| ingress | ingress | Special purpose workers exclusively for the web ingress and the API gateway. |
Addition and removal of all of these types is fully automated via the cookbooks.
We only allow running images from the Toolforge Docker registry (for "pre-built" images) and from the Toolforge Harbor server. This is for the following reasons:
This is enforced with an admission controller.
The decision to follow this approach was last discussed and re-evaluated at Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T302863_toolforge_byoc.
We use Puppet to provision the Kubernetes nodes, and also related non-K8s managed infrastructure such as etcd and HAProxy. However, configuration for what's inside the cluster should not be managed by Puppet for several reasons: