Movatterモバイル変換

Portal:Toolforge/Admin/Kubernetes

From Wikitech

Kubernetes (often abbreviatedk8s) is an open-source system for automating deployment, and management of applications running incontainers. Kubernetes was selected in 2015 by the Cloud Services team as the replacement for Grid Engine in the Toolforge project.^[1] Usage of k8s in Tools began in mid-2016,^[2] and the current cluster design dates back to early 2020.^[3]

For help on using Kubernetes in Toolforge, see theKubernetes help documentation.

Subpages

About this document

This document tries to document the Kubernetes cluster used in Toolforge, and its direct support services (e.g. etcd). It does not cover specifics about services running in the cluster (e.g. theJobs service andbuild service), nor does it cover Toolforge services that are fully unrelated to the Kubernetes cluster (e.g.Redis).

The four main sections of this document correspond to the four categories of documentation inThe Grand Unified Theory of Documentation system in a structure inspired by how theTor Project Admins do it.

Tutorial

Access`kubectl`

kubectl is the official Kubernetes command line interface tool. Assuming you are listed as a maintainer of theadmin tool (or the toolsbeta equivalent) you will automatically have superuser credentials provisioned in your NFS home directory.

To use the CLI tool, log in to a bastion host on the project where the cluster you want to interact with is located. If you want to just experiment, you should use the toolsbeta cluster for that. Most read-only commands can be used out of the box, for example to list pods in thetool-fourohfour namespace used by the 404 handler:

$kubectlgetpod-ntool-fourohfourNAME                          READY   STATUS    RESTARTS   AGEfourohfour-7766466794-gtpgk   1/1     Running   0          7d20hfourohfour-7766466794-qctt8   1/1     Running   0          6d18h

However, all write actions and some read-only actions (e.g. interacting with nodes or secrets) will give you a permission error:

$kubectldeletepod-ntool-fourohfourfourohfour-7766466794-gtpgkError from server (Forbidden): pods "fourohfour-7766466794-gtpgk" is forbidden: User "taavi" cannot delete resource "pods" in API group "" in the namespace "tool-fourohfour"

If you're sure you want to continue, you need to usekubectl sudo:

$kubectlsudodeletepod-ntool-fourohfourfourohfour-7766466794-gtpgkpod "fourohfour-7766466794-gtpgk" deleted

kubectl sudo, as the name implies, really has full access to the entire cluster. You should only use it when you need to do something that your normal account does not have access to.

Manage pods

Pods are the basic unit of compute in Kubernetes. A pod consists of one or more OS-level containers that share a network namespace.

List pods

Pods can be listed with thekubectl get pod command. Log in to a toolsbeta bastion,become fourohfour and run:

$kubectlgetpodsNAME                         READY   STATUS    RESTARTS   AGEfourohfour-bd4ffc5ff-479sj   1/1     Running   0          43sfourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          35s

The-o (--output) flag can be used to customize the output. For example,-o wide will display more information:

$kubectlgetpods-owideNAME                         READY   STATUS    RESTARTS   AGE   IP                NODE                              NOMINATED NODE   READINESS GATESfourohfour-bd4ffc5ff-479sj   1/1     Running   0          91s   192.168.120.158   toolsbeta-test-k8s-worker-nfs-1   <none>           <none>fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          83s   192.168.145.16    toolsbeta-test-k8s-worker-nfs-2   <none>           <none>

Or-o json will display the data in JSON:

$kubectlgetpods-ojson|head-n5{    "apiVersion": "v1",    "items": [        {            "apiVersion": "v1",

So far we have only been accessing data in the namespace we are in. To access data in any namespace, we need to switch back to our user account. Now we can use the-n (--namespace) flag to specify which namespace to access.

$kubectlgetpod-ntool-fourohfourNAME                         READY   STATUS    RESTARTS   AGEfourohfour-bd4ffc5ff-479sj   1/1     Running   0          4m27sfourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          4m19s$kubectlgetpod-ntool-admin-owideNAME                    READY   STATUS    RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATESadmin-cb6d84bd8-pshh7   1/1     Running   0          5d21h   192.168.25.21   toolsbeta-test-k8s-worker-nfs-4   <none>           <none>

Or we can use-A (--all-namespaces) to list data in the entire cluster:

$kubectlgetpod-A|head-n5NAMESPACE            NAME                                                    READY   STATUS             RESTARTS         AGEapi-gateway          api-gateway-nginx-6ddddd6f64-mbnlg                      1/1     Running            0                12dapi-gateway          api-gateway-nginx-6ddddd6f64-tdl6c                      1/1     Running            0                8dbuilds-admission     builds-admission-7897cf7759-jtxb5                       1/1     Running            0                28hbuilds-admission     builds-admission-7897cf7759-nvmzt                       1/1     Running            0                26h

View logs for pod

To view the combined standard output and standard error for a pod, usekubectl logs:

$kubectlgetpod-nmaintain-kubeusersNAME                                  READY   STATUS    RESTARTS   AGEmaintain-kubeusers-55b649885c-px8c6   1/1     Running   0          87m$kubectllogs-nmaintain-kubeusersmaintain-kubeusers-55b649885c-px8c6|wc-l176

Some useful flags for this command are:

--tail NUMBER to only show the specified number of most recent lines
--follow to do, well, exactly what it says

Restart a pod

Manage workers

The main Toolforge cluster consists of a bit over 50 "normal" NFS-enabled workers, and somespecial workers used for specific purposes. These workers can be added and removed using cookbooks. Both adding and removing a node is fairly straightforward, although due to the time it takes to replace the entire cluster we prefer to update existing nodes instead of replacing the entire cluster during most routine maintenance (e.g. Kubernetes upgrades or node reboots). It is however totally fine to replace nodes in Toolsbeta if you want to try the process.

Add a worker

These cookbooks can be run from the cloudcumin hosts (recommended) or from your laptop if you have them set up locally. Use ofscreen ortmux is recommended.

To create a normalworker_nfs in toolsbeta, use:

$sudocookbookwmcs.toolforge.add_k8s_node--cluster-nametoolsbeta--roleworker_nfs

Remove a worker

Removing a worker is equally straightforward. To remove the oldestworker_nfs node in toolsbeta, use:

$sudocookbookwmcs.toolforge.remove_k8s_node--cluster-nametoolsbeta--roleworker_nfs

If you have a specific node that you want to remove, pass that as a parameter:

$sudocookbookwmcs.toolforge.remove_k8s_node--cluster-nametoolsbeta--roleworker_nfs--hostname-to-removetoolsbeta-test-k8s-worker-nfs-1

Drain and undrain a node

Sometimes a node is misbehaving or needs maintenance done on it, and needs to be drained from all workload. This is easiest done with the cookbook:

$sudocookbookwmcs.toolforge.k8s.worker.drain--cluster-nametoolsbeta--hostname-to-draintoolsbeta-test-k8s-worker-nfs-1

To "uncordon" (allow new pods to be scheduled to it again) the node, run the following on a bastion in the relevant project:

$kubectlsudouncordontoolsbeta-test-k8s-worker-nfs-1node/toolsbeta-test-k8s-worker-nfs-1 uncordoned

You can also just "cordon" a node which will prevent new workloads from being scheduled but won't drain existing ones:

$kubectlsudocordontoolsbeta-test-k8s-worker-nfs-1node/toolsbeta-test-k8s-worker-nfs-1 cordoned

That is also reversed with the uncordon command.

How-to

Cluster management

Build a new cluster

We have not built a new cluster since the 2020 cluster redesign. The documentation written during the 2020 redesign is atPortal:Toolforge/Admin/Kubernetes/Deploying, although it is likely somewhat outdated.

Upgrade Kubernetes

Kubernetes upstream releases new versions about three times a year.^[4] We cannot skip any upgrades and thus must upgrade sequentially. This process is documented atPortal:Toolforge/Admin/Kubernetes/Upgrading Kubernetes.

Upgrade Calico

The Calico upgrade procedure is roughly as follows:

Determine which version you are going to upgrade to by going to theupstream Kubernetes requirements page and switching versions from the navigation bar until you find the newest version with support for the current cluster Kubernetes release.
Locate the changelog by going to theCalico GitHub repository, switching to the branch that corresponds to the version you just picked, and then opening therelease-notes folder. Review the changelog. Repeat this step for any versions we are skipping past.

Run the cookbook to copy the images to our registry:

$sudocookbookwmcs.toolforge.k8s.calico.copy_images_to_registry--task-id"$TASK_ID"--calico-version"$TARGET_VERSION"

Clone therepos/cloud/toolforge/calico repository from GitLab, follow the instructions in README to update the manifests and send them for review.
Test in lima-kilo and toolsbeta.
Deploy to tools.
Update thekubeVersion statements intoolforge-deploy.git.

Upgrade ingress-nginx

The ingress-nginx upgrade procedure is roughly as follows:

Determine which version you are going to upgrade to fromthe upstream release list. (Note that ingress-nginx version and the Helm chart versions are slightly different; you're going to need both.) Review the changelogs between the currently live release and the new target release.

Retrieve the new version image hash from the corresponding GitHub release tag page, and copy the new container image to our registry:

$sudocookbookwmcs.toolforge.k8s.image.copy_to_registry--task-id"$TASK_ID"--origin-image"registry.k8s.io/ingress-nginx/controller:v$NEW_VERSION@sha256:$HASH"--dest-image-name"nginx-ingress-controller"--dest-image-version"v$NEW_VERSION"

Note that the current process used to mirror the image will cause the image digest to change!^[5]

UpdateHelm values in toolforge-deploy forlocal andtoolsbeta (example), and deploy and test that.
Repeat fortools. (TODO: example)

Upgrade worker operating system

We have upgraded the cluster OS once, from Buster to Bookworm, and during the same time changed the container runtime from Docker to containerd.^[6] There is no set process or specific automation for this, but the approach taken last time was:

Pick which Debian release you're going to upgrade to
Ensure the container runtime version in that release is supported by Kubernetes,Calico andcadvisor
Import kubeadm packages for the new Debian release
Add a new worker in toolsbeta
1. Test carefully that it works
2. Do this for all types to test out all configuration combinations (with/without NFS, with/without extra volume)
3. Remove matching number of old workers
Replace a control node in toolsbeta
Add a few new nodes in tools
Wait a few days
Replace all tools workers
1. In paraller, replace remaining toolsbeta workers
Replace tools controls

Roll reboot cluster

Thewmcs.toolforge.k8s.reboot cookbook can be used to reboot the entire cluster, for example to apply Kernel or container runtime updates, or in case the NFS server is having issues. Start from reading the--help output for the cookbook. For example, in the NFS issue case in toolsbeta, you could run:

$sudocookbookwmcs.toolforge.k8s.reboot--cluster-nametoolsbeta--all-workers

Renew kubelet certificates

Allkubeadm-managed kubelet client certs are rotated when upgrading Kubernetes, but they can be manually rotated with kubeadm as well. It is possible to configure the kubelet to request upgraded certs on its own when they near expiration. So far, we have not set this flag in the config, expecting our upgrade cycle to be 6 months, roughly.

To renew certificates run thewmcs.toolforge.k8s.kubeadm_certs_renewcookbook. You would usually run the cookbook for all control nodes in a given cluster. The cookbook is idempotent and can be run at any time safely. For example, to renew the certs on all toolsbeta control nodes, you would run>

$sudocookbookwmcs.toolforge.k8s.kubeadm_certs_renew--projecttoolsbeta--control-hostname-listtoolsbeta-test-k8s-control-4toolsbeta-test-k8s-control-5toolsbeta-test-k8s-control-6

etcd

Add etcd node

Remove etcd nodes

Upgrade etcd

Tracked inPhabricator
Task T361237

We run etcd from the Debian packages, so an etcd upgrade is automatically a Debian upgrade and vice versa.

~~We have not upgraded etcd yet since the 2020 cluster redesign.~~ This section should be filled when we do that for the~~first~~ next time.

Component system

In theToolforge Kubernetes component workflow improvements enchancement proposal we introduced a standard "components" system for various components that run in the Kubernetes cluster.

Deployment

This is documented in thetoolforge-deploy.git README file.

Manage (tool) users

Modify quotas

Tool quotas are managed by maintain-kubeusers and configured inin the values file in toolforge-deploy.git.^[7] To update quotas for a specific tool:

Send a patch to the values file changing the quotas. The format should be relatively self-explanatory, and the defaults and supported keys are listed in thedefault values file. Always change the version when making any kind of change or it will not be applied.
Merge the patch to main anddeploy it like any other component change.

Regenerate .kube/config

In case something goes wrong with the credentials for a certain tool user, you can delete themaintain-kubeusers configmap which will causemaintain-kubeusers to re-generate the credentials for that user. On a bastion in the relevant project, run:

$kubectlsudodeletecm-ntool-$TOOLmaintain-kubeusers

Please have a look at the logs for maintain-kubeusers and file a bug so the issue can be fixed.

Enable observer access

Requests forobserver access must be approved by the Toolforge admins in a Phabricator task. Once approved, they can be implemented on a control plane node with:

$sudo-iwmcs-enable-cluster-monitor<tool-name>

Manage user workloads

Find newly added workloads

TheKubernetes capacity alert runbook documents how to find where a sudden increase in workload has come from.

Pod tracing

Given all tools running on a single worker share that worker's IP address, occasionally you need to figure out which tool on a given worker is misbehaving. That process is documented onPortal:Toolforge/Admin/Kubernetes/Pod tracing.

Update prebuilt images

This has been moved tothe Jobs Service documentation.

Reference

Admission controllers

Custom admission controllers in the Toolforge cluster
Repository	Related to functionality	Description
builds-admission	Build Service	Validate build service user-created pipelines
envvars-admission	Envvars Service	Inject configured envvars to pods
ingress-admission	Webservice	Validate created ingress objects use the domain allowed for that tool
registry-admission	Jobs Service	Validate new pods use images in the Toolforge docker registry or Harbor
volume-admission	Jobs Service	Inject NFS and configuration file mounts to pods that are configured to have them

Authentication, authorization, certificates and RBAC

cert-manager

External certificates

maintain-kubeusers

maintain-kubeusers is responsible for creating Kubernetes credentials and a namespace (tool-[tool name]) for each tool, and removing access for disabled tools. It is also in charge of maintaining quotas and other resources for each tool (like kyverno security policies, etc). In addition, it creates admin credentials all maintainers of theadmin tool.

The service is written as a long-running daemon, and it talks to LDAP directly for tool data. It exports Prometheus metrics, but those are not used for any alerts or dashboards at this moment.

kubelet certificates

Kubelet has two certs:

A client cert to communicate with the API server
A serving certificate for the Kubelet API

At this time the serving certificate is a self-signed one managed by kubelet, which should not need manual rotation. Proper, CA-signed rotating certs are stabilizing as a feature set in Kubernetes 1.17, and we should probably switch to that for consistency and as a general improvement. The client cert of kubelet is signed by the cluster CA and expires in 1 year. They are automatically renewed when the cluster is upgraded, but can berenewed manually as well.

Observer access

Some tools (e.g.k8s-status) need more access to the Kubernetes API than what the default credentials require. For these tools, an "observer" role has been created that grants read-only access to non-sensitive data about the cluster and workloads that run on it.^[8] The role is deployed from a file deployedfrom Puppet (althoughphab:T328539 proposes moving it to maintain-kubeusers), and role bindings are created manually using a script.

Using observer status in a job withserviceAccountName: ${tool}-obs is not supported by the Jobs service or webservice. The k8s-status tool usesa custom script for managing a web service with such access included.

Requests for such access should be approved by the Toolforge admins beforeaccess is granted.

Backups

Tracked inPhabricator
Task T339934

The main thing worth backing up is the contents of the etcd cluster. It is not currently backed up.

Bastion nodes

The Toolforge bastion nodes havekubectl installed. As the bastion nodes have NFS mounts, and maintain-kubeusers provisions certificates to NFS, everything will then work out of the box.

Kubernetes design

TheKubernetes documentation is both more detailed and up-to-date. Here is, however, a quick overview of the major Kubernetes components.

Control plane

etcd

Kubernetes stores all state inetcd - all other components are stateless. The etcd cluster is only accessed directly by the API Server and no other component. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is firewalled off to only be reachable by the rest of the control plane nodes as well as etcd nodes, has client certificate verification in use for authentication (puppet is CA) and secrets are encrypted at rest in our etcd setup.

We currently use a 3 node cluster, hosted on VMs separate from the main control plane. They're all smallish Debian Buster instances configured largely by the sameetcd puppet code we use in production. The main interesting thing about them is that they'relocaldisk instances as etcd is rather sensitive to iowait.

API server

The API server the heart of the Kubernetes control plane. All communication between all components, whether they are internal system components or external user components,must go through the API server. It is purely a data access layer, containing no logic related to any of the actual end-functionality Kubernetes offers. It offers the following functionality:

Authentication & Authorization
Validation
Read / Write access to all the API endpoints
Watch functionality for endpoints, which notifies clients when state changes for a particular resource

When you are interacting with the Kubernetes API, this is the server that is serving your requests.

The API server runs as a static pod on the control plane nodes. It listens on port 6443/tcp, and all access from outside the Kubernetes cluster should go viaHAProxy. Requests are authenticated with either tokens (mostly for internal usage) or client certificates signed via the certificates API.

controller-manager and scheduler

The controller-manager and scheduler contain most of the actual logic. The scheduler is responsible for assigning pods to nodes and the controller-manager is for most other actions, for example launching CronJobs at scheduled times or ensuring ReplicaSets have the correct number of Pods running. The general idea is one of a 'reconciliation loop' - poll/watch the API server for desired state and current state, then perform actions to make them match.

Worker

The primary service running on each node is the Kubelet, which is an interface between the Kubernetes API and the container runtime (containerd in our case). Kubelet is responsible for ensuring the pods running on the node match with what the API server wants to run on that node, and reports back metrics to the API. It also proxies logs requests when necessary. Pod health checks are also done by the Kubelet.

In addition, there are two networking-related services running on each node:

kube-proxy manages iptables NAT rules for Service addresses.
The container network interface (or CNI,Calico in our cluster) manages the rest of the cluster networking. In practice this means an overlay network where each pod is assigned an cluster-internal IP address which can be routed across the entire cluster.

Labels

A reference of various used Kubernetes labels and their meanings is available onPortal:Toolforge/Admin/Kubernetes/Labels.

Monitoring and metrics

Alert runbooks

Kubernetes metrics stack

The Kubernetes cluster runs multiple pieces of software responsible for cluster monitoring:

metrics-server (per-container and node metrics for Kubernetes internal use)
cadvisor (per-container metrics for Prometheus)
kube-state-metrics (cluster-level metrics from Prometheus)

These are all deployed via thewmcs-k8s-metrics component using the standard component deployment model.

Prometheus integration

Toolforge Prometheus servers scrapes cadvisor, kube-state-metrics and Prometheus exporter endpoints in the apps that have them. For this, the Prometheus server have anexternal API certificate provisioned via Puppet that needs to be renewed yearly. The scrape targets are defined in theprofile::toolforge::prometheus Puppet module.

Alerts are managed viathe Alerts GitLab repo and sent viametricsinfra infrastructure.

Networking

Calico

We useCalico as the Container Network Interface (CNI) for our cluster. In practice Calico is responsible for allocating a small private subnets in192.168.0.0/16 for each node, and then routing those subnets to provide full connectivity across all nodes.

We deploy Calico following theirself-managed on-premises model. We donot use their operator deployment - instead, we take their manifest deployment and builda Helm chart from it. Instructions for upgrading Calico in this setup are in theupgrade Calico section.

DNS

In the cluster we use the defaultCoreDNS DNS plugin. It resolves cluster-internal names (e.g. services) internally and forwards the remaining queries to the main Cloud VPS recursor service. CoreDNS configuration is managed by Kubeadm and generally works well enough, although we should considerincreasing the number of replicas.

Ingress

We usekubernetes/ingress-nginx to route HTTP requests to specific tools inside the Kubernetes cluster. Ingress objects are created by webservice (soon jobs-api), and the ingressadmission controller restricts each tool to[toolname].toolforge.org.

HAProxy (external service access)

HAProxy receives external requests, terminates ssl and forwards the traffic to the kubernetes ingress nodes for them to redirect to theingress-nginx component pods.

Tracked inPhabricator
Task T409794

We implement a simple rate limiting per tool and per source ip within HAProxy. There are a couple of panels in the maintoolforge dashboard, and ainfra-k8s-haproxy dashboard with additional data.

You can check in the logs for tools hitting the rate limits. As of today, anything withip-rate ortool-rate > 250 is being rate-limited:

root@tools-k8s-haproxy-8:~#tail-n1000/var/log/haproxy/haproxy.log|grep-E'tool-rate=[2-9][0-9]{2,}'...2025-11-18T16:54:23.368375+00:00 tools-k8s-haproxy-8 haproxy[199871]: 2001:::::1234 [18/Nov/2025:16:54:23.367] k8s-ingress-https~ k8s-ingress-https/<NOSRV> 0/-1/-1/-1/0 503 3330 - - PR-- 2051/2027/0/0/0 0/0 "GET /stalktoy/2.98.170.90 HTTP/1.1" 0/0000000000000000/0/0/0 meta3.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"meta3.toolforge.org" ip-rate:1 tool-rate=251

NFS and LDAP

The worker nodes are Puppetized which means they have the standard Cloud VPS SSSD setup for using LDAP data for user accounts.

In addition, most (as of February 2024) worker nodes have theshared storage NFS volumes mounted, and these nodes have thekubernetes.wmcloud.org/nfs-mounted=true andtoolforge.org/nfs-mounted=true for tools to run NFS-requiring workloads on them. The volume-admission-controlleradmission controller mounts all volumes to pods with thetoolforge: tool label.

Thereare plans to introduce non-NFS workers to the pool once the Bookworm OS upgrades have finished. These would be used by tools with build service images, buildservice builds and infrastructure components with no need for NFS connectivity. Given the reliability issues with NFS, new features should be designed in a way that they at least do not make it harder to move away from NFS.

Pod isolation

We usekyverno to enforce a set of security and isolation constraints to all tool account Pod workloads running in the cluster.

Examples of things we ensure:

pods have a limited set of privileges and cannot escalate
pods cannot "hijack" files from other tools via jumping NFS data dir
pods don't run as root

With kyverno, we not only validate that pods are correct, but we also mutate (modify) them to inject some values we want them to be set to something in particular, like the uid/gid of each tool account.

Each tool account has a Kyverno policy resource created by maintain-kubeusers.

Privileged workloads, like custom components we deploy, or internal kube-system components, are not subject to any Kyverno policy enforcement as of this writing.

Testing and local deployments

We have a testing deployment in thetoolsbeta Cloud VPS project. It is almost identical to thetools cluster except it is much smaller.

Thelima-kilo project can be used to run parts of a Toolforge Kubernetes cluster on a local machine.

User workloads

Jobs service

The recommended way for someone to run a workload on Toolforge is to use theJobs service (admin docs). The service will create deployment, cronjob and other objects in tool namespaces.

Raw Kubernetes API users

Before the Jobs framework was introduced, many users usedthe Kubernetes API directly to run their tools. This is now deprecated, but tools are still using it because it works.

Build service builds

TheBuild service (admin docs) runs builds in theimage-build namespace. All of this is managed via the build service API, users do not have direct access to that namespace. These builds run without NFS access.

Worker types

There are a few different types of workers.

Cookbook name	Name prefix	Description
`worker`	worker	Normal workers. As of February 2024, these do not have NFS access.
`worker_nfs`	worker-nfs	Normal workers with NFS.
`control`	control	Special purpose nodes for the Kubernetes control plane.
`ingress`	ingress	Special purpose workers exclusively for theweb ingress and theAPI gateway.

Addition and removal of all of these types is fully automated via the cookbooks.

Discussion

Bring-your-own-image

We only allow running images from the Toolforge Docker registry (for "pre-built" images) and from the Toolforge Harbor server. This is for the following purposes:

Making it easy to enforce our Open Source Code only guideline
Make it easy to do security updates when necessary (just rebuild all the containers & redeploy)
Faster deploys, since this is in the same network (vs dockerhub, which is retreived over the internet)
Access control is provided totally by us, less dependent on dockerhub
Provide required LDAP configuration, so tools running inside the container are properly integrated in the Toolforge environment

This is enforced with anadmission controller.

The decision to follow this approach was last discussed and re-evaluated atWikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T302863_toolforge_byoc.

GitOps tools

Puppet

We use Puppet to provision the Kubernetes nodes, and also related non-K8s managed infrastructure such as etcd and HAProxy. However, configuration for what's inside the cluster should not be managed by Puppet for several reasons:

We already have a deployment management system for what's inside the cluster (toolforge-deploy.git), and we should not introduce two systems for the same purpose
Puppet cannot be used to provision a local environment as is
puppet.git merges require global root, which not all Toolforge admins have

Single cluster reliance

References

Retrieved from "https://wikitech.wikimedia.org/w/index.php?title=Portal:Toolforge/Admin/Kubernetes&oldid=2363663"

Category:

Toolforge admin

[8]ページ先頭

Movatterモバイル変換

Subpages

About this document

Tutorial

Accesskubectl

Manage pods

List pods

View logs for pod

Restart a pod

Manage workers

Add a worker

Remove a worker

Drain and undrain a node

How-to

Cluster management

Build a new cluster

Upgrade Kubernetes

Upgrade Calico

Upgrade ingress-nginx

Upgrade worker operating system

Roll reboot cluster

Renew kubelet certificates

etcd

Add etcd node

Remove etcd nodes

Upgrade etcd

Component system

Deployment

Manage (tool) users

Modify quotas

Regenerate .kube/config

Enable observer access

Manage user workloads

Find newly added workloads

Pod tracing

Update prebuilt images

Reference

Admission controllers

Authentication, authorization, certificates and RBAC

cert-manager

External certificates

maintain-kubeusers

kubelet certificates

Observer access

Backups

Bastion nodes

Kubernetes design

Control plane

etcd

API server

controller-manager and scheduler

Worker

Labels

Monitoring and metrics

Alert runbooks

Kubernetes metrics stack

Prometheus integration

Networking

Calico

DNS

Ingress

HAProxy (external service access)

NFS and LDAP

Pod isolation

Testing and local deployments

User workloads

Jobs service

Raw Kubernetes API users

Build service builds

Worker types

Discussion

Bring-your-own-image

GitOps tools

Puppet

Single cluster reliance

References

Access`kubectl`