Portal:Toolforge/Admin/Runbooks/TektonDown


This runbook applies when the tekton-pipelines-controller pod in the tekton-pipelines namespace of the tools or toolsbeta k8s cluster is down or can't be reached.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) it is failing for.

Debugging

The first step is usually to ssh to one of the k8s control servers of the project the alert is from (tools or toolsbeta), e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud. From there you can:

  • Check that the pods are running:
      toolsbeta-test-k8s-control-4:/# sudo -i
      root@toolsbeta-test-k8s-control-4:/# kubectl get pods -n tekton-pipelines
      NAME                                           READY   STATUS    RESTARTS   AGE
      tekton-pipelines-controller-5c78ddd49b-dj4hz   1/1     Running   0          34d
      tekton-pipelines-webhook-5d899cc8c-zwf7p       1/1     Running   0          34d
  • You can also check the logs of the controller deployment with kubectl logs deploy/tekton-pipelines-controller -n tekton-pipelines.
  • If the pods or the deployment do not exist, you can try redeploying the jobs-api by following the instructions in the toolforge repo (it will do nothing if there's nothing to do).
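
The pod check above can also be scripted. The following is a minimal sketch, not part of the Toolforge tooling: the function name unready_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper (not part of the Toolforge tooling): reads the output of
# `kubectl get pods --no-headers` on stdin and prints every pod that is not
# Running with all of its containers ready.
unready_pods() {
    awk '{
        split($2, ready, "/")    # READY column looks like "1/1"
        if ($3 != "Running" || ready[1] != ready[2]) {
            print $1 " (" $3 ", " $2 ")"
        }
    }'
}
```

On a control server this could be used as kubectl get pods -n tekton-pipelines --no-headers | unready_pods. Empty output means every pod in the namespace is Running and fully ready.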

Doing a manual curl for the stats

You can try curling the pods directly for the statistics. The prometheus configuration gives you the cert, key and URL to use:

    root@tools-prometheus-6:~# grep 'job_name.*tekton' -A 40 /srv/prometheus/tools/prometheus.yml
    - job_name: tekton-pipelines-controller
      scheme: https
      tls_config:
        insecure_skip_verify: true
        cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
        key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
      kubernetes_sd_configs:
      - api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
        role: pod
        tls_config:
          insecure_skip_verify: true
          cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
          key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
        namespaces:
          names:
          - tekton-pipelines
      relabel_configs:
      ...
      - source_labels:
        - __meta_kubernetes_pod_name
        regex: "(tekton-pipelines-controller-[a-zA-Z0-9]+-[a-zA-Z0-9]+)"
        target_label: __metrics_path__
        replacement: "/api/v1/namespaces/tekton-pipelines/pods/${1}:9090/proxy/metrics"

Then you can curl the pods directly by name, for example:

    root@tools-prometheus-6:~# curl \
        --insecure \
        --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt \
        --key /etc/ssl/private/toolforge-k8s-prometheus.key \
        'https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/tekton-pipelines/pods/tekton-pipelines-controller-6f6bd874d9-kz9g2:9090/proxy/metrics'
    ....
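
Since the pod name changes on every redeploy, it can help to build the URL from the name instead of editing it by hand. A minimal sketch, mirroring the regex and replacement from the relabel_configs entry above: the function name metrics_url is hypothetical, and the default API server is the tools one taken from the configuration shown.

```shell
# Hypothetical helper: prints the API-server proxy URL for the metrics of a
# tekton-pipelines-controller pod, mirroring the relabel_configs entry.
# $1: pod name, $2: API server (optional, defaults to the tools cluster).
metrics_url() {
    pod_name="$1"
    api_server="${2:-https://k8s.tools.eqiad1.wikimedia.cloud:6443}"
    # Same pattern as the relabel_configs regex, anchored to the whole name
    # (Prometheus anchors relabel regexes implicitly).
    if ! printf '%s\n' "$pod_name" | grep -Eq '^tekton-pipelines-controller-[a-zA-Z0-9]+-[a-zA-Z0-9]+$'; then
        echo "pod name does not match the controller regex: $pod_name" >&2
        return 1
    fi
    printf '%s/api/v1/namespaces/tekton-pipelines/pods/%s:9090/proxy/metrics\n' "$api_server" "$pod_name"
}
```

The result can be passed to the curl command above in place of the quoted URL, e.g. "$(metrics_url tekton-pipelines-controller-6f6bd874d9-kz9g2)". A name that does not match the regex is rejected, which is also a hint that prometheus would not scrape that pod with this job.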

Common issues

Add new issues here when you encounter them!

Prometheus k8s cert expired

If tekton seems up, you can check if the certificates that prometheus uses to connect to k8s have expired:

    root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file /srv/prometheus/tools/prometheus.yml
        cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    ...
    root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
    Certificate:
    ...
        Validity
            Not Before: Jun  2 11:55:07 2022 GMT
            Not After : Jun  2 11:55:07 2023 GMT  <-- this one should be later than today
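
The date comparison can also be done by a script instead of by eye. A minimal sketch, assuming GNU date is available on the prometheus host: the function names not_after_is_valid and cert_is_valid are hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) if the given notAfter timestamp, in
# the format printed by `openssl x509 -noout -enddate`, is still in the future.
# Assumes GNU date (for `date -d`).
not_after_is_valid() {
    expiry_epoch="$(date -u -d "$1" +%s)" || return 2
    now_epoch="$(date -u +%s)"
    [ "$expiry_epoch" -gt "$now_epoch" ]
}

# Hypothetical wrapper: reads the expiry date from a certificate file.
# `openssl x509 -noout -enddate` prints a line like
# "notAfter=Jun  2 11:55:07 2023 GMT"; the prefix is stripped before comparing.
cert_is_valid() {
    not_after="$(openssl x509 -noout -enddate -in "$1")" || return 2
    not_after_is_valid "${not_after#notAfter=}"
}
```

For example, cert_is_valid /etc/ssl/localcerts/toolforge-k8s-prometheus.crt || echo expired. For a one-off check, openssl x509 -checkend 0 -noout -in followed by the certificate path gives the same answer through its exit status.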

To refresh the certificates and fix the issue, follow Portal:Toolforge/Admin/Kubernetes/Certificates#Operations.

Related information

Old incidents

Add any incident tasks here!

  • phab:T338025 - [T338025] [tools] Prometheus k8s cert expired
Retrieved from "https://wikitech.wikimedia.org/w/index.php?title=Portal:Toolforge/Admin/Runbooks/TektonDown&oldid=2248440"