This happens when the tekton-pipelines-controller pod in the tekton-pipelines namespace of the tools or toolsbeta k8s cluster is down or can't be reached.
It usually shows up as an alert in alertmanager.
The alert tells you which project (tools, toolsbeta, ...) it is failing for.
The first step is usually to ssh to a k8s control node of tools or toolsbeta, depending on the project the alert comes from (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can check that the pods are running:
toolsbeta-test-k8s-control-4:/# sudo -i
root@toolsbeta-test-k8s-control-4:/# kubectl get pods -n tekton-pipelines
NAME                                           READY   STATUS    RESTARTS   AGE
tekton-pipelines-controller-5c78ddd49b-dj4hz   1/1     Running   0          34d
tekton-pipelines-webhook-5d899cc8c-zwf7p       1/1     Running   0          34d
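When the listing is long, it can help to print only the pods that are not in the Running state. The following is a minimal sketch, not an established tool: the function name not_running_pods is hypothetical, and it assumes kubectl is already configured on the control node as in the session above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tooling): print the name and
# status of every pod in the tekton-pipelines namespace that is not "Running".
not_running_pods() {
    # --no-headers drops the NAME/READY/STATUS header row, so awk only sees
    # pod rows; column 1 is the pod name and column 3 is the status.
    kubectl get pods -n tekton-pipelines --no-headers \
        | awk '$3 != "Running" { print $1 " " $3 }'
}
```

Calling not_running_pods prints nothing when every pod reports Running. Note that it only looks at the STATUS column, so a pod that is Running but not ready (for example 0/1 in the READY column) would not be listed.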
To check the controller logs, run:
kubectl logs deploy/tekton-pipelines-controller -n tekton-pipelines
You can also try to curl the pods directly for their statistics. The prometheus configuration gives you the cert, key and URL to use:
root@tools-prometheus-6:~# grep 'job_name.*tekton' -A 40 /srv/prometheus/tools/prometheus.yml
- job_name: tekton-pipelines-controller
  scheme: https
  tls_config:
    insecure_skip_verify: true
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
  kubernetes_sd_configs:
  - api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
    role: pod
    tls_config:
      insecure_skip_verify: true
      cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
      key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
    namespaces:
      names:
      - tekton-pipelines
  relabel_configs:
  ...
  - source_labels:
    - __meta_kubernetes_pod_name
    regex: "(tekton-pipelines-controller-[a-zA-Z0-9]+-[a-zA-Z0-9]+)"
    target_label: __metrics_path__
    replacement: "/api/v1/namespaces/tekton-pipelines/pods/${1}:9090/proxy/metrics"
Then you can curl the pods directly by name, for example:
root@tools-prometheus-6:~# curl \
    --insecure \
    --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt \
    --key /etc/ssl/private/toolforge-k8s-prometheus.key \
    'https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/tekton-pipelines/pods/tekton-pipelines-controller-6f6bd874d9-kz9g2:9090/proxy/metrics'
...
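Since the pod name changes on every redeploy, the URL in the curl command above has to be rebuilt each time. The sketch below wraps that in two small functions. The names metrics_url and fetch_metrics are hypothetical; the URL pattern, cert path and key path are the ones from the prometheus configuration shown earlier, and the API server address and pod name are passed in as arguments.

```shell
#!/bin/sh
# Hypothetical helper: build the API-server proxy URL for a pod's metrics,
# following the "replacement" pattern from the prometheus relabel config.
# Usage: metrics_url <api_server_url> <pod_name>
metrics_url() {
    printf '%s/api/v1/namespaces/tekton-pipelines/pods/%s:9090/proxy/metrics\n' "$1" "$2"
}

# Hypothetical helper: fetch the metrics of one pod with the prometheus
# client cert and key. --fail makes curl exit non-zero on HTTP errors.
# Usage: fetch_metrics <api_server_url> <pod_name>
fetch_metrics() {
    curl --silent --show-error --fail --insecure \
        --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt \
        --key /etc/ssl/private/toolforge-k8s-prometheus.key \
        "$(metrics_url "$1" "$2")"
}
```

For example, fetch_metrics https://k8s.tools.eqiad1.wikimedia.cloud:6443 tekton-pipelines-controller-6f6bd874d9-kz9g2 would issue the same request as the manual curl, with the pod name taken from the current kubectl get pods output.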
Add new issues here when you encounter them!
If tekton seems up, you can check if the certificates that prometheus uses to connect to k8s have expired:
root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file /srv/prometheus/tools/prometheus.yml
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
...
root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:
...
        Validity
            Not Before: Jun  2 11:55:07 2022 GMT
            Not After : Jun  2 11:55:07 2023 GMT  <-- this one should be later than today
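Instead of reading the dates by eye, the expiry check can be scripted with the -checkend option of openssl x509, which exits with status 0 only if the certificate will not expire within the given number of seconds. This is a minimal sketch: the function name cert_status and its optional margin argument are hypothetical, and the cert path is passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: report whether a certificate is still valid.
# Usage: cert_status <path_to_cert> [margin_in_seconds]
cert_status() {
    cert="$1"
    margin="${2:-0}"   # default 0: only check that it has not expired yet
    # -checkend N exits 0 if the certificate will NOT expire in the next N
    # seconds; -noout suppresses printing the certificate itself.
    if openssl x509 -in "$cert" -noout -checkend "$margin" >/dev/null 2>&1; then
        echo "OK: $cert is valid for at least $margin more seconds"
    else
        echo "NOT OK: $cert is expired, expires within $margin seconds, or is unreadable"
        return 1
    fi
}
```

For example, cert_status /etc/ssl/localcerts/toolforge-k8s-prometheus.crt 604800 would also flag a certificate that expires within the next seven days.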
To refresh the certificates and fix the issue, follow Portal:Toolforge/Admin/Kubernetes/Certificates#Operations.
Add any incident tasks here!