Troubleshoot connectivity issues in your cluster

Google Kubernetes Engine (GKE) cluster network connectivity issues can cause connection timeouts, refused connections, unreachable Pods, Service disruptions, or inter-Pod communication failures.

Use this document to troubleshoot cluster connectivity by capturing network packets (by using toolbox and tcpdump) and diagnosing Pod network issues.

This information is important for Platform admins and operators and Application developers diagnosing and fixing network problems affecting GKE workloads. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Troubleshoot connectivity issues by capturing network packets in GKE

This section describes how to troubleshoot connectivity issues by capturing network packets. Symptoms of these issues include connection timeouts, connection refused errors, or unexpected application behavior. The issues can occur at the node level or at the Pod level.

Connectivity problems in your cluster network are often in the following categories:

  • Pods not reachable: A Pod might not be accessible from inside or outside the cluster due to network misconfigurations.
  • Service disruptions: A Service might be experiencing interruptions or delays.
  • Inter-Pod communication issues: Pods might not be able to communicate with each other effectively.

Connectivity issues in your GKE cluster can originate from various causes, including the following:

  • Network misconfigurations: Incorrect network policies, firewall rules, or routing tables.
  • Application bugs: Errors in application code affecting network interactions.
  • Infrastructure problems: Network congestion, hardware failures, or resource limitations.

The following section shows how to resolve the issue on the problematic nodes or Pods.

  1. Identify the node on which the problematic Pod is running by using the following command:

    kubectl get pods POD_NAME -o=wide -n NAMESPACE

    Replace the following:

    • POD_NAME with the name of the Pod.
    • NAMESPACE with the Kubernetes namespace.
  2. Connect to the node:

    gcloud compute ssh NODE_NAME \
        --zone=ZONE

    Replace the following:

    • NODE_NAME: name of your node.
    • ZONE: name of the zone in which the node runs.
  3. To debug a specific Pod, identify the veth interface associated with the Pod:

    ip route | grep POD_IP

    Replace POD_IP with the Pod's IP address. For an example of this lookup, see the sketch after these steps.

  4. Run the toolbox commands.
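
For example, suppose the Pod's IP address is 10.8.2.7 (a hypothetical value). On the node, the route for that address points at the Pod's virtual interface, whose name you can then pass to tcpdump:

    ip route | grep 10.8.2.7
    # Hypothetical output; the interface name varies per Pod and per GKE version:
    # 10.8.2.7 dev veth8a2b3c4d scope host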

toolbox commands

toolbox is a utility that provides a containerized environment within your GKE nodes for debugging and troubleshooting. This section describes how to install the toolbox utility and use it to troubleshoot the node; a complete example session follows the steps.

  1. While connected to the node, start the toolbox tool:

    toolbox

    This downloads the files that facilitate the toolbox utility.

  2. In the toolbox root prompt, install tcpdump:

    apt install -y tcpdump

  3. Start the packet capture:

    tcpdump -i eth0 -s 100 "port PORT" \
        -w /media/root/mnt/stateful_partition/CAPTURE_FILE_NAME

    Replace the following:

    • PORT: the port number on which to capture traffic.
    • CAPTURE_FILE_NAME: name of your capture file.
  4. Stop the packet capture by interrupting tcpdump (for example, press Control+C).

  5. Leave the toolbox by typing exit.

  6. List the packet capture file and check its size:

    ls -ltr /mnt/stateful_partition/CAPTURE_FILE_NAME
  7. Copy the packet capture from the node to the current working directory on your computer:

    gcloud compute scp NODE_NAME:/mnt/stateful_partition/CAPTURE_FILE_NAME . \
        --zone=ZONE

    Replace the following:

    • NODE_NAME: name of your node.
    • CAPTURE_FILE_NAME: name of your capture file.
    • ZONE: name of your zone.
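
Putting the preceding steps together, a complete capture session might look like the following sketch. The port (443), capture file name (capture.pcap), node name, and zone are hypothetical values; substitute your own.

    # On the node, inside toolbox:
    toolbox
    apt install -y tcpdump
    tcpdump -i eth0 -s 100 "port 443" \
        -w /media/root/mnt/stateful_partition/capture.pcap
    # Press Control+C to stop the capture, then type exit to leave toolbox.

    # Still on the node, confirm that the file exists:
    ls -ltr /mnt/stateful_partition/capture.pcap

    # From your workstation, copy the capture to the current directory:
    gcloud compute scp gke-example-node:/mnt/stateful_partition/capture.pcap . \
        --zone=us-central1-a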

Alternative commands

You can also use the following ways to troubleshoot connectivity issues on the problematic Pods (a sketch of both approaches follows this list):

  • Ephemeral debug workload attached to the Pod container.

  • Run a shell directly on the target Pod using kubectl exec, then install and launch the tcpdump command.
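
The following sketch shows what each approach might look like. The debug image (nicolaka/netshoot), the container and namespace names, and the port are hypothetical, and the kubectl exec variant assumes the container image ships a package manager and that the Pod runs with enough privileges to install and run tcpdump.

    # Attach an ephemeral debug container to the Pod and capture traffic from it:
    kubectl debug -it POD_NAME -n NAMESPACE --image=nicolaka/netshoot \
        --target=CONTAINER_NAME -- tcpdump -i eth0 -s 100 "port 443"

    # Or open a shell in the running container and install tcpdump there:
    kubectl exec -it POD_NAME -n NAMESPACE -- sh -c \
        "apt-get update && apt-get install -y tcpdump && tcpdump -i eth0"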

Pod network connectivity issues

As mentioned in the Network Overview discussion, it is important to understand how Pods are wired from their network namespaces to the root namespace on the node in order to troubleshoot effectively. For the following discussion, unless otherwise stated, assume that the cluster uses GKE's native CNI rather than Calico's. That is, no network policy has been applied.

Pods on select nodes have no availability

If Pods on select nodes have no network connectivity, ensure that the Linux bridge is up:

ip address show cbr0
Note: The virtual network bridge cbr0 is only created if there are Pods which set hostNetwork: false.

If the Linux bridge is down, raise it:

sudo ip link set cbr0 up

Ensure that the node is learning Pod MAC addresses attached to cbr0:

arp -an
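
The following sketch combines both checks; the interface state, addresses, and MAC entries shown in the comments are hypothetical output and will differ on your node.

# Brief view of the bridge state; a healthy bridge reports UP:
ip -br address show cbr0
# cbr0    UP    10.52.1.1/24

# ARP entries for Pods attached to the bridge:
arp -an | grep cbr0
# ? (10.52.1.7) at 0a:58:0a:34:01:07 [ether] on cbr0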

Pods on select nodes have minimal connectivity

If Pods on select nodes have minimal connectivity, you should first confirm whether there are any lost packets by running tcpdump in the toolbox container:

sudo toolbox bash

Install tcpdump in the toolbox if you have not done so already:

apt install -y tcpdump

Run tcpdump against cbr0:

tcpdump -ni cbr0 host HOSTNAME and port PORT_NUMBER and [TCP|UDP|ICMP]
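
For example, to watch TCP traffic between the bridge and a single Pod, the filter might look like the following; the host address and port are hypothetical.

# Hypothetical Pod IP and port; 'tcp' restricts the capture to TCP segments.
tcpdump -ni cbr0 host 10.52.1.7 and port 8080 and tcp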

If it appears that large packets are being dropped downstream from the bridge (for example, the TCP handshake completes, but no SSL hellos are received), ensure that the MTU for each Linux Pod interface is correctly set to the MTU of the cluster's VPC network.

ip address show cbr0

When overlays are used (for example, Weave or Flannel), this MTU must be further reduced to accommodate encapsulation overhead on the overlay.
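
To compare the two values, read the MTU of the bridge on the node and the MTU of the VPC network; the network name in the following sketch is a placeholder.

# MTU reported by the bridge (and therefore the Pod-facing interfaces):
ip address show cbr0 | grep mtu

# MTU configured on the cluster's VPC network:
gcloud compute networks describe NETWORK_NAME --format="value(mtu)"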

GKE MTU

The MTU selected for a Pod interface is dependent on the Container Network Interface (CNI) used by the cluster Nodes and the underlying VPC MTU setting. For more information, see Pods.

Autopilot is configured to always inherit the VPC MTU.

The Pod interface MTU value is either 1460 or inherited from the primary interface of the Node.

CNI                                    | MTU       | GKE Standard
---------------------------------------|-----------|--------------
kubenet                                | 1460      | Default
kubenet (GKE version 1.26.1 and later) | Inherited | Default
Calico                                 | 1460      | Enabled by using --enable-network-policy. For details, see Control communication between Pods and Services using network policies.
netd                                   | Inherited | Enabled by using any of the following:
GKE Dataplane V2                       | Inherited | Enabled by using --enable-dataplane-v2. For details, see Using GKE Dataplane V2.

Intermittent failed connections

Connections to and from the Pods are forwarded by iptables. Flows are tracked as entries in the conntrack table and, where there are many workloads per node, conntrack table exhaustion may manifest as a failure. These failures can be logged in the serial console of the node, for example:

nf_conntrack: table full, dropping packet
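
One way to look for these messages without connecting to the node is to search its serial console output, as shown in the following sketch; replace the node name and zone with your own values.

gcloud compute instances get-serial-port-output NODE_NAME \
    --zone=ZONE | grep nf_conntrack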

If you are able to determine that intermittent issues are driven by conntrack exhaustion, you may increase the size of the cluster (thus reducing the number of workloads and flows per node), or increase nf_conntrack_max:

new_ct_max=$(awk '$1 == "MemTotal:" { printf "%d\n", $2/32; exit; }' /proc/meminfo)
sysctl -w net.netfilter.nf_conntrack_max="${new_ct_max:?}" \
  && echo "net.netfilter.nf_conntrack_max=${new_ct_max:?}" >> /etc/sysctl.conf
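
To see how close the node is to the limit before or after raising it, compare the current entry count with the configured maximum; this quick check assumes the nf_conntrack module is loaded on the node.

# Current number of tracked connections versus the configured ceiling:
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max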

You can also use NodeLocal DNSCache to reduce connection tracking entries.

"bind: Address already in use " reported for a container

A container in a Pod is unable to start because, according to the container logs, the port that the application is trying to bind to is already reserved. The container is crash looping. For example, in Cloud Logging:

resource.type="container"textPayload:"bind: Address already in use"resource.labels.container_name="redis"2018-10-16 07:06:47.000 CEST 16 Oct 05:06:47.533 # Creating Server TCP listening socket *:60250: bind: Address already in use2018-10-16 07:07:35.000 CEST 16 Oct 05:07:35.753 # Creating Server TCP listening socket *:60250: bind: Address already in use

When Docker crashes, sometimes a running container gets left behind and is stale. The process is still running in the network namespace allocated for the Pod, and listening on its port. Because Docker and the kubelet don't know about the stale container, they try to start a new container with a new process, which is unable to bind on the port because it gets added to the network namespace already associated with the Pod.

To diagnose this problem:

  1. You need the UUID of the Pod, which is in the .metadata.uid field:

    kubectl get pod -o custom-columns="name:.metadata.name,UUID:.metadata.uid" ubuntu-6948dd5657-4gsgg

    name                      UUID
    ubuntu-6948dd5657-4gsgg   db9ed086-edba-11e8-bdd6-42010a800164
  2. Get the output of the following commands from the node:

    docker ps -a
    ps -eo pid,ppid,stat,wchan:20,netns,comm,args:50,cgroup --cumulative -H | grep [Pod UUID]
  3. Check the running processes from this Pod. Because the cgroup names contain the UUID of the Pod, you can grep for the Pod UUID in the ps output. Also grep the preceding line, so that you see the docker-containerd-shim processes that have the container ID in their arguments. Cut the rest of the cgroup column to get a simpler output:

    # ps -eo pid,ppid,stat,wchan:20,netns,comm,args:50,cgroup --cumulative -H | grep -B 1 db9ed086-edba-11e8-bdd6-42010a800164 | sed s/'blkio:.*'/''/
    1283089     959 Sl   futex_wait_queue_me  4026531993       docker-co       docker-containerd-shim 276e173b0846e24b704d4 12:
    1283107 1283089 Ss   sys_pause            4026532393         pause           /pause                                     12:
    1283150     959 Sl   futex_wait_queue_me  4026531993       docker-co       docker-containerd-shim ab4c7762f5abf40951770 12:
    1283169 1283150 Ss   do_wait              4026532393         sh              /bin/sh -c echo hello && sleep 6000000     12:
    1283185 1283169 S    hrtimer_nanosleep    4026532393           sleep           sleep 6000000                            12:
    1283244     959 Sl   futex_wait_queue_me  4026531993       docker-co       docker-containerd-shim 44e76e50e5ef4156fd5d3 12:
    1283263 1283244 Ss   sigsuspend           4026532393         nginx           nginx: master process nginx -g daemon off; 12:
    1283282 1283263 S    ep_poll              4026532393           nginx           nginx: worker process
  4. From this list, you can see the container IDs, which should be visible in docker ps as well.

    In this case:

    • docker-containerd-shim 276e173b0846e24b704d4 for pause
    • docker-containerd-shim ab4c7762f5abf40951770 for sh with sleep (sleep-ctr)
    • docker-containerd-shim 44e76e50e5ef4156fd5d3 for nginx (echoserver-ctr)
  5. Check those in the docker ps output:

    # docker ps --no-trunc | egrep '276e173b0846e24b704d4|ab4c7762f5abf40951770|44e76e50e5ef4156fd5d3'
    44e76e50e5ef4156fd5d383744fa6a5f14460582d0b16855177cbed89a3cbd1f   registry.k8s.io/echoserver@sha256:3e7b182372b398d97b747bbe6cb7595e5ffaaae9a62506c725656966d36643cc   "nginx -g 'daemon off;'"                     14 hours ago   Up 14 hours   k8s_echoserver-cnt_ubuntu-6948dd5657-4gsgg_default_db9ed086-edba-11e8-bdd6-42010a800164_0
    ab4c7762f5abf40951770d3e247fa2559a2d1f8c8834e5412bdcec7df37f8475   ubuntu@sha256:acd85db6e4b18aafa7fcde5480872909bd8e6d5fbd4e5e790ecc09acc06a8b78                                                "/bin/sh -c 'echo hello && sleep 6000000'"   14 hours ago   Up 14 hours   k8s_sleep-cnt_ubuntu-6948dd5657-4gsgg_default_db9ed086-edba-11e8-bdd6-42010a800164_0
    276e173b0846e24b704d41cf4fbb950bfa5d0f59c304827349f4cf5091be3327   registry.k8s.io/pause-amd64:3.1

    In normal cases, you see all container IDs from ps showing up in docker ps. If there is one you don't see, it's a stale container, and you will probably see a child process of the docker-containerd-shim process listening on the TCP port that is reported as already in use.

    To verify this, execute netstat in the container's network namespace. Get the pid of any container process (so NOT docker-containerd-shim) for the Pod.

    From the preceding example:

    • 1283107 - pause
    • 1283169 - sh
    • 1283185 - sleep
    • 1283263 - nginx master
    • 1283282 - nginx worker
    # nsenter -t 1283107 --net netstat -anp
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1283263/nginx: mast
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
    unix  3      [ ]         STREAM     CONNECTED     3097406  1283263/nginx: mast
    unix  3      [ ]         STREAM     CONNECTED     3097405  1283263/nginx: mast

    gke-zonal-110-default-pool-fe00befa-n2hx ~ # nsenter -t 1283169 --net netstat -anp
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1283263/nginx: mast
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
    unix  3      [ ]         STREAM     CONNECTED     3097406  1283263/nginx: mast
    unix  3      [ ]         STREAM     CONNECTED     3097405  1283263/nginx: mast

    You can also execute netstat using ip netns, but you need to link the network namespace of the process manually, because Docker does not create the link:

    # ln -s /proc/1283169/ns/net /var/run/netns/1283169
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # ip netns list
    1283169 (id: 2)
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # ip netns exec 1283169 netstat -anp
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1283263/nginx: mast
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
    unix  3      [ ]         STREAM     CONNECTED     3097406  1283263/nginx: mast
    unix  3      [ ]         STREAM     CONNECTED     3097405  1283263/nginx: mast
    gke-zonal-110-default-pool-fe00befa-n2hx ~ # rm /var/run/netns/1283169

Mitigation:

The short-term mitigation is to identify the stale processes by using the method outlined previously, and to end them by using the kill [PID] command.
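
For example, using the hypothetical PIDs from the earlier ps output, ending the stale shell process might look like the following; always confirm that the PID belongs to the stale container before killing it.

# Hypothetical stale PID taken from the ps output; verify it first.
kill 1283169
# Escalate only if the process does not exit:
kill -9 1283169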

Long-term mitigation involves identifying why Docker is crashing and fixing that. Possible reasons include:

  • Zombie processes piling up, exhausting the available process IDs
  • Bug in docker
  • Resource pressure / OOM

