CN116896499B


Info

Publication number: CN116896499B
Application number: CN202310691625.6A
Authority: CN
Inventors: 王喆; 郭歌; 刘承亮; 朱韦桥
Original assignee: Institute of Computing Technologies of CARS
Current assignee: Institute of Computing Technologies of CARS
Priority date: 2023-06-12
Filing date: 2023-06-12
Publication date: 2024-03-19
Anticipated expiration: 2043-06-12
Also published as: CN116896499A

Abstract

The invention discloses a kubernetes Pod network error checking system and method, which relate to the field of network abnormal state detection. A detection scheduling component obtains cluster information; out-of-cluster detection agents and intra-cluster detection agents then periodically detect the network connectivity of the pods in the cluster. Based on the network connectivity detection results, pod network problems that are about to occur or have already occurred are identified and the fault is located, so that the source of the problem can be found and resolved in the shortest possible time, the service capability provided by the containers in the pod can be restored, and the stable operation of the cluster is reliably guaranteed.

Description

kubernetes Pod network error checking system and method

Technical Field

The invention relates to the field of kubernetes Pod network connectivity detection, in particular to an automatic kubernetes Pod network error checking system and method.

Background

Kubernetes is increasingly popular in enterprises, many of which are containerizing their applications and migrating them to Kubernetes platforms. Kubernetes packages containers into Pods and provides powerful container orchestration capabilities, including Pod resource management, scheduling, and network management. In production, Pods on a Kubernetes platform very commonly run into problems such as abnormal Pod status or inaccessibility. However, because the Kubernetes network itself is complex and the flannel VXLAN overlay network architecture is adopted, Pod network problems are not easy to discover and troubleshooting takes a relatively long time, which affects the stability of the services that applications provide.

Existing Kubernetes monitoring schemes generally combine Grafana and Prometheus: Prometheus collects metric data from the monitored objects and stores it in a time-series database, and Grafana generates visual charts from the collected data. However, this data comes from metrics that the monitored objects expose externally or from the cluster's own CPU, memory, and storage usage, and it lacks any observation of Pod network connectivity (IP address connectivity, port reachability, and domain name reachability).

Disclosure of Invention

The invention aims to provide a kubernetes Pod network error checking system and method that detect the network connectivity (IP address connectivity, port reachability, domain name reachability, and the like) of Pods in a kubernetes cluster in real time, and that can quickly identify the cause of an abnormal Pod network state and locate the fault.

In order to achieve the above object, the present invention provides the following solutions:

a kubernetes Pod network error checking system, the system comprising: a detection scheduling component, at least one out-of-cluster detection agent arranged outside the kubernetes cluster, and intra-cluster detection agents arranged in the kubernetes cluster; one intra-cluster detection agent is deployed on each node in the kubernetes cluster;

The detection scheduling component is used for acquiring cluster information of the kubernetes cluster, sending the cluster information to the intra-cluster detection agent and the out-of-cluster detection agent, receiving a network connectivity result detected by the intra-cluster detection agent and a network communication state result detected by the out-of-cluster detection agent, and determining the cause of the network abnormality of the pod to be detected according to the network connectivity result and the network communication state result; the cluster information comprises pod information, node information, service information, Ingress information and CoreDNS information;

the intra-cluster detection agent is used for periodically detecting, based on the cluster information, the network connectivity between the pod to be detected and same-node pods and the network connectivity between the pod to be detected and different-node pods, and sending the network connectivity result to the detection scheduling component; a same-node pod is a pod located on the same node as the pod to be detected; a different-node pod is a pod located on a different node from the pod to be detected;

the out-of-cluster detection agent is used for periodically detecting, based on the cluster information, the network communication state between the detected pod and out-of-cluster services, and sending the network communication state result to the detection scheduling component; the out-of-cluster services include an Ingress service and a NodePort service.

The invention also provides a kubernetes Pod network error checking method, which comprises the following steps:

the detection scheduling component acquires the cluster information of the kubernetes cluster and sends the cluster information to the intra-cluster detection agent and the out-of-cluster detection agent; the cluster information comprises pod information, node information, service information, Ingress information and CoreDNS information;

the intra-cluster detection agent periodically detects, based on the cluster information, the network connectivity between the pod to be detected and same-node pods and the network connectivity between the pod to be detected and different-node pods, and sends the network connectivity result to the detection scheduling component; a same-node pod is a pod located on the same node as the pod to be detected; a different-node pod is a pod located on a different node from the pod to be detected;

the out-of-cluster detection agent periodically detects, based on the cluster information, the network communication state between the detected pod and out-of-cluster services, and sends the network communication state result to the detection scheduling component; the out-of-cluster services include an Ingress service and a NodePort service;

and the detection scheduling component receives the network connectivity result detected by the intra-cluster detection agent and the network communication state result detected by the out-of-cluster detection agent, and determines the cause of the network abnormality of the pod to be detected according to the network connectivity result and the network communication state result.

Optionally, the step in which the intra-cluster detection agent periodically detects, based on the cluster information, the network connectivity between the pod to be detected and same-node pods and between the pod to be detected and different-node pods, and sends the network connectivity result to the detection scheduling component, specifically includes:

a first intra-cluster detection agent periodically executes a ping command and a telnet command to test network connectivity with the detected pod, obtains a first network connectivity result, and sends the first network connectivity result to the detection scheduling component; the first intra-cluster detection agent is the intra-cluster detection agent on the same node as the detected pod;

a second intra-cluster detection agent periodically executes a ping command, a telnet command and a curl command to test cross-node network connectivity with the detected pod, obtains a second network connectivity result, and sends the second network connectivity result to the detection scheduling component; the second intra-cluster detection agent is an intra-cluster detection agent on a different node from the detected pod;

the second intra-cluster detection agent uses ping commands to check connectivity with the node where the detected Pod is located, the node VTEP device of that node, and the cni0 bridge of that node respectively, obtains a third network connectivity result, and sends the third network connectivity result to the detection scheduling component; the third network connectivity result comprises the network connectivity states between the second intra-cluster detection agent and, respectively, the node where the detected Pod is located, the node VTEP device and the cni0 bridge; the network connectivity result includes the first network connectivity result, the second network connectivity result, and the third network connectivity result;

And sending the network connectivity result to the detection scheduling component.

Optionally, the step in which the first intra-cluster detection agent periodically executes a ping command and a telnet command to test network connectivity with the detected pod, so as to obtain a first network connectivity result, specifically includes:

the first intra-cluster detection agent periodically executes a ping command against the detected pod to determine a first Pod IP address ping state of the detected pod;

if the first Pod IP address ping state is that the Pod IP address cannot be pinged, the detected Pod is marked as IP unreachable, and tcpdump packet capture analysis is performed;

if the first Pod IP address ping state is that the Pod IP address can be pinged, the detected Pod is marked as IP reachable, and the first intra-cluster detection agent executes a telnet command against the detected Pod to determine a first Pod port telnet state of the detected Pod;

if the first Pod port telnet state is that the Pod port can be reached by telnet, the detected Pod is marked as Pod port reachable;

if the first Pod port telnet state is that the Pod port cannot be reached by telnet, the detected Pod is marked as Pod port unreachable, and tcpdump packet capture analysis is performed; the first network connectivity result includes the first Pod IP address ping state and the first Pod port telnet state.

Optionally, the step in which the second intra-cluster detection agent periodically executes a ping command, a telnet command, and a curl command to test cross-node network connectivity with the detected pod, so as to obtain a second network connectivity result, specifically includes:

the second intra-cluster detection agent executes a ping command against the detected Pod to determine a second Pod IP address ping state of the detected Pod;

if the second Pod IP address ping state is that the Pod IP address can be pinged, the detected Pod is marked as IP reachable, and the second intra-cluster detection agent executes a telnet command against the detected Pod to determine a second Pod port telnet state of the detected Pod;

if the second Pod port telnet state is that the Pod port cannot be reached by telnet, the detected Pod is marked as Pod port unreachable;

if the second Pod port telnet state is that the Pod port can be reached by telnet, the detected Pod is marked as Pod port reachable, and the second intra-cluster detection agent executes a curl command against the detected Pod to determine a service domain name reachable state of the detected Pod;

if the service domain name reachable state is that the service domain name is reachable, the detected pod is marked as service domain name reachable;

if the service domain name reachable state is that the service domain name is unreachable, the detected pod is marked as service domain name unreachable, and the second intra-cluster detection agent checks the CoreDNS service state; if the CoreDNS service state is abnormal, the detected pod is marked as CoreDNS abnormal; if the CoreDNS service state is normal, tcpdump packet capture analysis is performed for the detected pod;

if the second Pod IP address ping state is that the Pod IP address cannot be pinged, the second intra-cluster detection agent uses ping commands to check connectivity with the node where the detected Pod is located, the node VTEP device of that node, and the cni0 bridge of that node, so as to obtain a third network connectivity result.

Optionally, when the first detection finds that the second Pod IP address cannot be pinged, or that the second Pod port cannot be reached by telnet, or that the service domain name is unreachable, the detection scheduling component assigns a new second intra-cluster detection agent, replaces the second intra-cluster detection agent with the new second intra-cluster detection agent, and returns to the step in which the second intra-cluster detection agent executes a ping command against the detected Pod to determine the second Pod IP address ping state, until the second Pod IP address ping state, the second Pod port telnet state and the service domain name reachable state of the second detection are obtained;

when the two detection results are consistent, the cause of the network abnormality is the detected pod;

when the two detection results are inconsistent, the second intra-cluster detection agent and the new second intra-cluster detection agent are marked as abnormal, and manual confirmation is performed.

Optionally, the step in which the second intra-cluster detection agent uses ping commands to check connectivity with the node where the detected Pod is located, the node VTEP device of that node, and the cni0 bridge of that node, so as to obtain a third network connectivity result, specifically includes:

the second intra-cluster detection agent executes a ping command against the cni0 bridge of the detected Pod to determine the IP state of the cni0 bridge;

if the IP state of the cni0 bridge is that the cni0 bridge IP is reachable, the detected Pod is marked as abnormal;

if the IP state of the cni0 bridge is that the cni0 bridge IP is unreachable, the second intra-cluster detection agent executes a ping command against the node VTEP device of the detected Pod to determine the IP state of the VTEP;

if the IP state of the VTEP is that the VTEP IP is reachable, the cni0 bridge is marked as abnormal;

if the IP state of the VTEP is that the VTEP IP is unreachable, the second intra-cluster detection agent executes a ping command against the node where the detected Pod is located to determine the IP state of that node;

if the IP of the node where the detected Pod is located is reachable, the node VTEP device is marked as abnormal;

if the IP of the node where the detected Pod is located is unreachable, the node where the detected Pod is located is marked as abnormal, and tcpdump packet capture analysis is performed.

Optionally, the step in which the out-of-cluster detection agent periodically detects, based on the cluster information, the network communication state between the detected pod and out-of-cluster services specifically includes:

the out-of-cluster detection agent executes a telnet command and a curl command to access the Ingress domain name associated with the detected pod, so as to obtain the access state of the Ingress domain name;

if the Ingress domain name is accessible, the service is marked as accessible from outside the cluster; if the Ingress domain name is inaccessible, the out-of-cluster detection agent checks the Ingress service state; if the Ingress service state is abnormal, the Ingress is marked as abnormal; if the Ingress service state is normal, the out-of-cluster detection agent checks the service/endpoint state; if the service/endpoint state is normal, the Ingress configuration and the DNS service are checked; if the service/endpoint state is abnormal, the service/endpoint is marked as abnormal;

the out-of-cluster detection agent executes a telnet command and a curl command to access the service NodePort associated with the detected pod, so as to obtain a service NodePort reachable state;

if the service NodePort is reachable, the service is marked as accessible from outside the cluster;

if the service NodePort is not reachable, the out-of-cluster detection agent checks the service/endpoint state; if the service/endpoint state is normal, tcpdump packet capture analysis is performed; if the service/endpoint state is abnormal, the service/endpoint is marked as abnormal.
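The out-of-cluster decision steps above can be sketched as two small functions. This is only an illustrative sketch in Python: the function names, the returned labels, and the idea of passing each check in as a callable are assumptions made for the example, not part of the disclosed system.

```python
from typing import Callable

Check = Callable[[], bool]  # hypothetical probe: returns True when the check passes


def ingress_check(ingress_domain_ok: Check, ingress_service_ok: Check,
                  service_endpoint_ok: Check) -> str:
    """Follow the Ingress branch and return a diagnosis label."""
    if ingress_domain_ok():
        return "service_accessible"
    if not ingress_service_ok():
        return "ingress_abnormal"
    if not service_endpoint_ok():
        return "service_endpoint_abnormal"
    # Ingress service and service/endpoint are normal: look at configuration and DNS.
    return "check_ingress_config_and_dns"


def nodeport_check(nodeport_ok: Check, service_endpoint_ok: Check) -> str:
    """Follow the NodePort branch and return a diagnosis label."""
    if nodeport_ok():
        return "service_accessible"
    if not service_endpoint_ok():
        return "service_endpoint_abnormal"
    # Service/endpoint is normal but the NodePort is unreachable: capture packets.
    return "needs_tcpdump"
```

Each check is evaluated lazily, so a later probe runs only when the earlier ones have not already settled the diagnosis.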

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides a kubernetes Pod network error checking system and method in which a detection scheduling component obtains cluster information (including node, pod, service, Ingress and other information); out-of-cluster detection agents and intra-cluster detection agents periodically detect the network connectivity (IP connectivity, port connectivity and domain name reachability) of the pods in the kubernetes cluster, so that a cluster pod network connection topology graph is updated in real time, and pod network problems that are about to occur or have already occurred are located according to that topology graph. When a network connectivity fault occurs in a Pod in the cluster, an analysis result that locates the cause is given for common problems, and the necessary supporting data is given for complex problems, so that operation and maintenance personnel can locate the problem conveniently. This improves the efficiency of troubleshooting the causes of Pod network abnormalities and allows faults to be located quickly and accurately.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic diagram of a kubernetes Pod network error checking system according to embodiment 1 of the present invention;

Fig. 2 is a flowchart of the kubernetes Pod network error checking method provided in embodiment 2 of the present invention;

Fig. 3 shows the test logic of the same-node intra-cluster detection agent provided in embodiment 2 of the present invention;

Fig. 4 shows the test logic of the different-node intra-cluster detection agent provided in embodiment 2 of the present invention;

Fig. 5 shows the test logic of the out-of-cluster detection agent in Ingress mode provided in embodiment 2 of the present invention;

Fig. 6 shows the test logic of the out-of-cluster detection agent in NodePort mode provided in embodiment 2 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In existing kubernetes monitoring schemes, when operation and maintenance personnel troubleshoot a problem, they often have to test the connectivity of the problem Pod repeatedly and, if necessary, use the packet capture tool tcpdump for packet capture analysis. Additional tool containers must be deployed in the cluster to support the various network commands, which clearly increases the workload of operation and maintenance personnel, prolongs the time needed to resolve problems, and hinders operation and maintenance work.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Example 1

To detect the communication state of the pod network in a kubernetes cluster, an automatic network error checking tool needs to be deployed inside and outside the cluster, consisting of the following three types of components: the detection scheduling component, out-of-cluster detection agents (one or more), and intra-cluster detection agents (one deployed on each node in the cluster), as shown in Fig. 1. Accordingly, the present embodiment provides a kubernetes Pod network error checking system, the system comprising: a detection scheduling component, at least one out-of-cluster detection agent arranged outside the kubernetes cluster, and intra-cluster detection agents arranged in the kubernetes cluster; one intra-cluster detection agent is deployed on each node in the kubernetes cluster.

The detection scheduling component is used for acquiring cluster information of the kubernetes cluster, sending the cluster information to the intra-cluster detection agent and the out-of-cluster detection agent, receiving a network connectivity result detected by the intra-cluster detection agent and a network communication state result detected by the out-of-cluster detection agent, and determining the cause of the network abnormality of the pod to be detected according to the network connectivity result and the network communication state result; the cluster information comprises pod information, node information, service information, Ingress information and CoreDNS information.

The detection scheduling component is configured with an access certificate for the kubernetes cluster and can obtain the following cluster information through the cluster API server interface:

pod information (IP address, port, host)

node information (host IP, host name)

service information (clusterIP, service name, service port, nodeport, associated endpoint information)

Ingress information (Ingress service name, associated service)

CoreDNS information (pod running state of CoreDNS in clusters, CPU and memory occupancy)
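The cluster information listed above could be held in plain records before it is sent to the agents. The following is a minimal sketch in Python; all class and field names (`ClusterInfo`, `PodInfo`, the `name` field on a pod, and so on) are illustrative assumptions and are not taken from the disclosed system or from any Kubernetes client library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PodInfo:
    name: str   # assumed identifier, not listed in the source
    ip: str
    port: int
    host: str   # name of the node the pod runs on


@dataclass
class NodeInfo:
    host_ip: str
    host_name: str


@dataclass
class ServiceInfo:
    name: str
    cluster_ip: str
    port: int
    node_port: Optional[int] = None
    endpoints: List[str] = field(default_factory=list)


@dataclass
class IngressInfo:
    name: str
    service: str  # name of the associated service


@dataclass
class CoreDNSInfo:
    running: bool
    cpu_percent: float
    memory_percent: float


@dataclass
class ClusterInfo:
    pods: List[PodInfo] = field(default_factory=list)
    nodes: List[NodeInfo] = field(default_factory=list)
    services: List[ServiceInfo] = field(default_factory=list)
    ingresses: List[IngressInfo] = field(default_factory=list)
    coredns: Optional[CoreDNSInfo] = None


def pods_on_node(info: ClusterInfo, host_name: str) -> List[PodInfo]:
    """Return the pods that run on the given node."""
    return [p for p in info.pods if p.host == host_name]
```

A helper such as `pods_on_node` is the kind of lookup an agent would need to tell same-node pods from different-node pods.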

After the detection scheduling component obtains the relevant cluster network configuration, it sends this information through interfaces to the detection agents outside and inside the cluster. Each detection agent runs as a container inside or outside the cluster, includes a data and command receiving interface and a result output interface, and has a built-in network detection toolkit (ping, telnet, curl, tcpdump, and the like).
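As an illustration of the kind of port check such a toolkit performs, the sketch below uses a plain TCP connect from the Python standard library in place of the telnet command. The function name `tcp_port_open` and the default timeout are assumptions made for the example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A successful connect only shows that something is listening on the port; it says nothing about whether the application behind it answers correctly, which is why the source follows the port check with a curl request.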

The detection scheduling component is further configured to display the real-time network connectivity result and the network communication status result on a cluster pod connectivity topology map.

The intra-cluster detection agent is used for periodically detecting, based on the cluster information, the network connectivity between the pod to be detected and same-node pods and the network connectivity between the pod to be detected and different-node pods, and sending the network connectivity result to the detection scheduling component; a same-node pod is a pod located on the same node as the pod to be detected; a different-node pod is a pod located on a different node from the pod to be detected.

For any pod to be detected in the cluster, at least two intra-cluster detection agents communicate with it: one is the intra-cluster detection agent on the same node as the detected pod, and the other is an intra-cluster detection agent on a different node.
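One way to picture this pairing is a small selection routine. The sketch below is an assumption-laden illustration in Python: the mapping from node name to agent name, the function name `assign_agents`, and the `exclude` parameter are hypothetical; the random choice of the different-node agent follows the random assignment that the method describes for the cross-node check.

```python
import random
from typing import Dict, Iterable, Optional, Tuple


def assign_agents(
    pod_node: str,
    agents_by_node: Dict[str, str],
    rng: Optional[random.Random] = None,
    exclude: Iterable[str] = (),
) -> Tuple[Optional[str], Optional[str]]:
    """Pick the same-node agent and a random agent from another node."""
    rng = rng or random.Random()
    excluded = set(exclude)
    same_node_agent = agents_by_node.get(pod_node)
    # Candidate agents live on any node other than the pod's node.
    candidates = sorted(
        agent for node, agent in agents_by_node.items()
        if node != pod_node and agent not in excluded
    )
    other_node_agent = rng.choice(candidates) if candidates else None
    return same_node_agent, other_node_agent
```

The `exclude` argument lets a scheduler ask for a different agent on a later round, for example when a result needs to be confirmed by a second agent.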

Example 2

As shown in fig. 2, the present embodiment provides a kubernetes Pod network error checking method implemented based on the kubernetes Pod network error checking system of embodiment 1, where the method includes:

S1: the detection scheduling component acquires the cluster information of the kubernetes cluster and sends the cluster information to the intra-cluster detection agent and the out-of-cluster detection agent; the cluster information comprises pod information, node information, service information, Ingress information and CoreDNS information.

pod information (IP address, port, host)

node information (host IP, host name)

Ingress information (Ingress service name, associated service)

S2: the intra-cluster detection agent periodically detects, based on the cluster information, the network connectivity between the pod to be detected and same-node pods and the network connectivity between the pod to be detected and different-node pods, and sends the network connectivity result to the detection scheduling component; a same-node pod is a pod located on the same node as the pod to be detected; a different-node pod is a pod located on a different node from the pod to be detected.

S21: The intra-cluster detection agent on the same node as the detected Pod periodically (every 10 s/30 s) executes a ping command and a telnet command to test network connectivity with the detected Pod, updates the network communication state between the detected Pod and the same-node Pod according to the reachability results (specifically, whether the Pod IP address can be pinged and whether the Pod port can be reached by telnet), and returns the result to the detection scheduling component. That is, the first intra-cluster detection agent periodically executes a ping command and a telnet command to test network connectivity with the detected pod, obtains a first network connectivity result, and sends the first network connectivity result to the detection scheduling component; the first intra-cluster detection agent is the intra-cluster detection agent on the same node as the detected pod. Specifically, as shown in Fig. 3, where "intra-cluster" denotes the intra-cluster detection agent and "target pod" denotes the detected pod, the test process is as follows:

The first intra-cluster detection agent periodically executes a ping command against the detected pod to determine the first Pod IP address ping state of the detected pod.

If the first Pod IP address ping state is that the Pod IP address cannot be pinged, the detected Pod is marked as IP unreachable, and tcpdump packet capture analysis is performed.

If the first Pod IP address ping state is that the Pod IP address can be pinged, the detected Pod is marked as IP reachable, and the first intra-cluster detection agent executes a telnet command against the detected Pod to determine the first Pod port telnet state of the detected Pod.

If the first Pod port telnet state is that the Pod port can be reached by telnet, the detected Pod is marked as Pod port reachable.
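The same-node test process can be sketched as a short function. This is a minimal illustration in Python under stated assumptions: `ping` and `telnet` are hypothetical callables standing in for the agent's built-in tools, and the result dictionary keys are invented for the example. The port-unreachable branch follows the corresponding optional step in the Disclosure, which triggers packet capture in that case.

```python
from typing import Callable, Dict, Optional


def same_node_check(
    pod_ip: str,
    pod_port: int,
    ping: Callable[[str], bool],
    telnet: Callable[[str, int], bool],
) -> Dict[str, Optional[bool]]:
    """Run the ping-then-telnet sequence and record what was found."""
    result: Dict[str, Optional[bool]] = {
        "ip_reachable": False,
        "port_reachable": None,   # None means the port was never tested
        "needs_tcpdump": False,
    }
    if not ping(pod_ip):
        # IP unreachable: stop here and hand over to packet capture.
        result["needs_tcpdump"] = True
        return result
    result["ip_reachable"] = True
    result["port_reachable"] = telnet(pod_ip, pod_port)
    if not result["port_reachable"]:
        result["needs_tcpdump"] = True
    return result
```

Passing the probes in as callables keeps the decision logic separate from how ping and telnet are actually invoked inside the agent container.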

S22: For each Pod to be detected, an intra-cluster detection agent on a different node from that Pod periodically executes a ping command, a telnet command and a curl command (if there are N Pods to be detected in the cluster, the scheduling component randomly assigns to each of them a detection agent that is not on the same node) to test cross-node network connectivity with the Pod, and updates the network communication state between the Pod and different-node Pods according to the reachability results, specifically whether the Pod IP address can be pinged, whether the Pod port can be reached by telnet, and whether the service domain name is reachable. That is, the second intra-cluster detection agent periodically executes a ping command, a telnet command and a curl command to test cross-node network connectivity with the detected pod, obtains a second network connectivity result, and sends the second network connectivity result to the detection scheduling component; the second intra-cluster detection agent is an intra-cluster detection agent on a different node from the detected pod. As shown in Fig. 4, the specific detection process is:

The second intra-cluster detection agent executes a ping command against the detected Pod to determine the second Pod IP address ping state of the detected Pod.

If the second Pod IP address ping state is that the Pod IP address can be pinged, the detected Pod is marked as IP reachable, and the second intra-cluster detection agent executes a telnet command against the detected Pod to determine the second Pod port telnet state of the detected Pod.

If the second Pod port telnet state is that the Pod port cannot be reached by telnet, the detected Pod is marked as Pod port unreachable.

If the second Pod port telnet state is that the Pod port can be reached by telnet, the detected Pod is marked as Pod port reachable, and the second intra-cluster detection agent executes a curl command against the detected Pod to determine the service domain name reachable state of the detected Pod.

If the service domain name is reachable, the detected pod is marked as service domain name reachable.

If the service domain name is unreachable, the detected pod is marked as service domain name unreachable, and the second intra-cluster detection agent checks the CoreDNS service state; if the CoreDNS service state is abnormal, the detected pod is marked as CoreDNS abnormal; if the CoreDNS service state is normal, tcpdump packet capture analysis is performed for the detected pod.

If the second Pod IP address ping state is that the Pod IP address cannot be pinged, step S23 is executed: the second intra-cluster detection agent uses ping commands to check connectivity with the node where the detected Pod is located, the node VTEP device of that node, and the cni0 bridge of that node, so as to obtain a third network connectivity result.
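The cross-node branch structure can be summarised in code. The sketch below is illustrative only: the enum `CrossNodeOutcome`, the function name, and the probe callables (`ping`, `telnet`, `curl`, `coredns_healthy`) are assumptions introduced for the example, not the patented implementation.

```python
from enum import Enum
from typing import Callable


class CrossNodeOutcome(Enum):
    SERVICE_DOMAIN_REACHABLE = "service_domain_reachable"
    POD_PORT_UNREACHABLE = "pod_port_unreachable"
    COREDNS_ABNORMAL = "coredns_abnormal"
    NEEDS_TCPDUMP = "needs_tcpdump"
    RUN_STEP_S23 = "run_step_s23"  # continue with the node / VTEP / cni0 checks


def cross_node_check(
    pod_ip: str,
    pod_port: int,
    service_url: str,
    ping: Callable[[str], bool],
    telnet: Callable[[str, int], bool],
    curl: Callable[[str], bool],
    coredns_healthy: Callable[[], bool],
) -> CrossNodeOutcome:
    """Apply ping, telnet and curl in order and return the resulting diagnosis."""
    if not ping(pod_ip):
        return CrossNodeOutcome.RUN_STEP_S23
    if not telnet(pod_ip, pod_port):
        return CrossNodeOutcome.POD_PORT_UNREACHABLE
    if curl(service_url):
        return CrossNodeOutcome.SERVICE_DOMAIN_REACHABLE
    # Domain name unreachable: distinguish a DNS fault from everything else.
    if not coredns_healthy():
        return CrossNodeOutcome.COREDNS_ABNORMAL
    return CrossNodeOutcome.NEEDS_TCPDUMP
```

Each probe runs only if the previous layer succeeded, which mirrors the IP, port, then domain name ordering of the described process.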

In the detection process of step S22, in order to ensure accuracy of the network connectivity detection result, two detection determinations may be performed.

When the network state of the detected pod is abnormal (IP unreachable, port unreachable, or domain name unreachable), the detection scheduling component assigns another second intra-cluster detection agent, which is likewise on a different node from the detected pod, and the detection procedure of step S22 above is executed again. The result is also returned to the detection scheduling component and compared with the previous result. When the two results agree, the network problem can be attributed to the detected pod; if the results are inconsistent, both agents are marked as abnormal, pending manual confirmation. Specifically, if the first detection finds that the second Pod IP address cannot be pinged, or that the second Pod port cannot be reached by telnet, or that the service domain name is unreachable, the detection scheduling component assigns a new second intra-cluster detection agent, replaces the second intra-cluster detection agent with the new second intra-cluster detection agent, and returns to the step in which the second intra-cluster detection agent executes a ping command against the detected Pod to determine the second Pod IP address ping state, until the second Pod IP address ping state, the second Pod port telnet state and the service domain name reachable state of the second detection are obtained; when the two detection results are consistent, the cause of the network abnormality is the detected pod; when the two detection results are inconsistent, the second intra-cluster detection agent and the new second intra-cluster detection agent are marked as abnormal, and manual confirmation is performed.
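The two-round confirmation can be sketched as follows. This Python example is a hedged illustration: `ProbeResult`, `confirm_with_second_agent`, the `probe` and `pick_new_agent` callables, and the returned labels are all hypothetical names chosen for the example.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class ProbeResult(NamedTuple):
    ip_ok: bool
    port_ok: Optional[bool]    # None when the step was not reached
    domain_ok: Optional[bool]  # None when the step was not reached


def is_abnormal(result: ProbeResult) -> bool:
    """A result is abnormal unless IP, port and domain name all succeeded."""
    return not (result.ip_ok and result.port_ok and result.domain_ok)


def confirm_with_second_agent(
    first_agent: str,
    probe: Callable[[str], ProbeResult],
    pick_new_agent: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Return a verdict and the list of agents to flag as abnormal."""
    first = probe(first_agent)
    if not is_abnormal(first):
        return "normal", []
    # Abnormal result: repeat the same checks from a newly assigned agent.
    second_agent = pick_new_agent(first_agent)
    second = probe(second_agent)
    if first == second:
        return "pod_fault", []
    return "manual_confirmation", [first_agent, second_agent]
```

Comparing the two results as whole tuples treats any difference, at any layer, as a disagreement between the agents.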

As shown in fig. 4, the second intra-cluster detection agent tests connectivity, using the ping command, with the node where the detected Pod is located, the VTEP device of that node, and the cni0 bridge of that node, in the following order:

The second intra-cluster detection agent executes a ping command to the cni0 bridge of the detected Pod and determines the IP state of the cni0 bridge.

If the cni0 bridge IP is reachable, the detected Pod is marked abnormal.

If the cni0 bridge IP is unreachable, the second intra-cluster detection agent executes a ping command to the VTEP device of the node of the detected Pod and determines the IP state of the VTEP.

If the VTEP IP is reachable, the cni0 bridge is marked abnormal.

If the VTEP IP is unreachable, the second intra-cluster detection agent executes a ping command to the node where the detected Pod is located and determines the IP state of that node.

If the IP of the node where the detected Pod is located is reachable, the VTEP device of the node is marked abnormal.
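The order of checks above amounts to a short decision ladder. The sketch below assumes a caller-supplied `ping` function that returns whether an address answered; the function name `locate_fault` and the verdict strings are hypothetical. The final branch (node unreachable) follows the node anomaly case of this description.

```python
from typing import Callable


def locate_fault(ping: Callable[[str], bool],
                 cni0_ip: str, vtep_ip: str, node_ip: str) -> str:
    """Walk the cni0 -> VTEP -> node ladder after the Pod IP failed to ping."""
    if ping(cni0_ip):
        return "pod_abnormal"            # bridge answers, so the pod is at fault
    if ping(vtep_ip):
        return "cni0_bridge_abnormal"    # VTEP answers, bridge does not
    if ping(node_ip):
        return "vtep_device_abnormal"    # node answers, VTEP does not
    return "node_abnormal"               # nothing answers: capture packets


# Example with a fake ping: only the node address is reachable.
reachable = {"192.168.1.10"}
verdict = locate_fault(lambda ip: ip in reachable,
                       cni0_ip="10.244.2.1",
                       vtep_ip="10.244.2.0",
                       node_ip="192.168.1.10")
```

Each later ping runs only when the earlier one fails, which matches the order of the steps.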

When the network state between an intra-cluster detection agent on a different node and the detected pod is abnormal, the error causes that can be determined include:

Pod state anomaly: only the pod IP is unreachable, while the IP of the node where the pod is located and the IPs of the CNI devices are all reachable; the pod state can then be judged abnormal.

cni0 bridge anomaly: the cni0 bridge IP is unreachable but the VTEP device IP is reachable; a bridge failure can then be determined.

VTEP device failure: the node IP is reachable but the VTEP device IP is unreachable; a VTEP device failure (e.g., a flanneld.1 interface failure) can then be determined.

Node anomaly: if the node IP is unreachable, the node where the detection agent is located and the node where the detected pod is located are judged unreachable from each other, and confirmation by packet capture is required.

CoreDNS anomaly: the CoreDNS service in the cluster is reported as abnormal; this state can be obtained from the API server.

When the failure cause cannot be determined directly, the intra-cluster detection agent or the out-of-cluster detection agent can provide tcpdump packet capture information to support manual diagnosis, as in the tcpdump packet capture analysis steps shown in fig. 3 and 4. During capture, the IP address and port of the detection agent are used as the source address and port, the IP address and port of the pod/cni/vtep/node are used as the destination address (and port), and the data are filtered by these IP addresses and ports to facilitate fault location.
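The filtering by source and destination can be expressed as a pcap filter string. The helper below is a hypothetical sketch: `capture_filter` and its parameters are not part of the embodiment, and the addresses in the example are made up. It uses only the standard `src host`, `src port`, `dst host`, and `dst port` filter primitives.

```python
from typing import Optional


def capture_filter(agent_ip: str, target_ip: str,
                   agent_port: Optional[int] = None,
                   target_port: Optional[int] = None) -> str:
    """Build a pcap filter for traffic sent by the agent to the probed target."""
    parts = [f"src host {agent_ip}"]
    if agent_port is not None:
        parts.append(f"src port {agent_port}")
    parts.append(f"dst host {target_ip}")
    if target_port is not None:
        parts.append(f"dst port {target_port}")
    return " and ".join(parts)


# Example: agent 10.244.1.5 probing port 8080 of pod 10.244.2.7.
expr = capture_filter("10.244.1.5", "10.244.2.7", target_port=8080)
```

The resulting expression can be passed to tcpdump as its filter argument, for example `tcpdump -i any -nn '<expr>'`. It matches only the agent-to-target direction; capturing replies would need the mirrored expression as well.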

S3: the out-of-cluster detection agent periodically detects, based on the cluster information, the network communication state between the detected pod and the out-of-cluster services, and sends the network communication state result to the detection scheduling component; the out-of-cluster services include an Ingress service and a nodeport service.

Splitting of out-of-cluster detection tasks. One or more out-of-cluster detection agents may be started depending on the number of pods in the cluster, which reduces the detection workload of each agent. When there are multiple out-of-cluster detection agents, each one handles network connectivity detection for a portion of the pods in the cluster; that is, each out-of-cluster detection agent has its own detection range and is responsible for the pods within that range.
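The description does not say how detection ranges are assigned. As one possible illustration, the sketch below splits the pods round-robin over the available agents; the round-robin policy, the function name `split_pods`, and the pod and agent names are all assumptions.

```python
from typing import Dict, List


def split_pods(pods: List[str], agents: List[str]) -> Dict[str, List[str]]:
    """Assign every pod to exactly one out-of-cluster agent (round-robin)."""
    if not agents:
        raise ValueError("at least one out-of-cluster detection agent is required")
    ranges: Dict[str, List[str]] = {agent: [] for agent in agents}
    # Sorting makes the assignment stable between detection cycles.
    for index, pod in enumerate(sorted(pods)):
        ranges[agents[index % len(agents)]].append(pod)
    return ranges


ranges = split_pods(["web-2", "web-0", "web-1"], ["agent-a", "agent-b"])
```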

The error causes that can be determined specifically include:

After the out-of-cluster detection agent executes the telnet command and the curl command against the service nodeport or the Ingress domain name associated with the detected pod, it updates the network communication state between the detected pod and the out-of-cluster services according to the result fed back by the detected pod: the Ingress domain name is accessible (or inaccessible), and the nodeport port is reachable (or unreachable). That is, step S3 specifically includes:

(1) The out-of-cluster detection agent executes a telnet command and a curl command to access the Ingress domain name associated with the detected pod and obtains the access state of the Ingress domain name, as shown in fig. 5. The figure indicates the out-of-cluster detection agent, and svc/ep is shorthand for service/endpoint.

If the Ingress domain name is accessible, the service is marked as accessible from outside the cluster. If the Ingress domain name is inaccessible, the Ingress service is marked unreachable and the out-of-cluster detection agent checks the Ingress service state. If the Ingress service state is abnormal, the Ingress is marked abnormal. If the Ingress service state is normal, the out-of-cluster detection agent checks the service/endpoint state: if the service/endpoint state is normal, the Ingress configuration and the DNS service are checked; if the service/endpoint state is abnormal, the service/endpoint is marked abnormal.

Ingress controller anomaly: the running state of the Ingress controller is obtained by reading the API server interface of the kubernetes cluster; if the state of the cluster's Ingress controller is abnormal, the Ingress controller is marked abnormal. The state of the Ingress controller is the access state of the Ingress domain name.

Service/endpoint anomaly: the relevant state (the service/endpoint state) is obtained by reading the cluster API server interface; if the cluster endpoint cannot be associated with the correct pod (the service/endpoint state is abnormal), the service/endpoint is considered abnormal.

If neither the Ingress nor the service/endpoint is abnormal, the DNS service currently in use must be checked manually to verify that the domain name resolves correctly to an IP address.
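The Ingress branch of step S3 can be summarized as a small decision function. This is a sketch under stated assumptions: the three boolean inputs stand for the probe and API server results described above, and the function name and verdict strings are hypothetical.

```python
def diagnose_ingress(domain_ok: bool, ingress_ok: bool, svc_ep_ok: bool) -> str:
    """Map the Ingress-path observations to a verdict.

    domain_ok  -- the Ingress domain name was accessible via telnet/curl
    ingress_ok -- the Ingress service state read from the API server is normal
    svc_ep_ok  -- the service/endpoint state read from the API server is normal
    """
    if domain_ok:
        return "service_accessible_from_outside"
    if not ingress_ok:
        return "ingress_abnormal"
    if not svc_ep_ok:
        return "service_endpoint_abnormal"
    # Nothing is flagged by the cluster: inspect Ingress configuration and DNS.
    return "check_ingress_config_and_dns"
```

The last branch corresponds to the manual check of the Ingress configuration and the DNS service.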

(2) The out-of-cluster detection agent executes a telnet command or a curl command to access the service nodeport associated with the detected pod and obtains the service nodeport reachable state, as shown in fig. 6.

If the service nodeport is reachable, the service is marked as accessible from outside the cluster; if the service nodeport is unreachable, the service is marked as unreachable from outside the cluster, and the out-of-cluster detection agent checks the service/endpoint state.

If the service/endpoint state is normal, tcpdump packet capture analysis is performed; if the service/endpoint state is abnormal, the service/endpoint is marked abnormal.
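A telnet-style reachability check and the nodeport branch can be sketched as follows. The embodiment uses the telnet and curl commands; opening a TCP connection with the Python standard library is a substitute chosen here for illustration, and the function names and verdict strings are hypothetical.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Telnet-style check: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def diagnose_nodeport(nodeport_ok: bool, svc_ep_ok: bool) -> str:
    """Map the nodeport-path observations to a verdict."""
    if nodeport_ok:
        return "service_accessible_from_outside"
    if not svc_ep_ok:
        return "service_endpoint_abnormal"
    return "needs_tcpdump_analysis"
```

A caller would pass `tcp_port_open(node_ip, node_port)` as `nodeport_ok` and the service/endpoint state read from the API server as `svc_ep_ok`.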

S4: the detection scheduling component receives the network connectivity result detected by the intra-cluster detection agent and the network communication state result detected by the out-of-cluster detection agent, and determines the cause of the network abnormality of the detected pod according to the network connectivity result and the network communication state result.

The detection scheduling component collects the pod connectivity information sent back by each detection agent, stores it in a database, and displays it on the cluster pod connectivity topology graph. In the graph, fully normal pods are marked green; partially connected pods (at least one of IP reachable, port reachable, and domain name reachable is not satisfied) are shown in yellow and annotated with the problem and the error cause; completely unreachable pods are marked red and annotated with the error cause. The topology graph is refreshed periodically. Operation and maintenance personnel use the topology graph to follow the real-time pod network situation of the cluster, repair problems promptly according to the current pod state and the error causes, and restore the cluster to a normal state.
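The colour marking can be sketched as a three-way classification. The function name is hypothetical, and treating "completely unreachable" as "none of the three checks succeeded" is an interpretation of the text rather than something it states explicitly.

```python
def pod_color(ip_ok: bool, port_ok: bool, domain_ok: bool) -> str:
    """Colour of a pod on the connectivity topology graph."""
    checks = (ip_ok, port_ok, domain_ok)
    if all(checks):
        return "green"    # IP, port, and domain name are all reachable
    if any(checks):
        return "yellow"   # partial connectivity: at least one check failed
    return "red"          # assumed meaning of "completely unreachable"
```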

This scheme detects the network connectivity of pods in the kubernetes cluster in real time. When a network connectivity fault occurs in a pod in the cluster, an analysis result that locates the fault cause is given for common problems, and the necessary supporting data is provided for complex problems, so that operation and maintenance personnel can locate the problem conveniently and have the technical means and information needed to solve it.

This embodiment is a kubernetes monitoring scheme. It comprises a management component (the detection scheduling component) deployed outside the cluster, which obtains cluster information (including node, pod, service, and ingress information) through the Kubernetes API Server. The out-of-cluster detection agent and the intra-cluster detection agent each perform periodic network connectivity detection (IP connectivity, port connectivity, and domain name accessibility) on the pods in the kubernetes cluster to obtain a cluster pod network connectivity topology graph. The graph marks the network connectivity of all pods in the cluster and is refreshed periodically, so operation and maintenance personnel can obtain the network state of the cluster pods in real time. Based on the network connectivity, faults are located for pod network problems that are about to occur or have already occurred, the root cause of the problem (such as a DNS anomaly or a CNI network plug-in anomaly) is found in the shortest possible time, the problem is solved, the service capability provided by the containers in the pod is restored, and a reliable guarantee is provided for the stable operation of the cluster.

This embodiment has the following effects: (1) operation and maintenance personnel can follow, in real time and with periodic refresh, the network connectivity of each pod in the kubernetes cluster and the state of the CNI network plug-in on the cluster nodes; (2) operation and maintenance personnel can learn in real time which pods in the cluster have problems, so that countermeasures can be taken immediately; (3) for common network faults, the cause of the problem and the faulty cluster component (such as a coredns fault or a CNI anomaly) can be given, so that the problem can be located quickly; (4) for complex network faults, network packet capture can be provided as data support for analyzing the fault cause.

Each embodiment is described with emphasis on its differences from the other embodiments; for parts that are identical or similar, the embodiments may be referred to one another.

The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description is intended only to assist in understanding the method of the present invention and its core ideas; modifications made by those of ordinary skill in the art in light of the present teachings also fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A kubernetes Pod network error checking method, characterized in that the method is implemented based on a kubernetes Pod network error checking system, the system comprising: a detection scheduling component and at least one out-of-cluster detection agent arranged outside the kubernetes cluster, and intra-cluster detection agents arranged in the kubernetes cluster; one intra-cluster detection agent is deployed on each node in the kubernetes cluster;

the method comprises the following steps:

the out-of-cluster detection agent periodically detects, based on the cluster information, the network communication state between the detected pod and the out-of-cluster services, and sends the network communication state result to the detection scheduling component; the out-of-cluster services comprise an Ingress service and a nodeport service;

the detection scheduling component receives the network connectivity result detected by the intra-cluster detection agent and the network communication state result detected by the out-of-cluster detection agent, and determines the cause of the network abnormality of the detected pod according to the network connectivity result and the network communication state result;

the intra-cluster detection agent periodically detects network connectivity of a pod to be detected and a same node pod and network connectivity of the pod to be detected and different node pods based on the cluster information, and sends the network connectivity result to the detection scheduling component, and specifically includes:

(1) The first intra-cluster detection agent periodically executes a ping command and a telnet command to test the network connectivity of the detected pod, obtains a first network connectivity result, and sends the first network connectivity result to the detection scheduling component; the first intra-cluster detection agent is an intra-cluster detection agent on the same node as the detected pod;

the first intra-cluster detection agent periodically executing a ping command and a telnet command to test the network connectivity of the detected pod and obtain a first network connectivity result specifically comprises:

the first intra-cluster detection agent periodically executes a ping command to the detected Pod to determine a first Pod IP address ping connectivity state of the detected Pod;

if the first Pod port telnet connectivity state indicates that the Pod port cannot be reached by telnet, the Pod port of the detected Pod is marked unreachable and tcpdump packet capture analysis is carried out; the first network connectivity result comprises the first Pod IP address ping connectivity state and the first Pod port telnet connectivity state;

(2) The second intra-cluster detection agent periodically executes a ping command, a telnet command, and a curl command to test the cross-node network connectivity with the detected pod, obtains a second network connectivity result, and sends the second network connectivity result to the detection scheduling component; the second intra-cluster detection agent is an intra-cluster detection agent on a different node from the detected pod;

the second intra-cluster detection agent periodically executes a ping command, a telnet command and a curl command to test the cross-node network connectivity with the detected pod to obtain a second network connectivity result, and the method specifically comprises the following steps:

if the second Pod IP address ping connectivity state indicates that the Pod IP address cannot be pinged, the second intra-cluster detection agent uses the ping command to test connectivity with the node where the detected Pod is located, the VTEP device of that node, and the cni0 bridge of that node, so as to obtain a third network connectivity result;

the second intra-cluster detection agent using the ping command to test connectivity with the node where the detected Pod is located, the VTEP device of that node, and the cni0 bridge of that node, so as to obtain a third network connectivity result, specifically comprises:

if the IP of the node where the detected Pod is located is unreachable, the node where the detected Pod is located is marked abnormal, and tcpdump packet capture analysis is carried out;

(4) And sending the network connectivity result to the detection scheduling component.

2. The method of claim 1, wherein when the first detection finds that the second Pod IP address ping connectivity state is ping failed, or the second Pod port telnet connectivity state is telnet failed, or the service domain name reachable state is unreachable, the detection scheduling component allocates a new second intra-cluster detection agent, replaces the second intra-cluster detection agent with the new second intra-cluster detection agent, and returns to the step "the second intra-cluster detection agent executes a ping command to the detected Pod and determines a second Pod IP address ping connectivity state of the detected Pod", until the second Pod IP address ping connectivity state, the second Pod port telnet connectivity state, and the service domain name reachable state of the second detection are obtained;

3. The method of claim 1, wherein the out-of-cluster detection agent periodically detects the network communication status of the detected pod with the out-of-cluster service based on the cluster information, comprising:

the out-of-cluster detection agent executes a telnet command or a curl command to access the Ingress domain name associated with the detected pod, so as to obtain the access state of the Ingress domain name;

if the Ingress domain name is accessible, the service is marked as accessible from outside the cluster; if the Ingress domain name is inaccessible, the out-of-cluster detection agent checks the Ingress service state; if the Ingress service state is abnormal, the Ingress is marked abnormal; if the Ingress service state is normal, the out-of-cluster detection agent checks the service and endpoint states; if the service and endpoint states are normal, the Ingress configuration and the DNS service are checked; if the service and endpoint states are abnormal, the service and endpoint are marked abnormal.

4. The method of claim 1, wherein the out-of-cluster detection agent periodically detects the network communication status of the detected pod with the out-of-cluster service based on the cluster information, comprising:

the out-of-cluster detection agent executes a telnet command or a curl command to access the service nodeport associated with the detected pod, so as to obtain the service nodeport reachable state;

if the service nodeport is unreachable, the out-of-cluster detection agent checks the service and endpoint states; if the service and endpoint states are normal, tcpdump packet capture analysis is performed; if the service and endpoint states are abnormal, the service/endpoint is marked abnormal.