Nodes
Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods.
Typically you have several nodes in a cluster; in a learning or resource-limited environment, you might have only one node.
The components on a node include the kubelet, a container runtime, and the kube-proxy.
Management
There are two main ways to have Nodes added to the API server:
- The kubelet on a node self-registers to the control plane
- You (or another human user) manually add a Node object
After you create a Node object, or the kubelet on a node self-registers, the control plane checks whether the new Node object is valid. For example, if you try to create a Node from the following JSON manifest:
{"kind":"Node","apiVersion":"v1","metadata": {"name":"10.240.79.157","labels": {"name":"my-first-k8s-node" } }}
Kubernetes creates a Node object internally (the representation). Kubernetes checks that a kubelet has registered to the API server that matches the metadata.name field of the Node. If the node is healthy (i.e. all necessary services are running), then it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity until it becomes healthy.
Note:
Kubernetes keeps the object for the invalid Node and continues checking to see whether it becomes healthy.
You, or a controller, must explicitly delete the Node object to stop that health checking.
The name of a Node object must be a valid DNS subdomain name.
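As an illustration, you could save the manifest above to a file and create the Node object with kubectl; the file name node.json is just a placeholder:

kubectl apply -f node.json
kubectl get node 10.240.79.157

Until a kubelet with a matching name registers and reports as healthy, the node is ignored for cluster activity, as described above.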
Node name uniqueness
The name identifies a Node. Two Nodes cannot have the same name at the same time. Kubernetes also assumes that a resource with the same name is the same object. In case of a Node, it is implicitly assumed that an instance using the same name will have the same state (e.g. network settings, root disk contents) and attributes like node labels. This may lead to inconsistencies if an instance was modified without changing its name. If the Node needs to be replaced or updated significantly, the existing Node object needs to be removed from the API server first and re-added after the update.
Self-registration of Nodes
When the kubelet flag --register-node is true (the default), the kubelet will attempt to register itself with the API server. This is the preferred pattern, used by most distros.
For self-registration, the kubelet is started with the following options:
- --kubeconfig - Path to credentials to authenticate itself to the API server.
- --cloud-provider - How to talk to a cloud provider to read metadata about itself.
- --register-node - Automatically register with the API server.
- --register-with-taints - Register the node with the given list of taints (comma separated <key>=<value>:<effect>). No-op if register-node is false.
- --node-ip - Optional comma-separated list of the IP addresses for the node. You can only specify a single address for each address family. For example, in a single-stack IPv4 cluster, you set this value to be the IPv4 address that the kubelet should use for the node. See configure IPv4/IPv6 dual stack for details of running a dual-stack cluster. If you don't provide this argument, the kubelet uses the node's default IPv4 address, if any; if the node has no IPv4 addresses then the kubelet uses the node's default IPv6 address.
- --node-labels - Labels to add when registering the node in the cluster (see label restrictions enforced by the NodeRestriction admission plugin).
- --node-status-update-frequency - Specifies how often kubelet posts its node status to the API server.
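Putting these options together, a self-registering kubelet invocation might look like the following sketch. The kubeconfig path, IP address, label, and taint are placeholder values, and real deployments typically set these through the kubelet configuration file or the service unit for your distribution:

kubelet --kubeconfig=/var/lib/kubelet/kubeconfig \
  --register-node=true \
  --node-ip=192.0.2.10 \
  --node-labels=environment=example \
  --register-with-taints=dedicated=experimental:NoSchedule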
When the Node authorization mode and NodeRestriction admission plugin are enabled, kubelets are only authorized to create/modify their own Node resource.
Note:
As mentioned in the Node name uniqueness section, when Node configuration needs to be updated, it is a good practice to re-register the node with the API server. For example, if the kubelet is being restarted with a new set of --node-labels, but the same Node name is used, the change will not take effect, as labels are only set (or modified) upon Node registration with the API server.
Pods already scheduled on the Node may misbehave or cause issues if the Node configuration is changed on kubelet restart. For example, an already running Pod may be tainted against the new labels assigned to the Node, while other Pods that are incompatible with that Pod will be scheduled based on this new label. Node re-registration ensures all Pods will be drained and properly re-scheduled.
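One possible re-registration sequence, sketched with a placeholder node name (how you restart the kubelet depends on how your distribution runs it):

kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>
# then restart the kubelet on that machine with the updated flags;
# with --register-node=true it creates a fresh Node object on startup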
Manual Node administration
You can create and modify Node objects using kubectl.
When you want to create Node objects manually, set the kubelet flag --register-node=false.
You can modify Node objects regardless of the setting of --register-node. For example, you can set labels on an existing Node or mark it unschedulable.
You can set optional node role(s) for nodes by adding one or more node-role.kubernetes.io/<role>: <role> labels to the node where characters of <role> are limited by the syntax rules for labels.
Kubernetes ignores the label value for node roles; by convention, you can set it to the same string you used for the node role in the label key.
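For example, assuming a node named worker-1 and a role named worker (both placeholders), you could add a role label with:

kubectl label node worker-1 node-role.kubernetes.io/worker=worker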
You can use labels on Nodes in conjunction with node selectors on Pods to control scheduling. For example, you can constrain a Pod to only be eligible to run on a subset of the available nodes.
Marking a node as unschedulable prevents the scheduler from placing new pods onto that Node but does not affect existing Pods on the Node. This is useful as a preparatory step before a node reboot or other maintenance.
To mark a Node unschedulable, run:
kubectl cordon $NODENAME
See Safely Drain a Node for more details.
Note:
Pods that are part of a DaemonSet tolerate being run on an unschedulable Node. DaemonSets typically provide node-local services that should run on the Node even if it is being drained of workload applications.
Node status
A Node's status contains the following information:
- Addresses
- Conditions
- Capacity and Allocatable
- Info
You can use kubectl to view a Node's status and other details:
kubectl describe node <insert-node-name-here>
See Node Status for more details.
Node heartbeats
Heartbeats, sent by Kubernetes nodes, help your cluster determine the availability of each node and take action when failures are detected.
For nodes there are two forms of heartbeats:
- Updates to the .status of a Node.
- Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.
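You can inspect a node's Lease to see when its heartbeat was last renewed; the node name below is a placeholder:

kubectl -n kube-node-lease get lease <node-name> -o yaml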
Node controller
The node controller is a Kubernetes control plane component that manages various aspects of nodes.
The node controller has multiple roles in a node's life. The first is assigning a CIDR block to the node when it is registered (if CIDR assignment is turned on).
The second is keeping the node controller's internal list of nodes up to date with the cloud provider's list of available machines. When running in a cloud environment and whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.
The third is monitoring the nodes' health. The node controller is responsible for:
- In the case that a node becomes unreachable, updating the Ready condition in the Node's .status field. In this case the node controller sets the Ready condition to Unknown.
- If a node remains unreachable: triggering API-initiated eviction for all of the Pods on the unreachable node. By default, the node controller waits 5 minutes between marking the node as Unknown and submitting the first eviction request.
By default, the node controller checks the state of each node every 5 seconds. This period can be configured using the --node-monitor-period flag on the kube-controller-manager component.
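For example, you might pass the flag like this (the 10s value is purely illustrative, and other kube-controller-manager flags are omitted):

kube-controller-manager --node-monitor-period=10s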
Rate limits on eviction
In most cases, the node controller limits the eviction rate to --node-eviction-rate (default 0.1) per second, meaning it won't evict pods from more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone becomes unhealthy. The node controller checks what percentage of nodes in the zone are unhealthy (the Ready condition is Unknown or False) at the same time:
- If the fraction of unhealthy nodes is at least --unhealthy-zone-threshold (default 0.55), then the eviction rate is reduced.
- If the cluster is small (i.e. has less than or equal to --large-cluster-size-threshold nodes - default 50), then evictions are stopped.
- Otherwise, the eviction rate is reduced to --secondary-node-eviction-rate (default 0.01) per second.
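As a hypothetical worked example: in a zone of 20 nodes where 12 are unhealthy, the unhealthy fraction is 12/20 = 0.6, which exceeds the default --unhealthy-zone-threshold of 0.55. In a cluster with more than 50 nodes, evictions in that zone then proceed at the secondary rate of 0.01 per second (roughly one node every 100 seconds); in a cluster with 50 or fewer nodes, evictions stop entirely.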
The reason these policies are implemented per availability zone is because one availability zone might become partitioned from the control plane while the others remain connected. If your cluster does not span multiple cloud provider availability zones, then the eviction mechanism does not take per-zone unavailability into account.
A key reason for spreading your nodes across availability zones is so that the workload can be shifted to healthy zones when one entire zone goes down. Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at the normal rate of --node-eviction-rate. The corner case is when all zones are completely unhealthy (none of the nodes in the cluster are healthy). In such a case, the node controller assumes that there is some problem with connectivity between the control plane and the nodes, and doesn't perform any evictions. (If there has been an outage and some nodes reappear, the node controller does evict pods from the remaining nodes that are unhealthy or unreachable.)
The node controller is also responsible for evicting pods running on nodes with NoExecute taints, unless those pods tolerate that taint. The node controller also adds taints corresponding to node problems like node unreachable or not ready. This means that the scheduler won't place Pods onto unhealthy nodes.
Resource capacity tracking
Node objects track information about the Node's resource capacity: for example, the amount of memory available and the number of CPUs. Nodes that self register report their capacity during registration. If you manually add a Node, then you need to set the node's capacity information when you add it.
The Kubernetes scheduler ensures that there are enough resources for all the Pods on a Node. The scheduler checks that the sum of the requests of containers on the node is no greater than the node's capacity. That sum of requests includes all containers managed by the kubelet, but excludes any containers started directly by the container runtime, and also excludes any processes running outside of the kubelet's control.
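To see the capacity and allocatable values the scheduler works with, you can query the Node object directly; the node name is a placeholder:

kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'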
Note:
If you want to explicitly reserve resources for non-Pod processes, see reserve resources for system daemons.
Node topology
Kubernetes v1.27 [stable] (enabled by default: true)
If you have enabled the TopologyManager feature gate, then the kubelet can use topology hints when making resource assignment decisions. See Control Topology Management Policies on a Node for more information.
Swap memory management
Kubernetes v1.30 [beta] (enabled by default: true)
To enable swap on a node, the NodeSwap feature gate must be enabled on the kubelet (default is true), and the --fail-swap-on command line flag or failSwapOn configuration setting must be set to false. To allow Pods to utilize swap, swapBehavior should not be set to NoSwap (which is the default behavior) in the kubelet config.
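As a minimal sketch, a kubelet configuration file meeting these conditions could look like the following (this assumes the kubelet.config.k8s.io/v1beta1 configuration API; adapt it to your existing configuration rather than applying it verbatim):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap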
Warning:
When the memory swap feature is turned on, Kubernetes data such as the content of Secret objects that were written to tmpfs now could be swapped to disk.
A user can also optionally configure memorySwap.swapBehavior in order to specify how a node will use swap memory. For example,
memorySwap:
  swapBehavior: LimitedSwap
- NoSwap (default): Kubernetes workloads will not use swap.
- LimitedSwap: The utilization of swap memory by Kubernetes workloads is subject to limitations. Only Pods of Burstable QoS are permitted to employ swap.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the NoSwap setting.
With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect.
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
- nodeTotalMemory: The total amount of physical memory available on the node.
- totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
- containerMemoryRequest: The container's memory request.
Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable.
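As a hypothetical worked example: on a node with 64 GiB of physical memory and 16 GiB of swap available for Pods, a Burstable container that requests 8 GiB of memory would get a swap limit of (8 / 64) * 16 = 2 GiB.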
It is important to note that, for containers within Burstable QoS Pods, it is possible toopt-out of swap usage by specifying memory requests that are equal to memory limits.Containers configured in this manner will not have access to swap memory.
Swap is supported only with cgroup v2; cgroup v1 is not supported.
For more information, and to assist with testing and provide feedback, please see the blog post about Kubernetes 1.28: NodeSwap graduates to Beta1, KEP-2400 and its design proposal.
What's next
Learn more about the following:
- Components that make up a node.
- API definition for Node.
- Node section of the architecture design document.
- Graceful/non-graceful node shutdown.
- Node autoscaling to manage the number and size of nodes in your cluster.
- Taints and Tolerations.
- Node Resource Managers.
- Resource Management for Windows nodes.