
Monitor Nomad

The Nomad client and server agents collect a wide range of runtime metrics. These metrics are useful for monitoring the health and performance of Nomad clusters. Careful monitoring can spot trends before they cause issues and help debug issues if they arise.

All Nomad agents, both servers and clients, report basic system and Go runtime metrics.

Nomad servers all report many metrics, but some metrics are specific to the leader server. Since leadership may change at any time, these metrics should be monitored on all servers. Missing (or 0) metrics from non-leaders may be safely ignored.

Nomad clients have separate metrics for the host they are running on as well as for each allocation being run. Both of these metrics must be explicitly enabled.

By default, the Nomad agent collects telemetry data at a 1 second interval. Note that Nomad supports gauges, counters, and timers.
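Collection behavior is controlled through the agent's telemetry block. A minimal sketch enabling the host and allocation metrics mentioned above, plus Prometheus-formatted output:

```hcl
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true

  # Host and per-allocation metrics are off by default and must be
  # enabled explicitly.
  publish_node_metrics       = true
  publish_allocation_metrics = true
}
```

Shorter collection intervals give finer-grained data at the cost of more agent overhead and metrics volume.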

There are three ways to obtain metrics from Nomad:

  • Query the /v1/metrics API endpoint to return metrics for the current Nomad process. This endpoint supports Prometheus-formatted metrics.

  • Send the USR1 signal to the Nomad process. This will dump the current telemetry information to STDERR (on Linux).

  • Configure Nomad to automatically forward metrics to a third-party provider such as DataDog, Prometheus, statsd, or Circonus.
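The first option returns the Prometheus text exposition format (for example via `curl http://127.0.0.1:4646/v1/metrics?format=prometheus`; the agent address is an assumption). The sketch below parses a small illustrative sample of that format; it ignores labels, and the sample values do not come from a real agent:

```python
# Illustrative sample of the Prometheus text exposition format, standing in
# for a live response from GET /v1/metrics?format=prometheus.
SAMPLE = """\
# HELP nomad_runtime_alloc_bytes nomad_runtime_alloc_bytes
# TYPE nomad_runtime_alloc_bytes gauge
nomad_runtime_alloc_bytes 4.194304e+06
# TYPE nomad_runtime_num_goroutines gauge
nomad_runtime_num_goroutines 187
"""

def parse_metrics(text):
    """Parse unlabeled Prometheus text-format metrics into a name -> value dict."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE metadata
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(SAMPLE)
print(metrics["nomad_runtime_num_goroutines"])  # → 187.0
```

In practice a Prometheus server scrapes this endpoint directly; a hand-rolled parser like this is only useful for quick ad hoc checks.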

Alerting

The recommended practice for alerting is to leverage the alerting capabilities of your monitoring provider. Nomad's intention is to surface metrics as a scaffold that enables users to configure the necessary alerts in their existing monitoring systems, rather than to support alerting natively. Here are a few common patterns.

  • Export metrics from Nomad to Prometheus using the StatsD exporter, define alerting rules in Prometheus, and use Alertmanager for summarization and routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is supported for Datadog.

  • Periodically submit test jobs into Nomad to determine if your application deployment pipeline is working end-to-end. This pattern is well-suited to batch processing workloads.

  • Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios monitor when a new Nomad job is added. When a job is removed, remove the Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a job-specific alerting system.

  • Write a script that looks at the history of each batch job to determine whether the job is in an unhealthy state, updating your monitoring system as appropriate. In many cases it may be acceptable for a given batch job to fail occasionally, as long as it goes back to passing.
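The test-job pattern above can be sketched as a periodic batch job; the job name, schedule, and image here are illustrative, not part of any Nomad default:

```hcl
# Canary job: if this stops completing successfully, the deployment
# pipeline (scheduler, driver, image pulls) has a problem end-to-end.
job "pipeline-canary" {
  type = "batch"

  periodic {
    cron             = "*/15 * * * *"
    prohibit_overlap = true
  }

  group "canary" {
    task "ping" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "true"
      }
    }
  }
}
```

Pair this with an alert on the canary job's summary metrics so a string of failures pages someone.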

Key performance indicators

Nomad servers' memory, CPU, disk, and network usage all scale linearly with cluster size and scheduling throughput. The most important aspect of ensuring Nomad operates normally is monitoring these system resources to ensure the servers are not encountering resource constraints.

Raft consensus protocol

Nomad uses the Raft consensus protocol for leader election and state replication. Spurious leader elections can be caused by networking issues between the servers, insufficient CPU resources, or insufficient disk IOPS. Users in cloud environments often bump their servers up to the next instance class with improved networking and CPU to stabilize leader elections, or switch to higher-performance disks.

The nomad.raft.leader.lastContact metric is a general indicator of Raft latency which can be used to observe how Raft timing is performing and guide infrastructure provisioning. If this number trends upwards, look at CPU, disk IOPS, and network latency. nomad.raft.leader.lastContact should not get too close to the leader lease timeout of 500ms.
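As a sketch, a Prometheus alerting rule could fire well before the 500ms lease timeout is reached. The alert name and 200ms threshold are illustrative, and the exported metric name and labels depend on your metrics pipeline (this assumes Nomad's built-in Prometheus endpoint, where timers surface as summaries in milliseconds):

```yaml
groups:
  - name: nomad-raft
    rules:
      - alert: NomadRaftLeaderContactSlow
        # 500ms is the leader lease timeout; alert well before reaching it.
        expr: nomad_raft_leader_lastContact{quantile="0.99"} > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Raft leader contact latency is trending toward the 500ms lease timeout."
```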

The nomad.raft.replication.appendEntries metric is an indicator of the time it takes for a Raft transaction to be replicated to a quorum of followers. If this number trends upwards, check the disk I/O on the followers and the network latency between the leader and the followers.

The details of how to examine CPU, IO operations, and networking are specific to your platform and environment. On Linux, the sysstat package contains a number of useful tools. Here are examples to consider.

  • CPU - vmstat 1, cloud provider metrics for "CPU %"

  • IO - iostat, sar -d, cloud provider metrics for "volume write/read ops" and "burst balance"

  • Network - sar -n, netstat -s, cloud provider metrics for interface "allowance"

The nomad.raft.fsm.apply metric is an indicator of the time it takes for a server to apply Raft entries to the internal state machine. If this number trends upwards, look at the nomad.nomad.fsm.* metrics to see if a specific Raft entry is increasing in latency. You can compare this to warn-level logs on the Nomad servers for attempting to apply large raft entry. If a specific type of message appears here, there may be a job with a large job specification or dispatch payload that is increasing the time it takes to apply Raft messages. Try shrinking the size of the job, either by putting distinct task groups into separate jobs, downloading templates instead of embedding them, or reducing the count on task groups.
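One way to download templates instead of embedding them is to pair an artifact block with a template source in the task. The URL, file names, and task name below are illustrative:

```hcl
task "web" {
  # Fetch the template at placement time rather than inlining its contents
  # in the job specification, which keeps the Raft entry for the job small.
  artifact {
    source      = "https://example.com/configs/app.conf.tmpl"
    destination = "local/"
  }

  template {
    source      = "local/app.conf.tmpl"
    destination = "local/app.conf"
  }
}
```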

Scheduling

The Scheduling documentation describes the workflow of how evaluations become scheduled plans and placed allocations.

Progress

There is a class of bug possible in Nomad where the two parts of the scheduling pipeline, the workers and the leader's plan applier, disagree about the validity of a plan. In the pathological case this can cause a job to never finish scheduling, as workers produce the same plan and the plan applier repeatedly rejects it.

While this class of bug is very rare, it can be detected by repeated log lines on the Nomad servers containing plan for node rejected:

nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5

While it is possible for these log lines to occur infrequently due to normal cluster conditions, they should not appear repeatedly and prevent the job from eventually running (look up the evaluation ID logged to find the job).

Plan rejection tracker

Nomad provides a mechanism to track the history of plan rejections per client and mark a client as ineligible if the number goes above a given threshold within a time window. This functionality can be enabled using the plan_rejection_tracker server configuration.
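A sketch of this configuration in the server stanza; the threshold and window values here are illustrative and should be tuned to your cluster:

```hcl
server {
  enabled = true

  plan_rejection_tracker {
    enabled        = true
    # Mark a client ineligible after this many rejections...
    node_threshold = 100
    # ...within this time window.
    node_window    = "5m"
  }
}
```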

When a node is marked as ineligible due to excessive plan rejections, the following node event is registered:

Node marked as ineligible for scheduling due to multiple plan rejections, refer to https://developer.hashicorp.com/nomad/s/port-plan-failure for more information

Along with the log line:

[WARN]  nomad.state_store: marking node as ineligible due to multiple plan rejections: node_id=67af2541-5e96-6f54-9095-11089d627626

If a client is marked as ineligible due to repeated plan rejections, try draining the node and shutting it down. Misconfigurations not caught by validation can cause nodes to enter this state: #11830.

If the plan for node rejected log does appear repeatedly with the same node_id referenced, but the client is not being set as ineligible, you can try adjusting the plan_rejection_tracker configuration of the servers.

Performance

The following metrics allow observing changes in throughput at the various points in the scheduling process.

  • nomad.worker.invoke_scheduler.<type> - The time to run the scheduler of the given type. Each scheduler worker handles one evaluation at a time, entirely in memory. If this metric increases, examine the CPU and memory resources of the scheduler.

  • nomad.blocked_evals.total_blocked - The number of blocked evaluations. Blocked evaluations are created when the scheduler cannot place all allocations as part of a plan. Blocked evaluations will be re-evaluated so that changes in cluster resources can be used for the blocked evaluation's allocations. An increase in blocked evaluations may mean that the cluster's clients are low on resources or that jobs have been submitted that can never have all their allocations placed.

  • nomad.broker.total_unacked - The number of unacknowledged evaluations. When an evaluation has been processed, the worker sends an acknowledgment RPC to the leader to signal to the eval broker that processing is complete. The unacked evals are those that are in flight in the scheduler and have not yet been acknowledged. An increase in unacknowledged evaluations may mean that the schedulers have a large queue of evaluations to process. See the invoke_scheduler metric (above) and examine the CPU and memory resources of the scheduler. Nomad also emits a similar metric for each individual scheduler. For example, nomad.broker.batch_unacked shows the number of unacknowledged evaluations for the batch scheduler.

  • nomad.broker.total_pending - The number of pending evaluations in the eval broker. Nomad processes only one evaluation for a given job concurrently. When an unacked evaluation is acknowledged, Nomad will discard all but the latest evaluation for a job. An increase in this metric may mean that the cluster state is changing more rapidly than the schedulers can keep up with.

  • nomad.plan.evaluate - The time to evaluate a scheduler plan submitted by a worker. This operation happens on the leader to serialize the plans of all the scheduler workers. This happens entirely in memory on the leader. If this metric increases, examine the CPU and memory resources of the leader.

  • nomad.plan.wait_for_index - The time required for the planner to wait for the Raft index of the plan to be processed. If this metric increases, refer to the Consensus Protocol (Raft) section above. If this metric approaches 5 seconds, scheduling operations may fail and be retried. If possible, reduce scheduling load until metrics improve.

  • nomad.plan.submit - The time to submit a scheduler plan from the worker to the leader. This operation requires writing to Raft and includes the time from nomad.plan.evaluate and nomad.plan.wait_for_index (above). If this metric increases, refer to the Consensus Protocol (Raft) section above.

  • nomad.plan.queue_depth - The number of scheduler plans waiting to be evaluated after being submitted. If this metric increases, examine the nomad.plan.evaluate and nomad.plan.submit metrics to determine whether the problem is in general leader resources or Raft performance.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.

Capacity

The importance of monitoring resource availability is workload specific. Batch processing workloads often operate under the assumption that the cluster should be at or near capacity, with queued jobs running as soon as adequate resources become available. Clusters that are primarily responsible for long-running services with an uptime requirement may want to maintain headroom of 20% or more. The following metrics can be used to assess capacity across the cluster on a per-client basis.

  • nomad.client.allocated.cpu
  • nomad.client.unallocated.cpu
  • nomad.client.allocated.disk
  • nomad.client.unallocated.disk
  • nomad.client.allocated.memory
  • nomad.client.unallocated.memory
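As a minimal sketch of using these gauges, the following flags clients whose free memory headroom falls below the 20% target mentioned above. The client names and gauge readings (in MB) are invented for illustration; in practice they would come from nomad.client.allocated.memory and nomad.client.unallocated.memory via your metrics store:

```python
HEADROOM_TARGET = 0.20  # keep at least 20% of schedulable memory unallocated

# Hypothetical per-client gauge readings, in MB.
clients = {
    "client-a": {"allocated": 6144, "unallocated": 2048},
    "client-b": {"allocated": 7680, "unallocated": 512},
}

def headroom(alloc, unalloc):
    """Fraction of total schedulable memory still unallocated."""
    return unalloc / (alloc + unalloc)

low = [name for name, m in clients.items()
       if headroom(m["allocated"], m["unallocated"]) < HEADROOM_TARGET]
print(low)  # → ['client-b']
```

The same calculation applies to the CPU and disk gauge pairs.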

Task resource consumption

The metrics listed here can be used to track resource consumption on a per-task basis. For user-facing services, it is common to alert when CPU usage is at or above the reserved resources for the task.

Job and task status

See Job Summary Metrics for monitoring the health and status of workloads running on Nomad.

Runtime metrics

Runtime metrics apply to all clients and servers. The following metrics are general indicators of load and memory pressure.

  • nomad.runtime.num_goroutines
  • nomad.runtime.heap_objects
  • nomad.runtime.alloc_bytes

It is recommended to alert on upticks in any of the above, server memory usage in particular.

Serf federated deployments

Nomad uses the membership and failure detection capabilities of the Serf library to maintain a single, global gossip pool for all servers in a federated deployment. An uptick in member.flap and/or msg.suspect is a reliable indicator that membership is unstable.

If these metrics increase, look at CPU load on the servers and at network latency and packet loss for the Serf address.

Client introduction

When you configure Nomad with client_introduction.enforcement set to either warn or strict, the server handling the client registration request increments the nomad.client.introduction.enforcement counter each time a new registration is made without an introduction token.

Monitoring this metric can help identify clients that are not registering with an introduction token, which is important when migrating to a stricter enforcement level.
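A sketch of the corresponding configuration, assuming the client_introduction block lives in the server stanza (check the agent configuration reference for exact placement in your version):

```hcl
server {
  enabled = true

  client_introduction {
    # Start with "warn" while monitoring the enforcement counter,
    # then move to "strict" once no untokened registrations remain.
    enforcement = "warn"
  }
}
```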
